In-Context LoRA for Diffusion Transformers

L Huang, W Wang, ZF Wu, Y Shi, H Dou… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent research (arXiv:2410.15027) has explored the use of diffusion transformers (DiTs) for
task-agnostic image generation by simply concatenating attention tokens across images …

IFAdapter: Instance feature control for grounded text-to-image generation

Y Wu, X Zhou, B Ma, X Su, K Ma, X Wang - arXiv preprint arXiv …, 2024 - arxiv.org
While Text-to-Image (T2I) diffusion models excel at generating visually appealing images of
individual instances, they struggle to accurately position and control the feature generation …

Bigger is not always better: Scaling properties of latent diffusion models

K Mei, Z Tu, M Delbracio, H Talebi… - … on Machine Learning …, 2024 - openreview.net
We study the scaling properties of latent diffusion models (LDMs) with an emphasis on their
sampling efficiency. While improved network architecture and inference algorithms have …

GRID: Visual Layout Generation

C Wan, X Luo, Z Cai, Y Song, Y Zhao, Y Bai… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce GRID, a novel paradigm that reframes a broad range of visual
generation tasks as the problem of arranging grids, akin to film strips. At its core, GRID …

Sparkle: Mastering basic spatial capabilities in vision language models elicits generalization to composite spatial reasoning

Y Tang, A Qu, Z Wang, D Zhuang, Z Wu, W Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision language models (VLMs) have demonstrated impressive performance across a wide
range of downstream tasks. However, their proficiency in spatial reasoning remains limited …

Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming

Z Gao, W Huang, J Zhang, A Kembhavi… - arXiv preprint arXiv …, 2024 - arxiv.org
DALL-E and Sora have gained attention by producing implausible images, such as
"astronauts riding a horse in space." Despite the proliferation of text-to-vision models that …

ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers

L Huang, W Wang, ZF Wu, Y Shi, C Liang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent research (arXiv:2410.15027, arXiv:2410.23775) has highlighted the inherent in-context
generation capabilities of pretrained diffusion transformers (DiTs), enabling them to …

ε-VAE: Denoising as Visual Decoding

L Zhao, S Woo, Z Wan, Y Li, H Zhang, B Gong… - arXiv preprint arXiv …, 2024 - arxiv.org
In generative modeling, tokenization simplifies complex data into compact, structured
representations, creating a more efficient, learnable space. For high-dimensional visual …

One Diffusion to Generate Them All

DH Le, T Pham, S Lee, C Clark, A Kembhavi… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports
bidirectional image synthesis and understanding across diverse tasks. It enables conditional …

FreeSim: Toward Free-viewpoint Camera Simulation in Driving Scenes

L Fan, H Zhang, Q Wang, H Li, Z Zhang - arXiv preprint arXiv:2412.03566, 2024 - arxiv.org
We propose FreeSim, a camera simulation method for autonomous driving. FreeSim
emphasizes high-quality rendering from viewpoints beyond the recorded ego trajectories. In …