In-Context LoRA for Diffusion Transformers
Recent research (arXiv:2410.15027) has explored the use of diffusion transformers (DiTs) for
task-agnostic image generation by simply concatenating attention tokens across images …
IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation
While Text-to-Image (T2I) diffusion models excel at generating visually appealing images of
individual instances, they struggle to accurately position and control the feature generation …
Bigger is not always better: Scaling properties of latent diffusion models
We study the scaling properties of latent diffusion models (LDMs) with an emphasis on their
sampling efficiency. While improved network architecture and inference algorithms have …
GRID: Visual Layout Generation
In this paper, we introduce GRID, a novel paradigm that reframes a broad range of visual
generation tasks as the problem of arranging grids, akin to film strips. At its core, GRID …
Sparkle: Mastering basic spatial capabilities in vision language models elicits generalization to composite spatial reasoning
Vision language models (VLMs) have demonstrated impressive performance across a wide
range of downstream tasks. However, their proficiency in spatial reasoning remains limited …
Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming
DALL-E and Sora have gained attention by producing implausible images, such as
"astronauts riding a horse in space." Despite the proliferation of text-to-vision models that …
ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers
L Huang, W Wang, ZF Wu, Y Shi, C Liang… - arXiv preprint arXiv: …, 2024 - arxiv.org
Recent research (arXiv:2410.15027, arXiv:2410.23775) has highlighted the inherent
in-context generation capabilities of pretrained diffusion transformers (DiTs), enabling them to …
ε-VAE: Denoising as Visual Decoding
In generative modeling, tokenization simplifies complex data into compact, structured
representations, creating a more efficient, learnable space. For high-dimensional visual …
One Diffusion to Generate Them All
We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports
bidirectional image synthesis and understanding across diverse tasks. It enables conditional …
FreeSim: Toward Free-viewpoint Camera Simulation in Driving Scenes
We propose FreeSim, a camera simulation method for autonomous driving. FreeSim
emphasizes high-quality rendering from viewpoints beyond the recorded ego trajectories. In …