In-Context LoRA for Diffusion Transformers
Recent research (arXiv:2410.15027) has explored the use of diffusion transformers (DiTs) for
task-agnostic image generation by simply concatenating attention tokens across images …
IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation
While Text-to-Image (T2I) diffusion models excel at generating visually appealing images of
individual instances, they struggle to accurately position and control the feature generation …
Bigger is not always better: Scaling properties of latent diffusion models
We study the scaling properties of latent diffusion models (LDMs) with an emphasis on their
sampling efficiency. While improved network architecture and inference algorithms have …
GRID: Visual Layout Generation
In this paper, we introduce GRID, a novel paradigm that reframes a broad range of visual
generation tasks as the problem of arranging grids, akin to film strips. At its core, GRID …
Sparkle: Mastering basic spatial capabilities in vision language models elicits generalization to composite spatial reasoning
Vision language models (VLMs) have demonstrated impressive performance across a wide
range of downstream tasks. However, their proficiency in spatial reasoning remains limited …
Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming
DALL-E and Sora have gained attention by producing implausible images, such as
"astronauts riding a horse in space." Despite the proliferation of text-to-vision models that …
ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers
L Huang, W Wang, ZF Wu, Y Shi, C Liang… - arXiv preprint arXiv: …, 2024 - arxiv.org
Recent research (arXiv:2410.15027, arXiv:2410.23775) has highlighted the inherent
in-context generation capabilities of pretrained diffusion transformers (DiTs), enabling them to …
ε-VAE: Denoising as Visual Decoding
In generative modeling, tokenization simplifies complex data into compact, structured
representations, creating a more efficient, learnable space. For high-dimensional visual …
One Diffusion to Generate Them All
We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports
bidirectional image synthesis and understanding across diverse tasks. It enables conditional …
FreeSim: Toward Free-viewpoint Camera Simulation in Driving Scenes
We propose FreeSim, a camera simulation method for autonomous driving. FreeSim
emphasizes high-quality rendering from viewpoints beyond the recorded ego trajectories. In …