PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

J Chen, C Ge, E Xie, Y Wu, L Yao, X Ren… - … on Computer Vision, 2024 - Springer
In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly
generating images at 4K resolution. PixArt-Σ represents a significant advancement over its …

DiffiT: Diffusion vision transformers for image generation

A Hatamizadeh, J Song, G Liu, J Kautz… - European Conference on …, 2024 - Springer
Diffusion models, with their powerful expressivity and high sample quality, have achieved
State-Of-The-Art (SOTA) performance in the generative domain. The pioneering Vision …

Representation alignment for generation: Training diffusion transformers is easier than you think

S Yu, S Kwak, H Jang, J Jeong, J Huang, J Shin… - arxiv preprint arxiv …, 2024 - arxiv.org
Recent studies have shown that the denoising process in (generative) diffusion models can
induce meaningful (discriminative) representations inside the model, though the quality of …

Efficient diffusion transformer with step-wise dynamic attention mediators

Y Pu, Z Xia, J Guo, D Han, Q Li, D Li, Y Yuan… - … on Computer Vision, 2024 - Springer
This paper identifies significant redundancy in the query-key interactions within self-attention
mechanisms of diffusion transformer models, particularly during the early stages of …

Multi-layer transformers gradient can be approximated in almost linear time

Y Liang, Z Sha, Z Shi, Z Song, Y Zhou - arxiv preprint arxiv:2408.13233, 2024 - arxiv.org
The computational complexity of the self-attention mechanism in popular transformer
architectures poses significant challenges for training and inference, and becomes the …

Mora: Enabling generalist video generation via a multi-agent framework

Z Yuan, Y Liu, Y Cao, W Sun, H Jia, R Chen… - arxiv preprint arxiv …, 2024 - arxiv.org
Text-to-video generation has made significant strides, but replicating the capabilities of
advanced systems like OpenAI Sora remains challenging due to their closed-source nature …

On statistical rates and provably efficient criteria of latent diffusion transformers (DiTs)

JYC Hu, W Wu, Z Song, H Liu - arxiv preprint arxiv:2407.01079, 2024 - arxiv.org
We investigate the statistical and computational limits of latent Diffusion Transformers (DiTs)
under the low-dimensional linear latent space assumption. Statistically, we study the …

Deep compression autoencoder for efficient high-resolution diffusion models

J Chen, H Cai, J Chen, E Xie, S Yang, H Tang… - arxiv preprint arxiv …, 2024 - arxiv.org
We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models
for accelerating high-resolution diffusion models. Existing autoencoder models have …

Efficient diffusion models: A comprehensive survey from principles to practices

Z Ma, Y Zhang, G Jia, L Zhao, Y Ma, M Ma… - arxiv preprint arxiv …, 2024 - arxiv.org
As one of the most popular and sought-after generative models in recent years, diffusion
models have sparked the interest of many researchers and steadily shown excellent …

Randomized autoregressive visual generation

Q Yu, J He, X Deng, X Shen, LC Chen - arxiv preprint arxiv:2411.00776, 2024 - arxiv.org
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation,
which sets a new state-of-the-art performance on the image generation task while …