Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity

H Xi, S Yang, Y Zhao, C Xu, M Li, X Li, Y Lin… - arXiv preprint arXiv …, 2025 - arxiv.org
Diffusion Transformers (DiTs) dominate video generation, but their high computational cost
severely limits real-world applicability, usually requiring tens of minutes to generate a few …
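
The entry's title points to spatial-temporal sparsity in attention, while the snippet only states the cost problem. As a rough illustration of the general idea (not Sparse VideoGen's actual algorithm), the NumPy sketch below restricts each video token's attention to a temporal window of neighboring frames; the frame layout, window size, and function name are assumptions made for the example.

import numpy as np

def temporal_window_attention(q, k, v, num_frames, window=1):
    # Toy sparsity pattern: each token attends only to tokens whose frame index
    # differs by at most `window`. q, k, v: (num_frames * tokens_per_frame, d).
    n, d = q.shape
    tokens_per_frame = n // num_frames
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)      # (n,)

    scores = q @ k.T / np.sqrt(d)                                      # dense (n, n) scores, for clarity only
    allowed = np.abs(frame_id[:, None] - frame_id[None, :]) <= window  # boolean sparsity mask
    scores = np.where(allowed, scores, -np.inf)                        # disallowed positions get zero weight

    scores -= scores.max(axis=-1, keepdims=True)                       # stable softmax over allowed positions
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Example: 4 frames x 8 tokens per frame, 16-dim features.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((32, 16)) for _ in range(3))
print(temporal_window_attention(q, k, v, num_frames=4, window=1).shape)  # (32, 16)

Because each query only needs scores against tokens in nearby frames, a blocked implementation of this mask can skip most of the n x n score matrix; the dense-then-mask version here is kept only for readability.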

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

J Yao, X Wang - arXiv preprint arXiv:2501.01423, 2025 - arxiv.org
Latent diffusion models with Transformer architectures excel at generating high-fidelity
images. However, recent studies reveal an optimization dilemma in this two-stage design …

SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training

D Hu, J Chen, X Huang, H Coskun, A Sahni… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing text-to-image (T2I) diffusion models face several limitations, including large model
sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to …

SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer

E Xie, J Chen, Y Zhao, J Yu, L Zhu, Y Lin… - arXiv preprint arXiv …, 2025 - arxiv.org
This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-image
generation. Building upon SANA-1.0, we introduce three key innovations: (1) Efficient …

TinyFusion: Diffusion Transformers Learned Shallow

G Fang, K Li, X Ma, X Wang - arXiv preprint arXiv:2412.01199, 2024 - arxiv.org
Diffusion Transformers have demonstrated remarkable capabilities in image generation but
often come with excessive parameterization, resulting in considerable inference overhead in …

LiT: Delving into a Simplified Linear Diffusion Transformer for Image Generation

J Wang, N Kang, L Yao, M Chen, C Wu… - arXiv preprint arXiv …, 2025 - arxiv.org
Among commonly used sub-quadratic-complexity modules, linear attention benefits from
simplicity and high parallelism, making it promising for image synthesis tasks. However, the …
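
Since the snippet hinges on linear attention's sub-quadratic cost, a minimal NumPy comparison may help: kernelized linear attention replaces the n x n softmax with per-feature summaries, dropping the cost from O(n^2 d) to O(n d^2). The elu(x)+1 feature map and single-head layout below are illustrative assumptions, not LiT's exact formulation.

import numpy as np

def softmax_attention(q, k, v):
    # Standard attention: materializes an (n, n) score matrix -> O(n^2 * d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def elu_plus_one(x):
    # Positive feature map; np.minimum avoids overflow in exp for large inputs.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v, eps=1e-6):
    # Kernelized linear attention:
    #   out_i = phi(q_i) @ (sum_j phi(k_j) v_j^T) / (phi(q_i) @ sum_j phi(k_j))
    q, k = elu_plus_one(q), elu_plus_one(k)
    kv = k.T @ v                      # (d, d_v): summed over the sequence once
    z = k.sum(axis=0)                 # (d,): normalizer terms
    return (q @ kv) / (q @ z + eps)[:, None]

# Both return an (n, d_v) output; only the softmax version builds the n x n matrix.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((512, 64)) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)

The key step is regrouping (phi(Q) phi(K)^T) V as phi(Q) (phi(K)^T V), which is what makes the cost linear in sequence length.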

Improving the Diffusability of Autoencoders

I Skorokhodov, S Girish, B Hu, W Menapace… - arXiv preprint arXiv …, 2025 - arxiv.org
Latent diffusion models have emerged as the leading approach for generating high-quality
images and videos, utilizing compressed latent representations to reduce the computational …

SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device

Y Wu, Z Zhang, Y Li, Y Xu, A Kag, Y Sui… - arXiv preprint arXiv …, 2024 - arxiv.org
We have witnessed the unprecedented success of diffusion-based video generation over
the past year. Recently proposed models from the community have wielded the power to …

Mimir: Improving Video Diffusion Models for Precise Text Understanding

S Tan, B Gong, Y Feng, K Zheng, D Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Text serves as the key control signal in video generation due to its narrative nature. To
render text descriptions into video clips, current video diffusion models borrow features from …

Magic 1-For-1: Generating One Minute Video Clips within One Minute

H Yi, S Shao, T Ye, J Zhao, Q Yin, M Lingelbach… - arXiv preprint arXiv …, 2025 - arxiv.org
In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation
model with optimized memory consumption and inference latency. The key idea is simple …