An overview of diffusion models: Applications, guided generation, statistical rates and optimization

M Chen, S Mei, J Fan, M Wang - arxiv preprint arxiv:2404.07771, 2024 - arxiv.org
Diffusion models, a powerful and universal generative AI technology, have achieved
tremendous success in computer vision, audio, reinforcement learning, and computational …

The Llama 3 herd of models

A Dubey, A Jauhri, A Pandey, A Kadian… - arxiv preprint arxiv …, 2024 - arxiv.org
Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …

LaVie: High-quality video generation with cascaded latent diffusion models

Y Wang, X Chen, X Ma, S Zhou, Z Huang… - International Journal of …, 2024 - Springer
This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a
pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task …

Opportunities and challenges of diffusion models for generative AI

M Chen, S Mei, J Fan, M Wang - National Science Review, 2024 - academic.oup.com
Diffusion models, a powerful and universal generative artificial intelligence technology, have
achieved tremendous success and opened up new possibilities in diverse applications. In …

Fast high-resolution image synthesis with latent adversarial diffusion distillation

A Sauer, F Boesel, T Dockhorn, A Blattmann… - SIGGRAPH Asia 2024 …, 2024 - dl.acm.org
Diffusion models are the main driver of progress in image and video synthesis, but suffer
from slow inference speed. Distillation methods, like the recently introduced adversarial …

MiraData: A large-scale video dataset with long durations and structured captions

X Ju, Y Gao, Z Zhang, Z Yuan… - Advances in …, 2025 - proceedings.neurips.cc
Sora's high-motion intensity and long consistent videos have significantly impacted the field
of video generation, attracting unprecedented attention. However, existing publicly available …

Show-o: One single transformer to unify multimodal understanding and generation

J Xie, W Mao, Z Bai, DJ Zhang, W Wang, KQ Lin… - arxiv preprint arxiv …, 2024 - arxiv.org
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and
generation. Unlike fully autoregressive models, Show-o unifies autoregressive and …

Discrete flow matching

I Gat, T Remez, N Shaul, F Kreuk… - Advances in …, 2025 - proceedings.neurips.cc
Despite Flow Matching and diffusion models having emerged as powerful
generative paradigms for continuous variables such as images and videos, their application …

Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arxiv preprint arxiv …, 2024 - arxiv.org
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

On statistical rates and provably efficient criteria of latent diffusion transformers (DiTs)

JYC Hu, W Wu, Z Li, S Pi, Z Song… - Advances in Neural …, 2025 - proceedings.neurips.cc
We investigate the statistical and computational limits of latent Diffusion Transformers (DiTs)
under the low-dimensional linear latent space assumption. Statistically, we study the …