Lumina-mGPT: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining

D Liu, S Zhao, L Zhuo, W Lin, Y Qiao, H Li… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various
vision and language tasks, particularly excelling in generating flexible photorealistic images …

OmniGen: Unified image generation

S Xiao, Y Wang, J Zhou, H Yuan, X Xing, R Yan… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce OmniGen, a new diffusion model for unified image generation.
Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires …

VD3D: Taming large video diffusion transformers for 3D camera control

S Bahmani, I Skorokhodov, A Siarohin… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern text-to-video synthesis models demonstrate coherent, photorealistic generation of
complex videos from a text description. However, most existing models lack fine-grained …

ChronoMagic-Bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation

S Yuan, J Huang, Y Xu, Y Liu, S Zhang, Y Shi… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to
evaluate the temporal and metamorphic capabilities of T2V models (e.g., Sora and …

MARS: Mixture of auto-regressive models for fine-grained text-to-image synthesis

W He, S Fu, M Liu, X Wang, W Xiao, F Shu… - arXiv preprint arXiv …, 2024 - arxiv.org
Auto-regressive models have made significant progress in the realm of language
generation, yet they do not perform on par with diffusion models in the domain of image …

Efficient diffusion models: A comprehensive survey from principles to practices

Z Ma, Y Zhang, G Jia, L Zhao, Y Ma, M Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
As one of the most popular and sought-after generative models in recent years, diffusion
models have sparked the interest of many researchers and steadily shown excellent …

VEnhancer: Generative space-time enhancement for video generation

J He, T Xue, D Liu, X Lin, P Gao, D Lin, Y Qiao… - arXiv preprint arXiv …, 2024 - arxiv.org
We present VEnhancer, a generative space-time enhancement framework that improves
existing text-to-video results by adding more detail in the spatial domain and synthetic detailed …

MonoFormer: One transformer for both diffusion and autoregression

C Zhao, Y Song, W Wang, H Feng, E Ding… - arXiv preprint arXiv …, 2024 - arxiv.org
Most existing multimodality methods use separate backbones for autoregression-based
discrete text generation and diffusion-based continuous visual generation, or the same …

Scaling diffusion transformers to 16 billion parameters

Z Fei, M Fan, C Yu, D Li, J Huang - arXiv preprint arXiv:2407.11633, 2024 - arxiv.org
In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer that is
scalable and competitive with dense networks while exhibiting highly optimized inference …

MarDini: Masked autoregressive diffusion for video generation at scale

H Liu, S Liu, Z Zhou, M Xu, Y Xie, X Han… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce MarDini, a new family of video diffusion models that integrate the advantages
of masked auto-regression (MAR) into a unified diffusion model (DM) framework. Here, MAR …