Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various
vision and language tasks, particularly excelling in generating flexible photorealistic images …
OmniGen: Unified Image Generation
In this work, we introduce OmniGen, a new diffusion model for unified image generation.
Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires …
VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control
Modern text-to-video synthesis models demonstrate coherent, photorealistic generation of
complex videos from a text description. However, most existing models lack fine-grained …
ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-Lapse Video Generation
We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to
evaluate the temporal and metamorphic capabilities of T2V models (e.g., Sora and …
MARS: Mixture of Auto-Regressive Models for Fine-Grained Text-to-Image Synthesis
Auto-regressive models have made significant progress in the realm of language
generation, yet they do not perform on par with diffusion models in the domain of image …
Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices
Z Ma, Y Zhang, G Jia, L Zhao, Y Ma, M Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
As one of the most popular and sought-after generative models of recent years, diffusion
models have sparked the interest of many researchers and steadily shown excellent …
VEnhancer: Generative Space-Time Enhancement for Video Generation
We present VEnhancer, a generative space-time enhancement framework that improves
existing text-to-video results by adding more detail in the spatial domain and synthetic detailed …
MonoFormer: One Transformer for Both Diffusion and Autoregression
Most existing multimodality methods use separate backbones for autoregression-based
discrete text generation and diffusion-based continuous visual generation, or the same …
Scaling Diffusion Transformers to 16 Billion Parameters
In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer, that is
scalable and competitive with dense networks while exhibiting highly optimized inference …
MarDini: Masked Autoregressive Diffusion for Video Generation at Scale
We introduce MarDini, a new family of video diffusion models that integrate the advantages
of masked auto-regression (MAR) into a unified diffusion model (DM) framework. Here, MAR …