Open-sora: Democratizing efficient video production for all
Vision and language are the two foundational senses for humans, and they build up our
cognitive ability and intelligence. While significant breakthroughs have been made in AI …
cognitive ability and intelligence. While significant breakthroughs have been made in AI …
Emu3: Next-token prediction is all you need
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …
Dreamlip: Language-image pre-training with long captions
Abstract Language-image pre-training largely relies on how precisely and thoroughly a text
describes its paired image. In practice, however, the contents of an image can be so rich that …
describes its paired image. In practice, however, the contents of an image can be so rich that …
Miradata: A large-scale video dataset with long durations and structured captions
Sora's high-motion intensity and long consistent videos have significantly impacted the field
of video generation, attracting unprecedented attention. However, existing publicly available …
of video generation, attracting unprecedented attention. However, existing publicly available …
Lumina-mgpt: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various
vision and language tasks, particularly excelling in generating flexible photorealistic images …
vision and language tasks, particularly excelling in generating flexible photorealistic images …
Representation alignment for generation: Training diffusion transformers is easier than you think
Recent studies have shown that the denoising process in (generative) diffusion models can
induce meaningful (discriminative) representations inside the model, though the quality of …
induce meaningful (discriminative) representations inside the model, though the quality of …
Efficient diffusion transformer with step-wise dynamic attention mediators
This paper identifies significant redundancy in the query-key interactions within self-attention
mechanisms of diffusion transformer models, particularly during the early stages of …
mechanisms of diffusion transformer models, particularly during the early stages of …
Deep compression autoencoder for efficient high-resolution diffusion models
We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models
for accelerating high-resolution diffusion models. Existing autoencoder models have …
for accelerating high-resolution diffusion models. Existing autoencoder models have …
Svdqunat: Absorbing outliers by low-rank components for 4-bit diffusion models
Diffusion models have been proven highly effective at generating high-quality images.
However, as these models grow larger, they require significantly more memory and suffer …
However, as these models grow larger, they require significantly more memory and suffer …
Open-sora plan: Open-source large video generation model
We introduce Open-Sora Plan, an open-source project that aims to contribute a large
generation model for generating desired high-resolution videos with long durations based …
generation model for generating desired high-resolution videos with long durations based …