Emu3: Next-token prediction is all you need
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …
VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation
The recent years have witnessed great advances in video generation. However, the
development of automatic video metrics is lagging significantly behind. None of the existing …
Pyramidal flow matching for efficient video generative modeling
Video generation requires modeling a vast spatiotemporal space, which demands
significant computational resources and data usage. To reduce the complexity, the …
T2V-Turbo-v2: Enhancing video generation model post-training through data, reward, and conditional guidance design
In this paper, we focus on enhancing a diffusion-based text-to-video (T2V) model during the
post-training phase by distilling a highly capable consistency model from a pretrained T2V …
Improving dynamic object interactions in text-to-video generation with AI feedback
Large text-to-video models hold immense potential for a wide range of downstream
applications. However, these models struggle to accurately depict dynamic object …
VideoDPO: Omni-preference alignment for video diffusion generation
Recent progress in generative diffusion models has greatly advanced text-to-video
generation. While text-to-video models trained on large-scale, diverse datasets can produce …
From slow bidirectional to fast causal video generators
Current video diffusion models achieve impressive generation quality but struggle in
interactive applications due to bidirectional attention dependencies. The generation of a …
OnlineVPO: Align video diffusion model with online video-centric preference optimization
In recent years, the field of text-to-video (T2V) generation has made significant strides.
Despite this progress, there is still a gap between theoretical advancements and practical …
LiFT: Leveraging human feedback for text-to-video model alignment
Recent advancements in text-to-video (T2V) generative models have shown impressive
capabilities. However, these models are still inadequate in aligning synthesized videos with …
BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations
Existing video generation models struggle to follow complex text prompts and synthesize
multiple objects, raising the need for additional grounding input for improved controllability …