Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arxiv preprint arxiv …, 2024 - arxiv.org
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation

X He, D Jiang, G Zhang, M Ku, A Soni, S Siu… - arxiv preprint arxiv …, 2024 - arxiv.org
The recent years have witnessed great advances in video generation. However, the
development of automatic video metrics is lagging significantly behind. None of the existing …

Pyramidal flow matching for efficient video generative modeling

Y Jin, Z Sun, N Li, K Xu, H Jiang, N Zhuang… - arxiv preprint arxiv …, 2024 - arxiv.org
Video generation requires modeling a vast spatiotemporal space, which demands
significant computational resources and data usage. To reduce the complexity, the …

T2V-Turbo-v2: Enhancing video generation model post-training through data, reward, and conditional guidance design

J Li, Q Long, J Zheng, X Gao, R Piramuthu… - arxiv preprint arxiv …, 2024 - arxiv.org
In this paper, we focus on enhancing a diffusion-based text-to-video (T2V) model during the
post-training phase by distilling a highly capable consistency model from a pretrained T2V …

Improving dynamic object interactions in text-to-video generation with AI feedback

H Furuta, H Zen, D Schuurmans, A Faust… - arxiv preprint arxiv …, 2024 - arxiv.org
Large text-to-video models hold immense potential for a wide range of downstream
applications. However, these models struggle to accurately depict dynamic object …

VideoDPO: Omni-preference alignment for video diffusion generation

R Liu, H Wu, Z Ziqiang, C Wei, Y He, R Pi… - arxiv preprint arxiv …, 2024 - arxiv.org
Recent progress in generative diffusion models has greatly advanced text-to-video
generation. While text-to-video models trained on large-scale, diverse datasets can produce …

From slow bidirectional to fast causal video generators

T Yin, Q Zhang, R Zhang, WT Freeman… - arxiv preprint arxiv …, 2024 - arxiv.org
Current video diffusion models achieve impressive generation quality but struggle in
interactive applications due to bidirectional attention dependencies. The generation of a …

OnlineVPO: Align video diffusion model with online video-centric preference optimization

J Zhang, J Wu, W Chen, Y Ji, X Xiao, W Huang… - arxiv preprint arxiv …, 2024 - arxiv.org
In recent years, the field of text-to-video (T2V) generation has made significant strides.
Despite this progress, there is still a gap between theoretical advancements and practical …

LiFT: Leveraging human feedback for text-to-video model alignment

Y Wang, Z Tan, J Wang, X Yang, C Jin, H Li - arxiv preprint arxiv …, 2024 - arxiv.org
Recent advancements in text-to-video (T2V) generative models have shown impressive
capabilities. However, these models are still inadequate in aligning synthesized videos with …

BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations

W Feng, C Liu, S Liu, WY Wang, A Vahdat… - arxiv preprint arxiv …, 2025 - arxiv.org
Existing video generation models struggle to follow complex text prompts and synthesize
multiple objects, raising the need for additional grounding input for improved controllability …