Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arxiv preprint arxiv …, 2024 - arxiv.org
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation

X He, D Jiang, G Zhang, M Ku, A Soni, S Siu… - arxiv preprint arxiv …, 2024 - arxiv.org
The recent years have witnessed great advances in video generation. However, the
development of automatic video metrics is lagging significantly behind. None of the existing …

Pyramidal flow matching for efficient video generative modeling

Y **, Z Sun, N Li, K Xu, H Jiang, N Zhuang… - arxiv preprint arxiv …, 2024 - arxiv.org
Video generation requires modeling a vast spatiotemporal space, which demands
significant computational resources and data usage. To reduce the complexity, the …

T2v-turbo-v2: Enhancing video generation model post-training through data, reward, and conditional guidance design

J Li, Q Long, J Zheng, X Gao, R Piramuthu… - arxiv preprint arxiv …, 2024 - arxiv.org
In this paper, we focus on enhancing a diffusion-based text-to-video (T2V) model during the
post-training phase by distilling a highly capable consistency model from a pretrained T2V …

Improving dynamic object interactions in text-to-video generation with ai feedback

H Furuta, H Zen, D Schuurmans, A Faust… - arxiv preprint arxiv …, 2024 - arxiv.org
Large text-to-video models hold immense potential for a wide range of downstream
applications. However, these models struggle to accurately depict dynamic object …

From slow bidirectional to fast causal video generators

T Yin, Q Zhang, R Zhang, WT Freeman… - arxiv preprint arxiv …, 2024 - arxiv.org
Current video diffusion models achieve impressive generation quality but struggle in
interactive applications due to bidirectional attention dependencies. The generation of a …

Videodpo: Omni-preference alignment for video diffusion generation

R Liu, H Wu, Z Ziqiang, C Wei, Y He, R Pi… - arxiv preprint arxiv …, 2024 - arxiv.org
Recent progress in generative diffusion models has greatly advanced text-to-video
generation. While text-to-video models trained on large-scale, diverse datasets can produce …

DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization

Z Ding, C **, D Liu, H Zheng, KK Singh… - arxiv preprint arxiv …, 2024 - arxiv.org
Diffusion probabilistic models have shown significant progress in video generation;
however, their computational efficiency is limited by the large number of sampling steps …

Onlinevpo: Align video diffusion model with online video-centric preference optimization

J Zhang, J Wu, W Chen, Y Ji, X **ao, W Huang… - arxiv preprint arxiv …, 2024 - arxiv.org
In recent years, the field of text-to-video (T2V) generation has made significant strides.
Despite this progress, there is still a gap between theoretical advancements and practical …

Lift: Leveraging human feedback for text-to-video model alignment

Y Wang, Z Tan, J Wang, X Yang, C **, H Li - arxiv preprint arxiv …, 2024 - arxiv.org
Recent advancements in text-to-video (T2V) generative models have shown impressive
capabilities. However, these models are still inadequate in aligning synthesized videos with …