A survey on video diffusion models

Z Xing, Q Feng, H Chen, Q Dai, H Hu, H Xu… - ACM Computing …, 2024 - dl.acm.org
The recent wave of AI-generated content (AIGC) has witnessed substantial success in
computer vision, with the diffusion model playing a crucial role in this achievement. Due to …

Is Sora a world simulator? A comprehensive survey on general world models and beyond

Z Zhu, X Wang, W Zhao, C Min, N Deng, M Dou… - arXiv preprint arXiv …, 2024 - arxiv.org
General world models represent a crucial pathway toward achieving Artificial General
Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual …

ShareGPT4Video: Improving video understanding and generation with better captions

L Chen, X Wei, J Li, X Dong, P Zhang, Y Zang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present the ShareGPT4Video series, aiming to facilitate the video understanding of large
video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) …

MiraData: A large-scale video dataset with long durations and structured captions

X Ju, Y Gao, Z Zhang, Z Yuan… - Advances in …, 2025 - proceedings.neurips.cc
Sora's high-motion intensity and long consistent videos have significantly impacted the field
of video generation, attracting unprecedented attention. However, existing publicly available …

Streaming long video understanding with large language models

R Qian, X Dong, P Zhang, Y Zang… - Advances in …, 2025 - proceedings.neurips.cc
This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for
video understanding, that capably understands arbitrary-length video with a constant …

HourVideo: 1-hour video-language understanding

K Chandrasegaran, A Gupta… - Advances in …, 2025 - proceedings.neurips.cc
We present HourVideo, a benchmark dataset for hour-long video-language understanding.
Our dataset consists of a novel task suite comprising summarization, perception (recall …

Video instruction tuning with synthetic data

Y Zhang, J Wu, W Li, B Li, Z Ma, Z Liu, C Li - arXiv preprint arXiv …, 2024 - arxiv.org
The development of video large multimodal models (LMMs) has been hindered by the
difficulty of curating large amounts of high-quality raw data from the web. To address this, we …

Open-Sora: Democratizing efficient video production for all

Z Zheng, X Peng, T Yang, C Shen, S Li, H Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision and language are the two foundational senses for humans, and they build up our
cognitive ability and intelligence. While significant breakthroughs have been made in AI …

Kangaroo: A powerful video-language model supporting long-context video input

J Liu, Y Wang, H Ma, X Wu, X Ma, X Wei, J Jiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Rapid advancements have been made in extending Large Language Models (LLMs) to
Large Multi-modal Models (LMMs). However, extending input modality of LLMs to video data …

A survey of multimodal large language models from a data-centric perspective

T Bai, H Liang, B Wan, Y Xu, X Li, S Li, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) enhance the capabilities of standard large
language models by integrating and processing data from multiple modalities, including text …