VideoAgent: A Memory-Augmented Multimodal Agent for Video Understanding

Y Fan, X Ma, R Wu, Y Du, J Li, Z Gao, Q Li - European Conference on …, 2024 - Springer
We explore how reconciling several foundation models (large language models and vision-
language models) with a novel unified memory mechanism could tackle the challenging …

Egoplan-bench2: A benchmark for multimodal large language model planning in real-world scenarios

L Qiu, Y Ge, Y Chen, Y Ge, Y Shan, X Liu - arXiv preprint arXiv …, 2024 - arxiv.org
The advent of Multimodal Large Language Models, leveraging the power of Large
Language Models, has recently demonstrated superior multimodal understanding and …

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

K Hu, P Wu, F Pu, W Xiao, Y Zhang, X Yue, B Li… - arXiv preprint arXiv …, 2025 - arxiv.org
Humans acquire knowledge through three cognitive stages: perceiving information,
comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve …

LongViTU: Instruction Tuning for Long-Form Video Understanding

R Wu, X Ma, H Ci, Y Fan, Y Wang, H Zhao, Q Li… - arXiv preprint arXiv …, 2025 - arxiv.org
This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h videos),
automatically generated dataset for long-form video understanding. We developed a …