VideoAgent: A Memory-Augmented Multimodal Agent for Video Understanding

Y Fan, X Ma, R Wu, Y Du, J Li, Z Gao, Q Li - European Conference on …, 2024 - Springer
We explore how reconciling several foundation models (large language models and vision-
language models) with a novel unified memory mechanism could tackle the challenging …

Egoplan-bench2: A benchmark for multimodal large language model planning in real-world scenarios

L Qiu, Y Ge, Y Chen, Y Ge, Y Shan, X Liu - arXiv preprint arXiv …, 2024 - arxiv.org
The advent of Multimodal Large Language Models, leveraging the power of Large
Language Models, has recently demonstrated superior multimodal understanding and …

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

K Hu, P Wu, F Pu, W Xiao, Y Zhang, X Yue, B Li… - arXiv preprint arXiv …, 2025 - arxiv.org
Humans acquire knowledge through three cognitive stages: perceiving information,
comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve …

LongViTU: Instruction Tuning for Long-Form Video Understanding

R Wu, X Ma, H Ci, Y Fan, Y Wang, H Zhao, Q Li… - arXiv preprint arXiv …, 2025 - arxiv.org
This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h videos),
automatically generated dataset for long-form video understanding. We developed a …