VideoAgent: A Memory-Augmented Multimodal Agent for Video Understanding
We explore how reconciling several foundation models (large language models and vision-
language models) with a novel unified memory mechanism could tackle the challenging …
EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
The advent of Multimodal Large Language Models, leveraging the power of Large
Language Models, has recently demonstrated superior multimodal understanding and …
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Humans acquire knowledge through three cognitive stages: perceiving information,
comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve …
LongViTU: Instruction Tuning for Long-Form Video Understanding
This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h videos),
automatically generated dataset for long-form video understanding. We developed a …