Kangaroo: A powerful video-language model supporting long-context video input

J Liu, Y Wang, H Ma, X Wu, X Ma, X Wei, J Jiao… - arXiv preprint arXiv…, 2024 - arxiv.org
Rapid advancements have been made in extending Large Language Models (LLMs) to
Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data …

MME-Survey: A comprehensive survey on evaluation of multimodal LLMs

C Fu, YF Zhang, S Yin, B Li, X Fang, S Zhao… - arXiv preprint arXiv…, 2024 - arxiv.org
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …

CinePile: A long video question answering dataset and benchmark

R Rawal, K Saifullah, M Farré, R Basri… - arXiv preprint arXiv…, 2024 - arxiv.org
Current datasets for long-form video understanding often fall short of providing genuine long-
form comprehension challenges, as many tasks derived from these datasets can be …

Apollo: An exploration of video understanding in large multimodal models

O Zohar, X Wang, Y Dubois, N Mehta, T Xiao… - arXiv preprint arXiv…, 2024 - arxiv.org
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …

TimeMarker: A versatile video-LLM for long and short video understanding with superior temporal localization ability

S Chen, X Lan, Y Yuan, Z Jie, L Ma - arXiv preprint arXiv:2411.18211, 2024 - arxiv.org
The rapid development of large language models (LLMs) has significantly advanced multimodal
large language models (MLLMs), particularly in vision-language tasks. However, existing …

AMEGO: Active Memory from long EGOcentric videos

G Goletto, T Nagarajan, G Averta, D Damen - European Conference on …, 2024 - Springer
Egocentric videos provide a unique perspective into individuals' daily experiences, yet their
unstructured nature presents challenges for perception. In this paper, we introduce AMEGO …

VidCompress: Memory-enhanced temporal compression for video understanding in large language models

X Lan, Y Yuan, Z Jie, L Ma - arXiv preprint arXiv:2410.11417, 2024 - arxiv.org
Video-based multimodal large language models (Video-LLMs) possess significant potential
for video understanding tasks. However, most Video-LLMs treat videos as a sequential set of …

PhysGame: Uncovering physical commonsense violations in gameplay videos

M Cao, H Tang, H Zhao, H Guo, J Liu, G Zhang… - arXiv preprint arXiv…, 2024 - arxiv.org
Recent advancements in video-based large language models (Video LLMs) have witnessed
the emergence of diverse capabilities to reason and interpret dynamic visual content …

VideoChat-Flash: Hierarchical compression for long-context video modeling

X Li, Y Wang, J Yu, X Zeng, Y Zhu, H Huang… - arXiv preprint arXiv…, 2024 - arxiv.org
Long-context modeling is a critical capability for multimodal large language models
(MLLMs), enabling them to process long-form content with implicit memorization. Despite …

CG-Bench: Clue-grounded question answering benchmark for long video understanding

G Chen, Y Liu, Y Huang, Y He, B Pei, J Xu… - arXiv preprint arXiv…, 2024 - arxiv.org
Most existing video understanding benchmarks for multimodal large language models
(MLLMs) focus only on short videos. The limited number of benchmarks for long video …