Kangaroo: A powerful video-language model supporting long-context video input
Rapid advancements have been made in extending Large Language Models (LLMs) to
Large Multi-modal Models (LMMs). However, extending input modality of LLMs to video data …
MME-Survey: A comprehensive survey on evaluation of multimodal LLMs
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …
CinePile: A long video question answering dataset and benchmark
Current datasets for long-form video understanding often fall short of providing genuine long-
form comprehension challenges, as many tasks derived from these datasets can be …
Apollo: An exploration of video understanding in large multimodal models
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …
TimeMarker: A versatile Video-LLM for long and short video understanding with superior temporal localization ability
Rapid development of large language models (LLMs) has significantly advanced multimodal
large language models (MLLMs), particularly in vision-language tasks. However, existing …
AMEGO: Active Memory from long EGOcentric videos
Egocentric videos provide a unique perspective into individuals' daily experiences, yet their
unstructured nature presents challenges for perception. In this paper, we introduce AMEGO …
VidCompress: Memory-enhanced temporal compression for video understanding in large language models
Video-based multimodal large language models (Video-LLMs) possess significant potential
for video understanding tasks. However, most Video-LLMs treat videos as a sequential set of …
PhysGame: Uncovering physical commonsense violations in gameplay videos
Recent advancements in video-based large language models (Video LLMs) have witnessed
the emergence of diverse capabilities to reason and interpret dynamic visual content …
VideoChat-Flash: Hierarchical compression for long-context video modeling
Long-context modeling is a critical capability for multimodal large language models
(MLLMs), enabling them to process long-form contents with implicit memorization. Despite …
CG-Bench: Clue-grounded question answering benchmark for long video understanding
Most existing video understanding benchmarks for multimodal large language models
(MLLMs) focus only on short videos. The limited number of benchmarks for long video …