MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

C Fu, YF Zhang, S Yin, B Li, X Fang, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

X Wang, D Song, S Chen, C Zhang, B Wang - arXiv preprint arXiv …, 2024 - arxiv.org
Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is
crucial for video understanding, high-resolution image understanding, and multi-modal …

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

Y Shu, P Zhang, Z Liu, M Qin, J Zhou, T Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Although current Multi-modal Large Language Models (MLLMs) demonstrate promising
results in video understanding, processing extremely long videos remains an ongoing …

A Survey on Multimodal Benchmarks: In the Era of Large AI Models

L Li, G Chen, H Shi, J Xiao, L Chen - arXiv preprint arXiv:2409.18142, 2024 - arxiv.org
The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial
advancements in artificial intelligence, significantly enhancing the capability to understand …

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

X Li, Y Wang, J Yu, X Zeng, Y Zhu, H Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context modeling is a critical capability for multimodal large language models
(MLLMs), enabling them to process long-form contents with implicit memorization. Despite …

Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark

TH Wu, G Biamby, J Quenum, R Gupta… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Multimodal Models (LMMs) have made significant strides in visual question
answering for single images. Recent advancements like long-context LMMs have allowed …

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

Y Wang, X Li, Z Yan, Y He, J Yu, X Zeng… - arXiv preprint arXiv …, 2025 - arxiv.org
This paper aims to improve the performance of video multimodal large language models
(MLLMs) via long and rich context (LRC) modeling. As a result, we develop a new version of …

VCBench: A Controllable Benchmark for Symbolic and Abstract Challenges in Video Cognition

C Li, Q Chen, Z Li, F Tao, Y Zhang - arXiv preprint arXiv:2411.09105, 2024 - arxiv.org
Recent advancements in Large Video-Language Models (LVLMs) have driven the
development of benchmarks designed to assess cognitive abilities in video-based tasks …