Do language models understand time?

X Ding, L Wang - arXiv preprint arXiv:2412.13845, 2024 - arxiv.org
Large language models (LLMs) have revolutionized video-based computer vision
applications, including action recognition, anomaly detection, and video summarization …

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

Y Wang, X Li, Z Yan, Y He, J Yu, X Zeng… - arXiv preprint arXiv…, 2025 - arxiv.org
This paper aims to improve the performance of video multimodal large language models
(MLLM) via long and rich context (LRC) modeling. As a result, we develop a new version of …

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

B Zhang, K Li, Z Cheng, Z Hu, Y Yuan, G Chen… - arXiv preprint arXiv…, 2025 - arxiv.org
In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for
image and video understanding. The core design philosophy of VideoLLaMA3 is vision …

Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

P Hansen-Estruch, D Yan, CY Chung, O Zohar… - arXiv preprint arXiv…, 2025 - arxiv.org
Visual tokenization via auto-encoding empowers state-of-the-art image and video
generative models by compressing pixels into a latent space. Although scaling Transformer …

Temporal Preference Optimization for Long-Form Video Understanding

R Li, X Wang, Y Zhang, Z Wang… - arXiv preprint arXiv…, 2025 - arxiv.org
Despite significant advancements in video large multimodal models (video-LMMs),
achieving effective temporal grounding in long-form videos remains a challenge for existing …

Redundancy Principles for MLLM Benchmarks

Z Zhang, X Zhao, X Fang, C Li, X Liu, X Min… - arXiv preprint arXiv…, 2025 - arxiv.org
With the rapid iteration of Multimodal Large Language Models (MLLMs) and the evolving
demands of the field, the number of benchmarks produced annually has surged into the …