DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Z Wu, X Chen, Z Pan, X Liu, W Liu, D Dai… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …

Apollo: An Exploration of Video Understanding in Large Multimodal Models

O Zohar, X Wang, Y Dubois, N Mehta, T Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …

MiniMax-01: Scaling Foundation Models with Lightning Attention

A Li, B Gong, B Yang, B Shan, C Liu, C Zhu… - arXiv preprint arXiv …, 2025 - arxiv.org
We introduce the MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are
comparable to top-tier models while offering superior capabilities in processing longer …

MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

J Chen, T Liang, S Siu, Z Wang, K Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500
real-world tasks to address the highly heterogeneous daily use cases of end users. Our …

WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines

GI Winata, F Hudi, PA Irawan, D Anugraha… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly
in languages other than English and in underrepresented cultural contexts. To evaluate their …

Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review

H Wang, B Guo, Y Zeng, M Chen, Y Ding… - ACM Transactions on …, 2022 - dl.acm.org
The intelligent dialogue system, which aims to communicate with humans harmoniously in
natural language, holds great promise for advancing human-machine interaction …

TVBench: Redesigning Video-Language Evaluation

D Cores, M Dorkenwald, M Mucientes… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models have demonstrated impressive performance when integrated with
vision models, even enabling video understanding. However, evaluating these video models …

VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

W Ren, H Yang, J Min, C Wei, W Chen - arXiv preprint arXiv:2412.00927, 2024 - arxiv.org
Current large multimodal models (LMMs) face significant challenges in processing and
comprehending long-duration or high-resolution videos, mainly due to the lack of …

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

W Hong, Y Cheng, Z Yang, W Wang, L Wang… - arXiv preprint arXiv …, 2025 - arxiv.org
In recent years, vision language models (VLMs) have made significant advancements in
video understanding. However, a crucial capability, fine-grained motion comprehension, …

ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

Q Jiang, Y Yang, Y Xiong, Y Chen, Z Zeng… - arXiv preprint arXiv …, 2024 - arxiv.org
Perception and understanding are two pillars of computer vision. While multimodal large
language models (MLLMs) have demonstrated remarkable visual understanding capabilities …