The (R)evolution of multimodal large language models: a survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …

Controlmllm: Training-free visual prompt learning for multimodal large language models

M Wu, X Cai, J Ji, J Li, O Huang… - Advances in …, 2025 - proceedings.neurips.cc
In this work, we propose a training-free method to inject visual prompts into Multimodal
Large Language Models (MLLMs) through learnable latent variable optimization. We …

Investigating and mitigating the multimodal hallucination snowballing in large vision-language models

W Zhong, X Feng, L Zhao, Q Li, L Huang, Y Gu… - arXiv preprint arXiv …, 2024 - arxiv.org
Though advanced in understanding visual information through human language, Large Vision-
Language Models (LVLMs) still suffer from multimodal hallucinations. A natural concern is …

Artemis: Towards referential understanding in complex videos

J Qiu, Y Zhang, X Tang, L Xie, T Ma… - The Thirty-eighth …, 2024 - proceedings.neurips.cc
Videos carry rich visual information, including object descriptions, actions, and interactions,
but existing multimodal large language models (MLLMs) fall short in referential …

VideoRefer Suite: Advancing spatial-temporal object understanding with Video LLM

Y Yuan, H Zhang, W Li, Z Cheng, B Zhang, L Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Video Large Language Models (Video LLMs) have recently exhibited remarkable
capabilities in general video understanding. However, they mainly focus on holistic …

ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension

T Ma, L **e, Y Tian, B Yang, Q Ye - arxiv preprint arxiv:2406.11327, 2024 - arxiv.org
Aligning vision and language concepts at a finer level remains an essential topic for
multimodal large language models (MLLMs), particularly for tasks such as referring and …

Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

M Heo, MH Chen, DA Huang, S Liu… - arXiv preprint arXiv …, 2025 - arxiv.org
We present Omni-RGPT, a multimodal large language model designed to facilitate region-
level comprehension for both images and videos. To achieve consistent region …

Personalized Large Vision-Language Models

C Pham, H Phan, D Doermann, Y Tian - arXiv preprint arXiv:2412.17610, 2024 - arxiv.org
Personalization has gained significant attention in image generation yet remains
underexplored for large vision-language models (LVLMs). Beyond generic ones, with …

Visual Large Language Models for Generalized and Specialized Applications

Y Li, Z Lai, W Bao, Z Tan, A Dao, K Sui, J Shen… - arXiv preprint arXiv …, 2025 - arxiv.org
Vision-language models (VLMs) have emerged as a powerful tool for learning a unified
embedding space for vision and language. Inspired by large language models, which have …

Exploring Spatial Language Grounding Through Referring Expressions

A Tumu, P Kordjamshidi - arXiv preprint arXiv:2502.04359, 2025 - arxiv.org
Spatial reasoning is an important component of human cognition and is an area in which
the latest vision-language models (VLMs) show signs of difficulty. The current analysis …