Seeing is Deceiving: Exploitation of Visual Pathways in Multi-Modal Language Models

P Janowczyk, L Laurier, A Giulietta, A Octavia… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-Modal Language Models (MLLMs) have transformed artificial intelligence by
combining visual and text data, making applications like image captioning, visual question …

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

B Zhang, K Li, Z Cheng, Z Hu, Y Yuan, G Chen… - arXiv preprint arXiv …, 2025 - arxiv.org
In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for
image and video understanding. The core design philosophy of VideoLLaMA3 is vision …

From Specific-MLLM to Omni-MLLM: A Survey about the MLLMs aligned with Multi-Modality

S Jiang, J Liang, M Liu, B Qin - arXiv preprint arXiv:2412.11694, 2024 - arxiv.org
From the Specific-MLLM, which excels in single-modal tasks, to the Omni-MLLM, which
extends the range of general modalities, this evolution aims to achieve understanding and …

Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding

J Li, J Zhang, Z Jie, L Ma, G Li - arXiv preprint arXiv:2501.01926, 2025 - arxiv.org
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language
understanding for downstream multi-modal tasks. Despite their success, LVLMs …