Seeing is Deceiving: Exploitation of Visual Pathways in Multi-Modal Language Models
P Janowczyk, L Laurier, A Giulietta, A Octavia… - arXiv preprint arXiv…, 2024 - arxiv.org
Multi-Modal Language Models (MLLMs) have transformed artificial intelligence by
combining visual and text data, making applications like image captioning, visual question …
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for
image and video understanding. The core design philosophy of VideoLLaMA3 is vision …
From Specific-MLLM to Omni-MLLM: A Survey about the MLLMs aligned with Multi-Modality
From the Specific-MLLM, which excels in single-modal tasks, to the Omni-MLLM, which
extends the range of general modalities, this evolution aims to achieve understanding and …
Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding
J Li, J Zhang, Z Jie, L Ma, G Li - arXiv preprint arXiv:2501.01926, 2025 - arxiv.org
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-
language understanding for downstream multi-modal tasks. Despite their success, LVLMs …