AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

K Gong, K Feng, B Li, Y Wang, M Cheng… - arXiv preprint arXiv:…, 2024 - arxiv.org
Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro,
and Reka Core, have expanded their capabilities to include vision and audio modalities …

ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models

J Chen, T Zhang, S Huang, Y Niu, L Zhang… - arXiv preprint arXiv:…, 2024 - arxiv.org
Despite the recent breakthroughs achieved by Large Vision Language Models (LVLMs) in
understanding and responding to complex visual-textual contexts, their inherent …

Vision-Language Models Represent Darker-Skinned Black Individuals as More Homogeneous than Lighter-Skinned Black Individuals

MHJ Lee, S Jeon - arXiv preprint arXiv:2412.09668, 2024 - arxiv.org
Vision-Language Models (VLMs) combine Large Language Model (LLM) capabilities with
image processing, enabling tasks like image captioning and text-to-image generation. Yet …

Self-Training Large Language and Vision Assistant for Medical Question Answering

G Sun, C Qin, H Fu, L Wang, Z Tao - Proceedings of the 2024 …, 2024 - aclanthology.org
Large Vision-Language Models (LVLMs) have shown significant potential in
assisting medical diagnosis by leveraging extensive biomedical datasets. However, the …

From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models

S Qian, Z Zhou, D Xue, B Wang, C Xu - arXiv preprint arXiv:2409.18996, 2024 - arxiv.org
Cross-modal reasoning (CMR), the intricate process of synthesizing and drawing inferences
across divergent sensory modalities, is increasingly recognized as a crucial capability in the …