AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro,
and Reka Core, have expanded their capabilities to include vision and audio modalities …
ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models
Despite the recent breakthroughs achieved by Large Vision Language Models (LVLMs) in
understanding and responding to complex visual-textual contexts, their inherent …
Vision-Language Models Represent Darker-Skinned Black Individuals as More Homogeneous than Lighter-Skinned Black Individuals
MHJ Lee, S Jeon. arXiv preprint arXiv:2412.09668, 2024.
Vision-Language Models (VLMs) combine Large Language Model (LLM) capabilities with
image processing, enabling tasks like image captioning and text-to-image generation. Yet …
Self-Training Large Language and Vision Assistant for Medical Question Answering
Large Vision-Language Models (LVLMs) have shown significant potential in
assisting medical diagnosis by leveraging extensive biomedical datasets. However, the …
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
Cross-modal reasoning (CMR), the intricate process of synthesizing and drawing inferences
across divergent sensory modalities, is increasingly recognized as a crucial capability in the …