Multimodal large language models in health care: applications, challenges, and future outlook
In the complex and multidimensional field of medicine, multimodal data are prevalent and
crucial for informed clinical decisions. Multimodal data span a broad spectrum of data types …
Exploring the frontier of vision-language models: A survey of current methodologies and future directions
The advent of Large Language Models (LLMs) has significantly reshaped the trajectory of
the AI revolution. Nevertheless, these LLMs exhibit a notable limitation, as they are primarily …
3D-VLA: A 3D vision-language-action generative world model
Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the
broader realm of the 3D physical world. Furthermore, they perform action prediction by …
Learning visual grounding from generative vision and language model
Visual grounding tasks aim to localize image regions based on natural language references.
In this work, we explore whether generative VLMs predominantly trained on image-text data …
Learning to correction: Explainable feedback generation for visual commonsense reasoning distractor
Large multimodal models (LMMs) have shown remarkable performance in the visual
commonsense reasoning (VCR) task, which aims to answer a multiple-choice question …
Kestrel: Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding
While 3D MLLMs have achieved significant progress, they are restricted to object and scene
understanding and struggle to understand 3D spatial structures at the part level. In this …
Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision
Current large multimodal models (LMMs) face challenges in grounding, which requires the
model to relate language components to visual entities. Contrary to the common practice …
Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models
Existing Large Vision-Language Models (LVLMs) excel at matching concepts across multimodal inputs but struggle with compositional concepts and high-level relationships between …
EditScout: Locating Forged Regions from Diffusion-based Edited Images with Multimodal LLM
Image editing technologies transform, adjust, remove, or otherwise alter
images. Recent research has significantly improved the capabilities of image editing tools …
VIKSER: Visual Knowledge-Driven Self-Reinforcing Reasoning Framework
C Zhang, C Wang, Y Zhou, Y Peng - arXiv preprint arXiv:2502.00711, 2025 - arxiv.org
Visual reasoning refers to the task of solving questions about visual information. Current
visual reasoning methods typically employ pre-trained vision-language model (VLM) …