Multimodal large language models in health care: applications, challenges, and future outlook

R AlSaad, A Abd-Alrazaq, S Boughorbel… - Journal of Medical Internet Research, 2024 - jmir.org
In the complex and multidimensional field of medicine, multimodal data are prevalent and
crucial for informed clinical decisions. Multimodal data span a broad spectrum of data types …

Exploring the frontier of vision-language models: A survey of current methodologies and future directions

A Ghosh, A Acharya, S Saha, V Jain… - arXiv preprint arXiv …, 2024 - arxiv.org
The advent of Large Language Models (LLMs) has significantly reshaped the trajectory of
the AI revolution. Nevertheless, these LLMs exhibit a notable limitation, as they are primarily …

3D-VLA: A 3D vision-language-action generative world model

H Zhen, X Qiu, P Chen, J Yang, X Yan, Y Du… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the
broader realm of the 3D physical world. Furthermore, they perform action prediction by …

Learning visual grounding from generative vision and language model

S Wang, D Kim, A Taalimi, C Sun, W Kuo - arXiv preprint arXiv:2407.14563, 2024 - arxiv.org
Visual grounding tasks aim to localize image regions based on natural language references.
In this work, we explore whether generative VLMs predominantly trained on image-text data …

Learning to correction: Explainable feedback generation for visual commonsense reasoning distractor

J Chen, X Hei, Y Xue, Y Wei, J Xie, Y Cai… - Proceedings of the 32nd ACM International Conference on Multimedia, 2024 - dl.acm.org
Large multimodal models (LMMs) have shown remarkable performance in the visual
commonsense reasoning (VCR) task, which aims to answer a multiple-choice question …

Kestrel: Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding

J Fei, M Ahmed, J Ding, EM Bakr… - arXiv preprint arXiv …, 2024 - arxiv.org
While 3D MLLMs have achieved significant progress, they are restricted to object and scene
understanding and struggle to understand 3D spatial structures at the part level. In this …

Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision

S Cao, LY Gui, YX Wang - arXiv preprint arXiv:2410.08209, 2024 - arxiv.org
Current large multimodal models (LMMs) face challenges in grounding, which requires the
model to relate language components to visual entities. Contrary to the common practice …

Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models

QH Le, LH Dang, N Le, T Tran, TM Le - arXiv preprint arXiv:2412.08125, 2024 - arxiv.org
Existing Large Vision-Language Models (LVLMs) excel at matching concepts across multimodal
inputs but struggle with compositional concepts and high-level relationships between …

EditScout: Locating Forged Regions from Diffusion-based Edited Images with Multimodal LLM

Q Nguyen, T Vu, TT Nguyen, Y Wen… - arXiv preprint arXiv …, 2024 - arxiv.org
Image editing technologies are used to transform, adjust, remove, or otherwise alter
images. Recent research has significantly improved the capabilities of image editing tools …

VIKSER: Visual Knowledge-Driven Self-Reinforcing Reasoning Framework

C Zhang, C Wang, Y Zhou, Y Peng - arXiv preprint arXiv:2502.00711, 2025 - arxiv.org
Visual reasoning refers to the task of solving questions about visual information. Current
visual reasoning methods typically employ pre-trained vision-language model (VLM) …