Has multimodal learning delivered universal intelligence in healthcare? A comprehensive survey

Q Lin, Y Zhu, X Mei, L Huang, J Ma, K He, Z Peng… - Information …, 2024 - Elsevier
The rapid development of artificial intelligence has constantly reshaped the field of
intelligent healthcare and medicine. As a vital technology, multimodal learning has …

From image to language: A critical analysis of visual question answering (vqa) approaches, challenges, and opportunities

MF Ishmam, MSH Shovon, MF Mridha, N Dey - Information Fusion, 2024 - Elsevier
The multimodal task of Visual Question Answering (VQA) encompassing elements of
Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers …

Robust Visual Question Answering utilizing Bias Instances and Label Imbalance

L Zhao, K Li, J Qi, Y Sun, Z Zhu - Knowledge-Based Systems, 2024 - Elsevier
Visual Question Answering (VQA) models often suffer from bias issues which cause
their predictions to rely on superficial correlations in datasets rather than the intrinsic …

Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey

J Kuang, Y Shen, J Xie, H Luo, Z Xu, R Li, Y Li… - ACM Computing …, 2024 - dl.acm.org
Visual Question Answering (VQA) is a challenge task that combines natural language
processing and computer vision techniques and gradually becomes a benchmark test task …

Show me what and where has changed? question answering and grounding for remote sensing change detection

K Li, F Dong, D Wang, S Li, Q Wang, X Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
Remote sensing change detection aims to perceive changes occurring on the Earth's
surface from remote sensing data in different periods, and feed these changes back to …

Bias-guided margin loss for robust Visual Question Answering

Y Sun, J Qi, Z Zhu, K Li, L Zhao, L Lv - Information Processing & …, 2025 - Elsevier
Visual Question Answering (VQA) suffers from language prior issue, where models
tend to rely on dataset biases to answer the questions while ignoring the image information …

A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning

NM Foteinopoulou, E Ghorbel, D Aouada - arXiv preprint arXiv …, 2024 - arxiv.org
Explainability in artificial intelligence is crucial for restoring trust, particularly in areas like
face forgery detection, where viewers often struggle to distinguish between real and …

SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks

KC Kahl, S Erkan, J Traub, CT Lüth… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language Models (VLMs) have great potential in medical tasks, like Visual Question
Answering (VQA), where they could act as interactive assistants for both patients and …

Task Progressive Curriculum Learning for Robust Visual Question Answering

A Akl, A Khamis, Z Wang, A Cheraghian… - arXiv preprint arXiv …, 2024 - arxiv.org
Visual Question Answering (VQA) systems are known for their poor performance in out-of-
distribution datasets. An issue that was addressed in previous works through ensemble …

When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis

R Zhang, B Wang, J Zhang, Z Bian, C Feng… - arXiv preprint arXiv …, 2025 - arxiv.org
The increasing availability of traffic videos functioning on a 24/7/365 time scale has the great
potential of increasing the spatio-temporal coverage of traffic accidents, which will help …