Vision-language models for medical report generation and visual question answering: A review

I Hartsock, G Rasool - Frontiers in Artificial Intelligence, 2024 - frontiersin.org
Medical vision-language models (VLMs) combine computer vision (CV) and natural
language processing (NLP) to analyze visual and textual medical data. Our paper reviews …

Visual–language foundation models in medicine

C Liu, Y **, Z Guan, T Li, Y Qin, B Qian, Z Jiang… - The Visual …, 2024 - Springer
By integrating visual and linguistic understanding, visual–language foundation models
(VLFMs) have great potential to advance the interpretation of medical data, thereby …

TF-FAS: twofold-element fine-grained semantic guidance for generalizable face anti-spoofing

X Wang, KY Zhang, T Yao, Q Zhou, S Ding… - … on Computer Vision, 2024 - Springer
Generalizable face anti-spoofing (FAS) approaches have recently garnered considerable
attention due to their robustness in unseen scenarios. Some recent methods incorporate …

Can LLMs' tuning methods work in medical multimodal domain?

J Chen, Y Jiang, D Yang, M Li, J Wei, Z Qian… - … Conference on Medical …, 2024 - Springer
While Large Language Models (LLMs) excel in world knowledge understanding, adapting
them to specific subfields requires precise adjustments. Due to the model's vast …

Towards a generalizable pathology foundation model via unified knowledge distillation

J Ma, Z Guo, F Zhou, Y Wang, Y Xu, Y Cai… - arXiv preprint arXiv …, 2024 - arxiv.org
Foundation models pretrained on large-scale datasets are revolutionizing the field of
computational pathology (CPath). The generalization ability of foundation models is crucial …

C3Net: Compound Conditioned ControlNet for Multimodal Content Generation

J Zhang, Y Liu, YW Tai… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
We present Compound Conditioned ControlNet (C3Net), a novel generative neural
architecture taking conditions from multiple modalities and synthesizing multimodal contents …

MISS: A Generative Pre-training and Fine-Tuning Approach for Med-VQA

J Chen, D Yang, Y Jiang, Y Lei, L Zhang - International Conference on …, 2024 - Springer
Medical visual question answering (VQA) is a challenging multimodal task, where Vision-
Language Pre-trained (VLP) models can effectively improve the generalization performance …

Advancing multimodal medical capabilities of Gemini

L Yang, S Xu, A Sellergren, T Kohlberger… - arXiv preprint arXiv …, 2024 - arxiv.org
Many clinical tasks require an understanding of specialized data, such as medical images
and genomics, which is not typically found in general-purpose large multimodal models …

Medical vision language pretraining: A survey

P Shrestha, S Amgain, B Khanal, CA Linte… - arXiv preprint arXiv …, 2023 - arxiv.org
Medical Vision Language Pretraining (VLP) has recently emerged as a promising solution to
the scarcity of labeled data in the medical domain. By leveraging paired/unpaired vision and …

Interpretable medical image visual question answering via multi-modal relationship graph learning

X Hu, L Gu, K Kobayashi, L Liu, M Zhang… - Medical Image …, 2024 - Elsevier
Medical Visual Question Answering (VQA) is an important task in medical multi-
modal Large Language Models (LLMs), aiming to answer clinically relevant questions …