A Survey of Multimodal Large Language Models

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org
With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …

CLIP in medical imaging: A comprehensive survey

Z Zhao, Y Liu, H Wu, M Wang, Y Li, S Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Contrastive Language-Image Pre-training (CLIP), a simple yet effective pre-training
paradigm, successfully introduces text supervision to vision models. It has shown promising …

What matters when building vision-language models?

H Laurençon, L Tronchon, M Cord… - Advances in Neural …, 2025 - proceedings.neurips.cc
The growing interest in vision-language models (VLMs) has been driven by improvements in
large language models and vision transformers. Despite the abundance of literature on this …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This monograph presents a comprehensive survey of multimodal foundation models that
demonstrate vision and vision-language capabilities, focusing on the transition from …

MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

P Lu, H Bansal, T Xia, J Liu, C Li, H Hajishirzi… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive
problem-solving skills in many tasks and domains, but their ability in mathematical …

Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data

C Wu, X Zhang, Y Zhang, Y Wang, W Xie - arXiv preprint arXiv:2308.02463, 2023 - arxiv.org
In this study, we aim to initiate the development of Radiology Foundation Model, termed as
RadFM. We consider the construction of foundational models from three perspectives …

A generalist vision–language foundation model for diverse biomedical tasks

K Zhang, R Zhou, E Adhikarla, Z Yan, Y Liu, J Yu… - Nature Medicine, 2024 - nature.com
Traditional biomedical artificial intelligence (AI) models, designed for specific tasks or
modalities, often exhibit limited flexibility in real-world deployment and struggle to utilize …

MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models

MS Sepehri, Z Fabian, M Soltanolkotabi… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have tremendous potential to improve the
accuracy, availability, and cost-effectiveness of healthcare by providing automated solutions …

RegionGPT: Towards region understanding vision language model

Q Guo, S De Mello, H Yin, W Byeon… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision language models (VLMs) have experienced rapid advancements through the
integration of large language models (LLMs) with image-text pairs, yet they struggle with …

OmniMedVQA: A new large-scale comprehensive evaluation benchmark for medical LVLM

Y Hu, T Li, Q Lu, W Shao, J He… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Vision-Language Models (LVLMs) have demonstrated remarkable
capabilities in various multimodal tasks. However, their potential in the medical domain …