A comprehensive review of multimodal large language models: Performance and challenges across different tasks

J Wang, H Jiang, Y Liu, C Ma, X Zhang, Y Pan… - arXiv preprint arXiv …, 2024 - arxiv.org
In an era defined by the explosive growth of data and rapid technological advancements,
Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence …

Has multimodal learning delivered universal intelligence in healthcare? A comprehensive survey

Q Lin, Y Zhu, X Mei, L Huang, J Ma, K He, Z Peng… - Information …, 2024 - Elsevier
The rapid development of artificial intelligence has constantly reshaped the field of
intelligent healthcare and medicine. As a vital technology, multimodal learning has …

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

LLaVA-OneVision: Easy visual task transfer

B Li, Y Zhang, D Guo, R Zhang, F Li, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …

Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

S Tong, E Brown, P Wu, S Woo, M Middepogu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …

MiniCPM-V: A GPT-4V level MLLM on your phone

Y Yao, T Yu, A Zhang, C Wang, J Cui, H Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally
reshaped the landscape of AI research and industry, shedding light on a promising path …

PaliGemma: A versatile 3B VLM for transfer

L Beyer, A Steiner, AS Pinto, A Kolesnikov… - arXiv preprint arXiv …, 2024 - arxiv.org
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …

InternLM-XComposer-2.5: A versatile large vision-language model supporting long-contextual input and output

P Zhang, X Dong, Y Zang, Y Cao, R Qian… - arXiv preprint arXiv …, 2024 - arxiv.org
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …

VILA-U: A unified foundation model integrating visual understanding and generation

Y Wu, Z Zhang, J Chen, H Tang, D Li, Y Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
VILA-U is a Unified foundation model that integrates Video, Image, Language understanding
and generation. Traditional visual language models (VLMs) use separate modules for …

Eagle: Exploring the design space for multimodal LLMs with mixture of encoders

M Shi, F Liu, S Wang, S Liao, S Radhakrishnan… - arXiv preprint arXiv …, 2024 - arxiv.org
The ability to accurately interpret complex visual information is a crucial focus for multimodal
large language models (MLLMs). Recent work indicates that enhanced visual perception …