MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arXiv preprint arXiv…, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

MMBench: Is your multi-modal model an all-around player?

Y Liu, H Duan, Y Zhang, B Li, S Zhang, W Zhao… - European Conference on Computer Vision, 2024 - Springer
Large vision-language models (VLMs) have recently achieved remarkable progress,
exhibiting impressive multimodal perception and reasoning abilities. However, effectively …

MiniCPM-V: A GPT-4V level MLLM on your phone

Y Yao, T Yu, A Zhang, C Wang, J Cui, H Zhu… - arXiv preprint arXiv…, 2024 - arxiv.org
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally
reshaped the landscape of AI research and industry, shedding light on a promising path …

Blink: Multimodal large language models can see but not perceive

X Fu, Y Hu, B Li, Y Feng, H Wang, X Lin, D Roth… - European Conference on Computer Vision, 2024 - Springer
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses
on core visual perception abilities not found in other evaluations. Most of the Blink tasks can …

OMG-LLaVA: Bridging image-level, object-level, pixel-level reasoning and understanding

T Zhang, X Li, H Fei, H Yuan, S Wu… - Advances in Neural Information Processing Systems, 2025 - proceedings.neurips.cc
Current universal segmentation methods demonstrate strong capabilities in pixel-level
image and video understanding. However, they lack reasoning abilities and cannot be …

InternLM-XComposer2: Mastering free-form text-image composition and comprehension in vision-language large model

X Dong, P Zhang, Y Zang, Y Cao, B Wang… - arXiv preprint arXiv…, 2024 - arxiv.org
We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-
form text-image composition and comprehension. This model goes beyond conventional …

MMT-Bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI

K Ying, F Meng, J Wang, Z Li, H Lin, Y Yang… - arXiv preprint arXiv…, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) have made significant strides in general-purpose
multimodal applications such as visual dialogue and embodied navigation. However …

MoAI: Mixture of all intelligence for large language and vision models

BK Lee, B Park, CW Kim, YM Ro - European Conference on Computer Vision, 2024 - Springer
The rise of large language models (LLMs) and instruction tuning has led to the current trend
of instruction-tuned large language and vision models (LLVMs). This trend involves either …

A survey on evaluation of multimodal large language models

J Huang, J Zhang - arXiv preprint arXiv:2408.15769, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning
system by integrating powerful Large Language Models (LLMs) with various modality …

Video-CCAM: Enhancing video-language understanding with causal cross-attention masks for short and long videos

J Fei, D Li, Z Deng, Z Wang, G Liu, H Wang - arXiv preprint arXiv…, 2024 - arxiv.org
Multi-modal large language models (MLLMs) have demonstrated considerable potential
across various downstream tasks that require cross-domain knowledge. MLLMs capable of …