Explainable and interpretable multimodal large language models: A comprehensive survey

Y Dang, K Huang, J Huo, Y Yan, S Huang, D Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with
large language models (LLMs) and computer vision (CV) systems driving advancements in …

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning

H Zhang, M Gao, Z Gan, P Dufter, N Wenzel… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MM1.5, a new family of multimodal large language models (MLLMs) designed
to enhance capabilities in text-rich image understanding, visual referring and grounding …

Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance

Z Gao, Z Chen, E Cui, Y Ren, W Wang, J Zhu, H Tian… - Visual Intelligence, 2024 - Springer
Multi-modal large language models (MLLMs) have demonstrated impressive performance in
vision-language tasks across a wide range of domains. However, the large model scale and …

NVILA: Efficient frontier visual language models

Z Liu, L Zhu, B Shi, Z Zhang, Y Lou, S Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Visual language models (VLMs) have made significant advances in accuracy in recent
years. However, their efficiency has received much less attention. This paper introduces …

Apollo: An exploration of video understanding in large multimodal models

O Zohar, X Wang, Y Dubois, N Mehta, T Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …

Phantom of latent for large language and vision models

BK Lee, S Chung, CW Kim, B Park, YM Ro - arXiv preprint arXiv …, 2024 - arxiv.org
The success of visual instruction tuning has accelerated the development of large language
and vision models (LLVMs). Following the scaling laws of instruction-tuned large language …

Scaling inference-time search with vision value model for improved visual comprehension

W Xiyao, Y Zhengyuan, L Linjie, L Hongjin… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite significant advancements in vision-language models (VLMs), effective
approaches to enhance response quality by scaling inference-time computation are lacking. This …

Your mixture-of-experts llm is secretly an embedding model for free

Z Li, T Zhou - arXiv preprint arXiv:2410.10814, 2024 - arxiv.org
While large language models (LLMs) excel on generation tasks, their decoder-only
architecture often limits their potential as embedding models if no further representation …

Do language models understand time?

X Ding, L Wang - arXiv preprint arXiv:2412.13845, 2024 - arxiv.org
Large language models (LLMs) have revolutionized video-based computer vision
applications, including action recognition, anomaly detection, and video summarization …