How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

P Lu, H Bansal, T Xia…

Math-LLaVA: Bootstrapping mathematical reasoning for multimodal large language models
W Shi, Z Hu, Y Bin, J Liu, Y Yang, SK Ng, L Bing… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have demonstrated impressive reasoning capabilities,
particularly in textual mathematical problem-solving. However, existing open-source image …

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

NVLM: Open frontier-class multimodal LLMs

W Dai, N Lee, B Wang, Z Yang, Z Liu, J Barker… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs)
that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary …

MMInstruct: A high-quality multi-modal instruction tuning dataset with extensive diversity

Y Liu, Y Cao, Z Gao, W Wang, Z Chen, W Wang… - Science China …, 2024 - Springer
Despite the effectiveness of vision-language supervised fine-tuning in enhancing the
performance of vision large language models (VLLMs), existing visual instruction tuning …

REVISION: Rendering tools enable spatial fidelity in vision-language models

A Chatterjee, Y Luo, T Gokhale, Y Yang… - European Conference on …, 2024 - Springer
Text-to-Image (T2I) and multimodal large language models (MLLMs) have been
adopted in solutions for several computer vision and multimodal learning tasks. However, it …