- Academic Search

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer

In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

Cited by 351 - All 2 versions

[PDF] arxiv.org

MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

P Lu, H Bansal, T …

… mathematical reasoning for multimodal large language models

W Shi, Z Hu, Y Bin, J Liu, Y Yang, SK Ng, L Bing… - arxiv preprint arxiv …, 2024 - arxiv.org

Large language models (LLMs) have demonstrated impressive reasoning capabilities,
particularly in textual mathematical problem-solving. However, existing open-source image …

Cited by 39 - All 3 versions

[PDF] arxiv.org

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arxiv preprint arxiv …, 2024 - arxiv.org

We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

Cited by 25

[PDF] arxiv.org

NVLM: Open frontier-class multimodal LLMs

W Dai, N Lee, B Wang, Z Yang, Z Liu, J Barker… - arxiv preprint arxiv …, 2024 - arxiv.org

We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs)
that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary …

Cited by 21 - All 4 versions

[PDF] arxiv.org

MMInstruct: A high-quality multi-modal instruction tuning dataset with extensive diversity

Y Liu, Y Cao, Z Gao, W Wang, Z Chen, W Wang… - Science China …, 2024 - Springer

Despite the effectiveness of vision-language supervised fine-tuning in enhancing the
performance of vision large language models (VLLMs), existing visual instruction tuning …

Cited by 15 - All 4 versions

[PDF] arxiv.org

Revision: Rendering tools enable spatial fidelity in vision-language models

A Chatterjee, Y Luo, T Gokhale, Y Yang… - European Conference on …, 2024 - Springer

Text-to-Image (T2I) and multimodal large language models (MLLMs) have been
adopted in solutions for several computer vision and multimodal learning tasks. However, it …

Cited by 2 - All 9 versions

Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites
