How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …
Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …
MMInstruct: A high-quality multi-modal instruction tuning dataset with extensive diversity
Despite the effectiveness of vision-language supervised fine-tuning in enhancing the
performance of vision large language models (VLLMs), existing visual instruction tuning …
DenseFusion-1M: Merging vision experts for comprehensive multimodal perception
Existing Multimodal Large Language Models (MLLMs) increasingly emphasize complex
understanding of various visual elements, including multiple objects, text information, and …
A survey on evaluation of multimodal large language models
J Huang, J Zhang - arXiv preprint arXiv:2408.15769, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning
system by integrating powerful Large Language Models (LLMs) with various modality …
DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …
Mini-InternVL: A flexible-transfer pocket multi-modal model with 5% parameters and 90% performance
Multi-modal large language models (MLLMs) have demonstrated impressive performance in
vision-language tasks across a wide range of domains. However, the large model scale and …
Enhancing the reasoning ability of multimodal large language models via mixed preference optimization
Existing open-source multimodal large language models (MLLMs) generally follow a
training process involving pre-training and supervised fine-tuning. However, these models …
The curse of multi-modalities: Evaluating hallucinations of large multimodal models across language, visual, and audio
Recent advancements in large multimodal models (LMMs) have significantly enhanced
performance across diverse tasks, with ongoing efforts to further integrate additional …
A survey on multimodal benchmarks: In the era of large AI models
The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial
advancements in artificial intelligence, significantly enhancing the capability to understand …