How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …
MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?
The remarkable progress of Multi-modal Large Language Models (MLLMs) has gained
unparalleled attention. However, their capabilities in visual math problem-solving remain …
VLMEvalKit: An open-source toolkit for evaluating large multi-modality models
We present VLMEvalKit: an open-source toolkit for evaluating large multi-modality models
based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework …
MME-Survey: A comprehensive survey on evaluation of multimodal LLMs
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …
Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …
EfficientQAT: Efficient quantization-aware training for large language models
Large language models (LLMs) are crucial in modern natural language processing and
artificial intelligence. However, they face challenges in managing their significant memory …
DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language
Models that significantly improves upon its predecessor, DeepSeek-VL, through …
GMAI-MMBench: A comprehensive multimodal evaluation benchmark towards general medical AI
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as
imaging, text, and physiological signals, and can be applied in various fields. In the medical …
Multimodal self-instruct: Synthetic abstract image and visual reasoning instruction using language model
Although most current large multimodal models (LMMs) can already understand photos of
natural scenes and portraits, their understanding of abstract images, e.g., charts, maps, or …
E.T. Bench: Towards open-ended event-level video-language understanding
Recent advances in Video Large Language Models (Video-LLMs) have demonstrated their
great potential in general-purpose video understanding. To verify the significance of these …