How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?

R Zhang, D Jiang, Y Zhang, H Lin, Z Guo, P Qiu… - … on Computer Vision, 2024 - Springer
The remarkable progress of Multi-modal Large Language Models (MLLMs) has gained
unparalleled attention. However, their capabilities in visual math problem-solving remain …

VLMEvalKit: An open-source toolkit for evaluating large multi-modality models

H Duan, J Yang, Y Qiao, X Fang, L Chen, Y Liu… - Proceedings of the …, 2024 - dl.acm.org
We present VLMEvalKit: an open-source toolkit for evaluating large multi-modality models
based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework …

MME-Survey: A comprehensive survey on evaluation of multimodal LLMs

C Fu, YF Zhang, S Yin, B Li, X Fang, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

EfficientQAT: Efficient quantization-aware training for large language models

M Chen, W Shao, P Xu, J Wang, P Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are crucial in modern natural language processing and
artificial intelligence. However, they face challenges in managing their significant memory …

DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding

Z Wu, X Chen, Z Pan, X Liu, W Liu, D Dai… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …

GMAI-MMBench: A comprehensive multimodal evaluation benchmark towards general medical AI

P Chen, J Ye, G Wang, Y Li, Z Deng, W Li, T Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as
imaging, text, and physiological signals, and can be applied in various fields. In the medical …

Multimodal Self-Instruct: Synthetic abstract image and visual reasoning instruction using language model

W Zhang, Z Cheng, Y He, M Wang, Y Shen… - arXiv preprint arXiv …, 2024 - arxiv.org
Although most current large multimodal models (LMMs) can already understand photos of
natural scenes and portraits, their understanding of abstract images, e.g., charts, maps, or …

ET Bench: Towards open-ended event-level video-language understanding

Y Liu, Z Ma, Z Qi, Y Wu, Y Shan, CW Chen - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in Video Large Language Models (Video-LLMs) have demonstrated their
great potential in general-purpose video understanding. To verify the significance of these …