DeepSeek-VL: Towards Real-World Vision-Language Understanding

H Lu, W Liu, B Zhang, B Wang, K Dong, B Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-
world vision and language understanding applications. Our approach is structured around …

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

C Fu, YF Zhang, S Yin, B Li, X Fang, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …

A Survey on Evaluation of Multimodal Large Language Models

J Huang, J Zhang - arXiv preprint arXiv:2408.15769, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning
system by integrating powerful Large Language Models (LLMs) with various modality …

A Survey on Multimodal Benchmarks: In the Era of Large AI Models

L Li, G Chen, H Shi, J Xiao, L Chen - arXiv preprint arXiv:2409.18142, 2024 - arxiv.org
The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial
advancements in artificial intelligence, significantly enhancing the capability to understand …

FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers

Z Pei, HL Zhen, X Yu, SJ Pan, M Yuan, B Yu - arXiv preprint arXiv …, 2024 - arxiv.org
Generative Pre-trained Transformers (GPTs) have demonstrated remarkable performance
across diverse domains through the extensive scaling of model parameters. Recent works …

Training on the Benchmark Is Not All You Need

S Ni, X Kong, C Li, X Hu, R Xu, J Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
The success of Large Language Models (LLMs) relies heavily on the vast amount of data
learned during the pre-training phase. The opacity of the pre-training process and …

Multi-Label Cluster Discrimination for Visual Representation Learning

X An, K Yang, X Dai, Z Feng, J Deng - European Conference on Computer …, 2024 - Springer
Contrastive Language-Image Pre-training (CLIP) has recently demonstrated
success across various tasks due to superior feature representation empowered by image …

Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

Y Yan, S Wang, J Huo, J Ye, Z Chu, X Hu… - arXiv preprint arXiv …, 2025 - arxiv.org
Scientific reasoning, the process through which humans apply logic, evidence, and critical
thinking to explore and interpret scientific phenomena, is essential in advancing knowledge …