Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

DeepSeek-VL2: Mixture-of-Experts vision-language models for advanced multimodal understanding

Z Wu, X Chen, Z Pan, X Liu, W Liu, D Dai… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …

Scaling laws for precision

T Kumar, Z Ankner, BF Spector, B Bordelon… - arXiv preprint arXiv …, 2024 - arxiv.org
Low precision training and inference affect both the quality and cost of language models, but
current scaling laws do not account for this. In this work, we devise "precision-aware" scaling …

NaturalBench: Evaluating vision-language models on natural adversarial samples

B Li, Z Lin, W Peng, JD Nyandwi, D Jiang, Z Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language models (VLMs) have made significant progress in recent visual-question-
answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However …

MEGA-Bench: Scaling multimodal evaluation to over 500 real-world tasks

J Chen, T Liang, S Siu, Z Wang, K Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500
real-world tasks, to address the highly heterogeneous daily use cases of end users. Our …

Leopard: A vision language model for text-rich multi-image tasks

M Jia, W Yu, K Ma, T Fang, Z Zhang, S Ouyang… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-rich images, where text serves as the central visual element guiding the overall
understanding, are prevalent in real-world applications, such as presentation slides …

Can foundation models actively gather information in interactive environments to test hypotheses?

NR Ke, DP Sawyer, H Soyer, M Engelcke… - arXiv preprint arXiv …, 2024 - arxiv.org
While problem solving is a standard evaluation task for foundation models, a crucial
component of problem solving--actively and strategically gathering information to test …

WorldCuisines: A massive-scale benchmark for multilingual and multicultural visual question answering on global cuisines

GI Winata, F Hudi, PA Irawan, D Anugraha… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly
in languages other than English and in underrepresented cultural contexts. To evaluate their …

VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

L Li, Y Wei, Z Xie, X Yang, Y Song, P Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and
evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current …

Probing Visual Language Priors in VLMs

T Luo, A Cao, G Lee, J Johnson, H Lee - arXiv preprint arXiv:2501.00569, 2025 - arxiv.org
Despite recent advances in Vision-Language Models (VLMs), many still over-rely on visual
language priors present in their training data rather than true visual reasoning. To examine …