Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …
DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …
Scaling laws for precision
Low precision training and inference affect both the quality and cost of language models, but
current scaling laws do not account for this. In this work, we devise "precision-aware" scaling …
NaturalBench: Evaluating vision-language models on natural adversarial samples
Vision-language models (VLMs) have made significant progress in recent visual-question-
answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However …
MEGA-Bench: Scaling multimodal evaluation to over 500 real-world tasks
We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500
real-world tasks, to address the highly heterogeneous daily use cases of end users. Our …
Leopard: A vision language model for text-rich multi-image tasks
Text-rich images, where text serves as the central visual element guiding the overall
understanding, are prevalent in real-world applications, such as presentation slides …
Can foundation models actively gather information in interactive environments to test hypotheses?
While problem solving is a standard evaluation task for foundation models, a crucial
component of problem solving--actively and strategically gathering information to test …
WorldCuisines: A massive-scale benchmark for multilingual and multicultural visual question answering on global cuisines
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly
in languages other than English and in underrepresented cultural contexts. To evaluate their …
VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and
evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current …
Probing Visual Language Priors in VLMs
Despite recent advances in Vision-Language Models (VLMs), many still over-rely on visual
language priors present in their training data rather than true visual reasoning. To examine …