Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

DeepSeek-VL2: Mixture-of-Experts vision-language models for advanced multimodal understanding

Z Wu, X Chen, Z Pan, X Liu, W Liu, D Dai… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …

Scaling laws for precision

T Kumar, Z Ankner, BF Spector, B Bordelon… - arXiv preprint arXiv …, 2024 - arxiv.org
Low precision training and inference affect both the quality and cost of language models, but
current scaling laws do not account for this. In this work, we devise "precision-aware" scaling …

NaturalBench: Evaluating vision-language models on natural adversarial samples

B Li, Z Lin, W Peng, JD Nyandwi, D Jiang, Z Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language models (VLMs) have made significant progress in recent visual-question-
answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However …

MEGA-Bench: Scaling multimodal evaluation to over 500 real-world tasks

J Chen, T Liang, S Siu, Z Wang, K Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500
real-world tasks, to address the highly heterogeneous daily use cases of end users. Our …

Leopard: A vision language model for text-rich multi-image tasks

M Jia, W Yu, K Ma, T Fang, Z Zhang, S Ouyang… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-rich images, where text serves as the central visual element guiding the overall
understanding, are prevalent in real-world applications, such as presentation slides …

Can foundation models actively gather information in interactive environments to test hypotheses?

NR Ke, DP Sawyer, H Soyer, M Engelcke… - arXiv preprint arXiv …, 2024 - arxiv.org
While problem solving is a standard evaluation task for foundation models, a crucial
component of problem solving--actively and strategically gathering information to test …

WorldCuisines: A massive-scale benchmark for multilingual and multicultural visual question answering on global cuisines

GI Winata, F Hudi, PA Irawan, D Anugraha… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly
in languages other than English and in underrepresented cultural contexts. To evaluate their …

VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

L Li, Y Wei, Z Xie, X Yang, Y Song, P Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and
evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current …

Probing Visual Language Priors in VLMs

T Luo, A Cao, G Lee, J Johnson, H Lee - arXiv preprint arXiv:2501.00569, 2025 - arxiv.org
Despite recent advances in Vision-Language Models (VLMs), many still over-rely on visual
language priors present in their training data rather than true visual reasoning. To examine …