- Academic Search

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arxiv preprint arxiv …, 2024 - arxiv.org

We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

Cited by 35

Scaling laws for precision

T Kumar, Z Ankner, BF Spector, B Bordelon… - arxiv preprint arxiv …, 2024 - arxiv.org

Low precision training and inference affect both the quality and cost of language models, but
current scaling laws do not account for this. In this work, we devise "precision-aware" scaling …

Cited by 17

DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding

Z Wu, X Chen, Z Pan, X Liu, W Liu, D Dai… - arxiv preprint arxiv …, 2024 - arxiv.org

We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …

Cited by 10

NaturalBench: Evaluating vision-language models on natural adversarial samples

B Li, Z Lin, W Peng, JD Nyandwi, D Jiang, Z Ma… - arxiv preprint arxiv …, 2024 - arxiv.org

Vision-language models (VLMs) have made significant progress in recent visual-question-
answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However …

Cited by 8

MEGA-Bench: Scaling multimodal evaluation to over 500 real-world tasks

J Chen, T Liang, S Siu, Z Wang, K Wang… - arxiv preprint arxiv …, 2024 - arxiv.org

We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500
real-world tasks, to address the highly heterogeneous daily use cases of end users. Our …

Cited by 3

Leopard: A vision language model for text-rich multi-image tasks

M Jia, W Yu, K Ma, T Fang, Z Zhang, S Ouyang… - arxiv preprint arxiv …, 2024 - arxiv.org

Text-rich images, where text serves as the central visual element guiding the overall
understanding, are prevalent in real-world applications, such as presentation slides …

Cited by 3

WorldCuisines: A massive-scale benchmark for multilingual and multicultural visual question answering on global cuisines

GI Winata, F Hudi, PA Irawan, D Anugraha… - arxiv preprint arxiv …, 2024 - arxiv.org

Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly
in languages other than English and in underrepresented cultural contexts. To evaluate their …

Cited by 3

Can foundation models actively gather information in interactive environments to test hypotheses?

NR Ke, DP Sawyer, H Soyer, M Engelcke… - arxiv preprint arxiv …, 2024 - arxiv.org

While problem solving is a standard evaluation task for foundation models, a crucial
component of problem solving--actively and strategically gathering information to test …

Cited by 2

VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

L Li, Y Wei, Z Xie, X Yang, Y Song, P Wang… - arxiv preprint arxiv …, 2024 - arxiv.org

Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and
evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current …

Cited by 2

What is missing in multilingual visual reasoning and how to fix it

Y Song, S Khanuja, G Neubig - arxiv preprint arxiv:2403.01404, 2024 - arxiv.org

NLP models today strive for supporting multiple languages and modalities, improving
accessibility for diverse users. In this paper, we evaluate their multilingual, multimodal …

Cited by 2

Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models
