Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding

Z Wu, X Chen, Z Pan, X Liu, W Liu, D Dai… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …

Insight-v: Exploring long-chain visual reasoning with multimodal large language models

Y Dong, Z Liu, HL Sun, J Yang, W Hu, Y Rao… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) demonstrate enhanced capabilities and reliability by
reasoning more, evolving from Chain-of-Thought prompting to product-level solutions like …

Ferret-ui 2: Mastering universal user interface understanding across platforms

Z Li, K You, H Zhang, D Feng, H Agrawal, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building a generalist model for user interface (UI) understanding is challenging due to
various foundational issues, such as platform diversity, resolution variation, and data …

Apollo: An exploration of video understanding in large multimodal models

O Zohar, X Wang, Y Dubois, N Mehta, T Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …

POINTS1.5: Building a Vision-Language Model towards Real World Applications

Y Liu, L Tian, X Zhou, X Gao, K Yu, Y Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language models have made significant strides recently, demonstrating superior
performance across a range of tasks, e.g., optical character recognition and complex diagram …

PaliGemma 2: A Family of Versatile VLMs for Transfer

A Steiner, AS Pinto, M Tschannen, D Keysers… - arXiv preprint arXiv …, 2024 - arxiv.org
PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based
on the Gemma 2 family of language models. We combine the SigLIP-So400m vision …

DOGE: Towards Versatile Visual Document Grounding and Referring

Y Zhou, Y Chen, H Lin, S Yang, L Zhu, Z Qi… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, Multimodal Large Language Models (MLLMs) have increasingly
emphasized grounding and referring capabilities to achieve detailed understanding and …

Ocean-OCR: Towards General OCR Application via a Vision-Language Model

S Chen, X Guo, Y Li, T Zhang, M Lin, D Kuang… - arXiv preprint arXiv …, 2025 - arxiv.org
Multimodal large language models (MLLMs) have shown impressive capabilities across
various domains, excelling in processing and understanding information from multiple …

From interaction to impact: Towards safer ai agents through understanding and evaluating ui operation impacts

ZJ Zhang, E Schoop, J Nichols, A Mahajan… - arXiv preprint arXiv …, 2024 - arxiv.org
With advances in generative AI, there is increasing work towards creating autonomous
agents that can manage daily tasks by operating user interfaces (UIs). While prior research …