Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding

Z Wu, X Chen, Z Pan, X Liu, W Liu, D Dai… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …

Insight-v: Exploring long-chain visual reasoning with multimodal large language models

Y Dong, Z Liu, HL Sun, J Yang, W Hu, Y Rao… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) demonstrate enhanced capabilities and reliability by
reasoning more, evolving from Chain-of-Thought prompting to product-level solutions like …

Ferret-ui 2: Mastering universal user interface understanding across platforms

Z Li, K You, H Zhang, D Feng, H Agrawal, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building a generalist model for user interface (UI) understanding is challenging due to
various foundational issues, such as platform diversity, resolution variation, and data …

Apollo: An exploration of video understanding in large multimodal models

O Zohar, X Wang, Y Dubois, N Mehta, T Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …

POINTS1.5: Building a Vision-Language Model towards Real World Applications

Y Liu, L Tian, X Zhou, X Gao, K Yu, Y Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language models have made significant strides recently, demonstrating superior
performance across a range of tasks, e.g., optical character recognition and complex diagram …

PaliGemma 2: A Family of Versatile VLMs for Transfer

A Steiner, AS Pinto, M Tschannen, D Keysers… - arXiv preprint arXiv …, 2024 - arxiv.org
PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based
on the Gemma 2 family of language models. We combine the SigLIP-So400m vision …

DOGE: Towards Versatile Visual Document Grounding and Referring

Y Zhou, Y Chen, H Lin, S Yang, L Zhu, Z Qi… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, Multimodal Large Language Models (MLLMs) have increasingly
emphasized grounding and referring capabilities to achieve detailed understanding and …

Ocean-OCR: Towards General OCR Application via a Vision-Language Model

S Chen, X Guo, Y Li, T Zhang, M Lin, D Kuang… - arXiv preprint arXiv …, 2025 - arxiv.org
Multimodal large language models (MLLMs) have shown impressive capabilities across
various domains, excelling in processing and understanding information from multiple …

From interaction to impact: Towards safer ai agents through understanding and evaluating ui operation impacts

ZJ Zhang, E Schoop, J Nichols, A Mahajan… - arXiv preprint arXiv …, 2024 - arxiv.org
With advances in generative AI, there is increasing work towards creating autonomous
agents that can manage daily tasks by operating user interfaces (UIs). While prior research …