Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …
DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …
Insight-v: Exploring long-chain visual reasoning with multimodal large language models
Large Language Models (LLMs) demonstrate enhanced capabilities and reliability by
reasoning more, evolving from Chain-of-Thought prompting to product-level solutions like …
Ferret-UI 2: Mastering universal user interface understanding across platforms
Building a generalist model for user interface (UI) understanding is challenging due to
various foundational issues, such as platform diversity, resolution variation, and data …
Apollo: An exploration of video understanding in large multimodal models
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …
POINTS1.5: Building a Vision-Language Model towards Real World Applications
Y Liu, L Tian, X Zhou, X Gao, K Yu, Y Yu… - arXiv preprint arXiv:…, 2024 - arxiv.org
Vision-language models have made significant strides recently, demonstrating superior
performance across a range of tasks, e.g., optical character recognition and complex diagram …
PaliGemma 2: A Family of Versatile VLMs for Transfer
PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based
on the Gemma 2 family of language models. We combine the SigLIP-So400m vision …
DOGE: Towards Versatile Visual Document Grounding and Referring
In recent years, Multimodal Large Language Models (MLLMs) have increasingly
emphasized grounding and referring capabilities to achieve detailed understanding and …
Ocean-OCR: Towards General OCR Application via a Vision-Language Model
S Chen, X Guo, Y Li, T Zhang, M Lin, D Kuang… - arXiv preprint arXiv:…, 2025 - arxiv.org
Multimodal large language models (MLLMs) have shown impressive capabilities across
various domains, excelling in processing and understanding information from multiple …
From Interaction to Impact: Towards Safer AI Agents Through Understanding and Evaluating UI Operation Impacts
With advances in generative AI, there is increasing work towards creating autonomous
agents that can manage daily tasks by operating user interfaces (UIs). While prior research …