Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …
Foundations and recent trends in multimodal mobile agents: A survey
Mobile agents are essential for automating tasks in complex and dynamic mobile
environments. As foundation models evolve, the demands for agents that can adapt in real …
Ferret-UI 2: Mastering universal user interface understanding across platforms
Building a generalist model for user interface (UI) understanding is challenging due to
various foundational issues, such as platform diversity, resolution variation, and data …
ShowUI: One vision-language-action model for generalist GUI agent
Graphical User Interface (GUI) automation holds significant promise for enhancing human
productivity by assisting with digital tasks. While recent Large Language Models (LLMs) and …
OS-ATLAS: A foundation action model for generalist GUI agents
Existing efforts in building GUI agents heavily rely on the availability of robust commercial
Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are …
LLM-powered GUI agents in phone automation: Surveying progress and prospects
With the rapid rise of large language models (LLMs), phone automation has undergone
transformative changes. This paper systematically reviews LLM-driven phone GUI agents …
ShowUI: One vision-language-action model for GUI visual agent
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing
human workflow productivity. While most agents are language-based, relying on closed …
Inferring Alt-text For UI Icons With Large Language Models During App Development
Ensuring accessibility in mobile applications remains a significant challenge, particularly for
visually impaired users who rely on screen readers. User interface icons are essential for …
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Graphical User Interfaces (GUIs) are critical to human-computer interaction, yet automating
GUI tasks remains challenging due to the complexity and variability of visual environments …
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
Graphical User Interface (GUI) Agents, powered by multimodal large language models
(MLLMs), have shown great potential for task automation on computing devices such as …