Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv:…, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

Foundations and recent trends in multimodal mobile agents: A survey

B Wu, Y Li, M Fang, Z Song, Z Zhang, Y Wei… - arXiv preprint arXiv:…, 2024 - arxiv.org
Mobile agents are essential for automating tasks in complex and dynamic mobile
environments. As foundation models evolve, the demands for agents that can adapt in real …

Ferret-UI 2: Mastering universal user interface understanding across platforms

Z Li, K You, H Zhang, D Feng, H Agrawal, X Li… - arXiv preprint arXiv:…, 2024 - arxiv.org
Building a generalist model for user interface (UI) understanding is challenging due to
various foundational issues, such as platform diversity, resolution variation, and data …

ShowUI: One vision-language-action model for generalist GUI agent

KQ Lin, L Li, D Gao, Z Yang, Z Bai, W Lei… - … 2024 Workshop on …, 2024 - openreview.net
Graphical User Interface (GUI) automation holds significant promise for enhancing human
productivity by assisting with digital tasks. While recent Large Language Models (LLMs) and …

OS-Atlas: A foundation action model for generalist GUI agents

Z Wu, Z Wu, F Xu, Y Wang, Q Sun, C Jia… - arXiv preprint arXiv:…, 2024 - arxiv.org
Existing efforts in building GUI agents heavily rely on the availability of robust commercial
Vision-Language Models (VLMs) such as GPT-4o and Gemini Pro Vision. Practitioners are …

LLM-powered GUI agents in phone automation: Surveying progress and prospects

W Liu, L Liu, Y Guo, H Xiao, W Lin, Y Chai, S Ren… - 2025 - preprints.org
With the rapid rise of large language models (LLMs), phone automation has undergone
transformative changes. This paper systematically reviews LLM-driven phone GUI agents …

ShowUI: One vision-language-action model for GUI visual agent

KQ Lin, L Li, D Gao, Z Yang, S Wu, Z Bai, W Lei… - arXiv preprint arXiv:…, 2024 - arxiv.org
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing
human workflow productivity. While most agents are language-based, relying on closed …

Inferring Alt-text For UI Icons With Large Language Models During App Development

S Haque, C Csallner - arXiv preprint arXiv:2409.18060, 2024 - arxiv.org
Ensuring accessibility in mobile applications remains a significant challenge, particularly for
visually impaired users who rely on screen readers. User interface icons are essential for …

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Y Xu, Z Wang, J Wang, D Lu, T Xie, A Saha… - arXiv preprint arXiv:…, 2024 - arxiv.org
Graphical User Interfaces (GUIs) are critical to human-computer interaction, yet automating
GUI tasks remains challenging due to the complexity and variability of visual environments …

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection

Y Liu, P Li, Z Wei, C Xie, X Hu, X Xu, S Zhang… - arXiv preprint arXiv:…, 2025 - arxiv.org
Graphical User Interface (GUI) Agents, powered by multimodal large language models
(MLLMs), have shown great potential for task automation on computing devices such as …