Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

Android in the zoo: Chain-of-action-thought for GUI agents

J Zhang, J Wu, Y Teng, M Liao, N Xu, X Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have led to a surge of autonomous GUI agents for smartphones,
which complete a task triggered by natural language by predicting a sequence of …

GUI agents with foundation models: A comprehensive survey

S Wang, W Liu, J Chen, Y Zhou, W Gan, X Zeng… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in foundation models, particularly Large Language Models (LLMs) and
Multimodal Large Language Models (MLLMs), have facilitated the development of intelligent …

Foundations and recent trends in multimodal mobile agents: A survey

B Wu, Y Li, M Fang, Z Song, Z Zhang, Y Wei… - arXiv preprint arXiv …, 2024 - arxiv.org
Mobile agents are essential for automating tasks in complex and dynamic mobile
environments. As foundation models evolve, the demands for agents that can adapt in real …

MMIU: Multimodal multi-image understanding for evaluating large vision-language models

F Meng, J Wang, C Li, Q Lu, H Tian, J Liao… - arXiv preprint arXiv …, 2024 - arxiv.org
The capability to process multiple images is crucial for Large Vision-Language Models
(LVLMs) to develop a more thorough and nuanced understanding of a scene. Recent multi …

Ferret-UI 2: Mastering universal user interface understanding across platforms

Z Li, K You, H Zhang, D Feng, H Agrawal, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building a generalist model for user interface (UI) understanding is challenging due to
various foundational issues, such as platform diversity, resolution variation, and data …

OS-Atlas: A foundation action model for generalist GUI agents

Z Wu, Z Wu, F Xu, Y Wang, Q Sun, C Jia… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing efforts in building GUI agents heavily rely on the availability of robust commercial
Vision-Language Models (VLMs) such as GPT-4o and Gemini Pro Vision. Practitioners are …

ShowUI: One vision-language-action model for generalist GUI agent

KQ Lin, L Li, D Gao, Z Yang, Z Bai, W Lei… - … 2024 Workshop on …, 2024 - openreview.net
Graphical User Interface (GUI) automation holds significant promise for enhancing human
productivity by assisting with digital tasks. While recent Large Language Models (LLMs) and …

Generalist virtual agents: A survey on autonomous agents across digital platforms

M Gao, W Bu, B Miao, Y Wu, Y Li, J Li, S Tang… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce the Generalist Virtual Agent (GVA), an autonomous entity
engineered to function across diverse digital platforms and environments, assisting users by …

ShowUI: One vision-language-action model for GUI visual agent

KQ Lin, L Li, D Gao, Z Yang, S Wu, Z Bai, W Lei… - arXiv preprint arXiv …, 2024 - arxiv.org
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing
human workflow productivity. While most agents are language-based, relying on closed …