Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv:…, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

Foundations and recent trends in multimodal mobile agents: A survey

B Wu, Y Li, M Fang, Z Song, Z Zhang, Y Wei… - arXiv preprint arXiv:…, 2024 - arxiv.org
Mobile agents are essential for automating tasks in complex and dynamic mobile
environments. As foundation models evolve, the demands for agents that can adapt in real …

Ferret-UI 2: Mastering universal user interface understanding across platforms

Z Li, K You, H Zhang, D Feng, H Agrawal, X Li… - arXiv preprint arXiv:…, 2024 - arxiv.org
Building a generalist model for user interface (UI) understanding is challenging due to
various foundational issues, such as platform diversity, resolution variation, and data …

ShowUI: One vision-language-action model for generalist GUI agent

KQ Lin, L Li, D Gao, Z Yang, Z Bai, W Lei… - … 2024 Workshop on …, 2024 - openreview.net
Graphical User Interface (GUI) automation holds significant promise for enhancing human
productivity by assisting with digital tasks. While recent Large Language Models (LLMs) and …

OS-Atlas: A foundation action model for generalist GUI agents

Z Wu, Z Wu, F Xu, Y Wang, Q Sun, C Jia… - arXiv preprint arXiv:…, 2024 - arxiv.org
Existing efforts in building GUI agents heavily rely on the availability of robust commercial
Vision-Language Models (VLMs) such as GPT-4o and Gemini Pro Vision. Practitioners are …

LLM-powered GUI agents in phone automation: Surveying progress and prospects

W Liu, L Liu, Y Guo, H Xiao, W Lin, Y Chai, S Ren… - 2025 - preprints.org
With the rapid rise of large language models (LLMs), phone automation has undergone
transformative changes. This paper systematically reviews LLM-driven phone GUI agents …

ShowUI: One vision-language-action model for GUI visual agent

KQ Lin, L Li, D Gao, Z Yang, S Wu, Z Bai, W Lei… - arXiv preprint arXiv:…, 2024 - arxiv.org
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing
human workflow productivity. While most agents are language-based, relying on closed …

Inferring Alt-text For UI Icons With Large Language Models During App Development

S Haque, C Csallner - arXiv preprint arXiv:2409.18060, 2024 - arxiv.org
Ensuring accessibility in mobile applications remains a significant challenge, particularly for
visually impaired users who rely on screen readers. User interface icons are essential for …

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Y Xu, Z Wang, J Wang, D Lu, T Xie, A Saha… - arXiv preprint arXiv:…, 2024 - arxiv.org
Graphical User Interfaces (GUIs) are critical to human-computer interaction, yet automating
GUI tasks remains challenging due to the complexity and variability of visual environments …

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection

Y Liu, P Li, Z Wei, C Xie, X Hu, X Xu, S Zhang… - arXiv preprint arXiv:…, 2025 - arxiv.org
Graphical User Interface (GUI) Agents, powered by multimodal large language models
(MLLMs), have shown great potential for task automation on computing devices such as …