Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

Android in the zoo: Chain-of-action-thought for GUI agents

J Zhang, J Wu, Y Teng, M Liao, N Xu, X Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have led to a surge of autonomous GUI agents for smartphones,
which complete a task triggered by natural language by predicting a sequence of …

GUI agents with foundation models: A comprehensive survey

S Wang, W Liu, J Chen, Y Zhou, W Gan, X Zeng… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in foundation models, particularly Large Language Models (LLMs) and
Multimodal Large Language Models (MLLMs), have facilitated the development of intelligent …

Foundations and recent trends in multimodal mobile agents: A survey

B Wu, Y Li, M Fang, Z Song, Z Zhang, Y Wei… - arXiv preprint arXiv …, 2024 - arxiv.org
Mobile agents are essential for automating tasks in complex and dynamic mobile
environments. As foundation models evolve, the demands for agents that can adapt in real …

MMIU: Multimodal multi-image understanding for evaluating large vision-language models

F Meng, J Wang, C Li, Q Lu, H Tian, J Liao… - arXiv preprint arXiv …, 2024 - arxiv.org
The capability to process multiple images is crucial for Large Vision-Language Models
(LVLMs) to develop a more thorough and nuanced understanding of a scene. Recent multi …

Ferret-UI 2: Mastering universal user interface understanding across platforms

Z Li, K You, H Zhang, D Feng, H Agrawal, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building a generalist model for user interface (UI) understanding is challenging due to
various foundational issues, such as platform diversity, resolution variation, and data …

OS-Atlas: A foundation action model for generalist GUI agents

Z Wu, Z Wu, F Xu, Y Wang, Q Sun, C Jia… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing efforts in building GUI agents heavily rely on the availability of robust commercial
Vision-Language Models (VLMs) such as GPT-4o and Gemini Pro Vision. Practitioners are …

ShowUI: One vision-language-action model for generalist GUI agent

KQ Lin, L Li, D Gao, Z Yang, Z Bai, W Lei… - … 2024 Workshop on …, 2024 - openreview.net
Graphical User Interface (GUI) automation holds significant promise for enhancing human
productivity by assisting with digital tasks. While recent Large Language Models (LLMs) and …

Generalist virtual agents: A survey on autonomous agents across digital platforms

M Gao, W Bu, B Miao, Y Wu, Y Li, J Li, S Tang… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce the Generalist Virtual Agent (GVA), an autonomous entity
engineered to function across diverse digital platforms and environments, assisting users by …

ShowUI: One vision-language-action model for GUI visual agent

KQ Lin, L Li, D Gao, Z Yang, S Wu, Z Bai, W Lei… - arXiv preprint arXiv …, 2024 - arxiv.org
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing
human workflow productivity. While most agents are language-based, relying on closed …