Navigating the digital world as humans do: Universal visual grounding for GUI agents

B Gou, R Wang, B Zheng, Y Xie, C Chang… - arxiv preprint arxiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) are transforming the capabilities of graphical
user interface (GUI) agents, facilitating their transition from controlled simulations to …
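The snippet above points to visual grounding for GUI agents, i.e. mapping a natural-language target to pixel coordinates on a screenshot. A minimal, self-contained sketch of that interface follows; the element table and the string-matching rule are hypothetical stand-ins for a real grounding model that would read the screenshot pixels.

    # Toy illustration of the grounding interface a GUI agent needs:
    # instruction -> (x, y) click coordinates on the current screen.
    # The element list and string matching stand in for a real visual
    # grounding model.

    from typing import Optional, Tuple

    # Hypothetical screen state: label -> bounding box (left, top, right, bottom).
    SCREEN_ELEMENTS = {
        "search box": (100, 40, 500, 70),
        "submit button": (520, 40, 600, 70),
        "settings icon": (760, 10, 790, 40),
    }

    def ground(instruction: str) -> Optional[Tuple[int, int]]:
        """Return a click point for the element the instruction refers to."""
        for label, (l, t, r, b) in SCREEN_ELEMENTS.items():
            if label in instruction.lower():
                return ((l + r) // 2, (t + b) // 2)  # click the element's center
        return None  # grounding failed; a real agent would re-observe or ask

    if __name__ == "__main__":
        print(ground("Click the submit button to send the form"))  # (560, 55)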

VisRAG: Vision-based retrieval-augmented generation on multi-modality documents

S Yu, C Tang, B Xu, J Cui, J Ran, Y Yan, Z Liu… - arxiv preprint arxiv …, 2024 - arxiv.org
Retrieval-augmented generation (RAG) is an effective technique that enables large
language models (LLMs) to utilize external knowledge sources for generation. However …
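Since the snippet describes the basic RAG pattern (retrieve external knowledge, then generate), a minimal sketch is shown below. It uses a toy bag-of-words cosine similarity over text snippets; VisRAG itself retrieves over page images with a vision-language embedder, which is not reproduced here.

    # Minimal retrieval-augmented generation skeleton (toy version):
    # 1) score the query against a small corpus, 2) keep the top-k hits,
    # 3) assemble them into a prompt for a generator. Bag-of-words cosine
    # stands in for learned (multimodal) embeddings.

    import math
    from collections import Counter

    CORPUS = [
        "Invoice total is due within 30 days of the issue date.",
        "The quarterly report shows revenue growth in the APAC region.",
        "Refunds are processed to the original payment method.",
    ]

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def retrieve(query: str, k: int = 2):
        q = Counter(query.lower().split())
        scored = [(cosine(q, Counter(doc.lower().split())), doc) for doc in CORPUS]
        return [doc for _, doc in sorted(scored, reverse=True)[:k]]

    if __name__ == "__main__":
        question = "When is the invoice total due?"
        context = "\n".join(retrieve(question))
        prompt = f"Answer using only the context.\nContext:\n{context}\nQuestion: {question}"
        print(prompt)  # this prompt would be passed to the generator model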

ShowUI: One vision-language-action model for generalist GUI agent

KQ Lin, L Li, D Gao, Z Yang, Z Bai, W Lei… - … 2024 Workshop on …, 2024 - openreview.net
Graphical User Interface (GUI) automation holds significant promise for enhancing human
productivity by assisting with digital tasks. While recent Large Language Models (LLMs) and …

OS-Atlas: A foundation action model for generalist GUI agents

Z Wu, Z Wu, F Xu, Y Wang, Q Sun, C Jia… - arxiv preprint arxiv …, 2024 - arxiv.org
Existing efforts in building GUI agents heavily rely on the availability of robust commercial
Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are …

Aligning to thousands of preferences via system message generalization

S Lee, SH Park, S Kim, M Seo - arxiv preprint arxiv:2405.17977, 2024 - arxiv.org
Although humans inherently have diverse values, current large language model (LLM)
alignment methods often assume that aligning LLMs with the general public's preferences is …
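The core idea named in the title is conditioning a single model on diverse user preferences through the system message rather than through per-user fine-tuning. Below is a small sketch of that interface; the preference fields and the template wording are illustrative, not the paper's prompt format.

    # Sketch: render a user's preference profile into a system message so one
    # model can be steered per user. Field names and template are illustrative.

    def preferences_to_system_message(prefs: dict) -> str:
        lines = ["You are an assistant. Follow the user's stated preferences:"]
        for key, value in prefs.items():
            lines.append(f"- {key}: {value}")
        return "\n".join(lines)

    if __name__ == "__main__":
        prefs = {
            "tone": "concise and formal",
            "values": "prioritize safety over speed",
            "format": "answer in bullet points",
        }
        messages = [
            {"role": "system", "content": preferences_to_system_message(prefs)},
            {"role": "user", "content": "How should I roll out the database migration?"},
        ]
        for m in messages:
            print(m["role"].upper(), "\n", m["content"], "\n", sep="")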

OSCAR: Operating system control via state-aware reasoning and re-planning

X Wang, B Liu - arxiv preprint arxiv:2410.18963, 2024 - arxiv.org
Large language models (LLMs) and large multimodal models (LMMs) have shown great
potential in automating complex tasks like web browsing and gaming. However, their ability …
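The title names state-aware reasoning and re-planning; a generic observe-act-verify loop of the kind such agents use is sketched below. The plan, the simulated state, and the verification rule are placeholders meant only to show the control flow, not OSCAR's actual components.

    # Generic observe -> act -> verify loop with re-planning on failure.

    def make_plan(goal: str) -> list:
        return ["open terminal", f"run command for: {goal}", "check output"]

    def execute(step: str, state: dict) -> dict:
        state = dict(state)
        state["last_step"] = step
        state["ok"] = "flaky" not in step  # placeholder success check
        return state

    def run_agent(goal: str, max_replans: int = 2) -> bool:
        plan, state = make_plan(goal), {"ok": True}
        for _attempt in range(max_replans + 1):
            for step in plan:
                state = execute(step, state)
                if not state["ok"]:          # state-aware check after each action
                    plan = make_plan(goal)   # re-plan from the observed state
                    break
            else:
                return True                  # all steps verified
        return False

    if __name__ == "__main__":
        print(run_agent("install the text editor"))  # True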

Re-Invoke: Tool invocation rewriting for zero-shot tool retrieval

Y Chen, J Yoon, DS Sachan, Q Wang… - arxiv preprint arxiv …, 2024 - arxiv.org
Recent advances in large language models (LLMs) have enabled autonomous agents with
complex reasoning and task-fulfillment capabilities using a wide range of tools. However …
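This entry is about retrieving the right tool for a query without task-specific training. A generic zero-shot tool-retrieval sketch follows: a stub rewriter reduces the query to a short intent, then tools are ranked by token overlap with their descriptions. Both pieces are stand-ins, not Re-Invoke's actual rewriting or retrieval components.

    # Generic zero-shot tool retrieval: reduce the user query to an intent
    # (a trivial keyword filter standing in for an LLM rewriter), then rank
    # tool descriptions by token overlap. A sketch only.

    TOOLS = {
        "get_weather": "Return the current weather forecast for a given city.",
        "send_email": "Send an email message to a recipient with a subject and body.",
        "search_flights": "Search available flights between two airports on a date.",
    }

    STOPWORDS = {"i", "need", "to", "the", "a", "for", "please", "can", "you"}

    def rewrite(query: str) -> set:
        # Stand-in for an LLM rewriting the query into a tool-oriented intent.
        return {t for t in query.lower().split() if t not in STOPWORDS}

    def rank_tools(query: str):
        intent = rewrite(query)
        scores = {name: len(intent & set(desc.lower().split()))
                  for name, desc in TOOLS.items()}
        return sorted(scores, key=scores.get, reverse=True)

    if __name__ == "__main__":
        print(rank_tools("I need to send an email to the finance team"))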

Learning to ask: When LLMs meet unclear instruction

W Wang, J Shi, C Wang, C Lee, Y Yuan… - arxiv preprint arxiv …, 2024 - arxiv.org
Equipped with the capability to call functions, modern large language models (LLMs) can
leverage external tools for addressing a range of tasks unattainable through language skills …
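The snippet describes LLM function calling; the paper's point is deciding when an instruction is too unclear to call a tool and asking the user instead. A small heuristic sketch of that decision is shown below (missing required arguments trigger a clarifying question); it illustrates the behavior, not the paper's learned method, and the schema and call are made-up examples.

    # Heuristic sketch: before executing a proposed tool call, check it against
    # the tool's schema; if required arguments are missing, ask a clarifying
    # question instead of calling the tool.

    TOOL_SCHEMA = {
        "name": "book_meeting_room",
        "required": ["room", "date", "start_time", "duration_minutes"],
    }

    def dispatch(proposed_call: dict) -> dict:
        missing = [p for p in TOOL_SCHEMA["required"]
                   if p not in proposed_call["arguments"]]
        if missing:
            return {"action": "ask_user",
                    "question": f"Could you provide: {', '.join(missing)}?"}
        return {"action": "call_tool", "call": proposed_call}

    if __name__ == "__main__":
        # The model parsed "book a room tomorrow", but the instruction was vague.
        vague = {"name": "book_meeting_room", "arguments": {"date": "tomorrow"}}
        print(dispatch(vague))  # asks for room, start_time, duration_minutes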

MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding

F Zhu, Z Liu, XY Ng, H Wu, W Wang, F Feng… - arxiv preprint arxiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) have achieved remarkable performance in many
vision-language tasks, yet their capabilities in fine-grained visual understanding remain …

AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants

PJ Sager, B Meyer, P Yan… - arxiv preprint arxiv …, 2025 - arxiv.org
Instruction-based computer control agents (CCAs) execute complex action sequences on
personal computers or mobile devices to fulfill tasks using the same graphical user …