Navigating the digital world as humans do: Universal visual grounding for GUI agents

B Gou, R Wang, B Zheng, Y Xie, C Chang… - arxiv preprint arxiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) are transforming the capabilities of graphical
user interface (GUI) agents, facilitating their transition from controlled simulations to …
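The snippet above points to visual grounding for GUI agents, i.e. mapping a natural-language target to pixel coordinates on a screenshot. A minimal, self-contained sketch of that interface follows; the element table and the string-matching rule are hypothetical stand-ins for a real grounding model that would read the screenshot pixels.

    # Toy illustration of the grounding interface a GUI agent needs:
    # instruction -> (x, y) click coordinates on the current screen.
    # The element list and string matching stand in for a real visual
    # grounding model.

    from typing import Optional, Tuple

    # Hypothetical screen state: label -> bounding box (left, top, right, bottom).
    SCREEN_ELEMENTS = {
        "search box": (100, 40, 500, 70),
        "submit button": (520, 40, 600, 70),
        "settings icon": (760, 10, 790, 40),
    }

    def ground(instruction: str) -> Optional[Tuple[int, int]]:
        """Return a click point for the element the instruction refers to."""
        for label, (l, t, r, b) in SCREEN_ELEMENTS.items():
            if label in instruction.lower():
                return ((l + r) // 2, (t + b) // 2)  # click the element's center
        return None  # grounding failed; a real agent would re-observe or ask

    if __name__ == "__main__":
        print(ground("Click the submit button to send the form"))  # (560, 55)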

VisRAG: Vision-based retrieval-augmented generation on multi-modality documents

S Yu, C Tang, B Xu, J Cui, J Ran, Y Yan, Z Liu… - arxiv preprint arxiv …, 2024 - arxiv.org
Retrieval-augmented generation (RAG) is an effective technique that enables large
language models (LLMs) to utilize external knowledge sources for generation. However …
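Since the snippet describes the basic RAG pattern (retrieve external knowledge, then generate), a minimal sketch is shown below. It uses a toy bag-of-words cosine similarity over text snippets; VisRAG itself retrieves over page images with a vision-language embedder, which is not reproduced here.

    # Minimal retrieval-augmented generation skeleton (toy version):
    # 1) score the query against a small corpus, 2) keep the top-k hits,
    # 3) assemble them into a prompt for a generator. Bag-of-words cosine
    # stands in for learned (multimodal) embeddings.

    import math
    from collections import Counter

    CORPUS = [
        "Invoice total is due within 30 days of the issue date.",
        "The quarterly report shows revenue growth in the APAC region.",
        "Refunds are processed to the original payment method.",
    ]

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def retrieve(query: str, k: int = 2):
        q = Counter(query.lower().split())
        scored = [(cosine(q, Counter(doc.lower().split())), doc) for doc in CORPUS]
        return [doc for _, doc in sorted(scored, reverse=True)[:k]]

    if __name__ == "__main__":
        question = "When is the invoice total due?"
        context = "\n".join(retrieve(question))
        prompt = f"Answer using only the context.\nContext:\n{context}\nQuestion: {question}"
        print(prompt)  # this prompt would be passed to the generator model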

ShowUI: One vision-language-action model for generalist GUI agent

KQ Lin, L Li, D Gao, Z Yang, Z Bai, W Lei… - … 2024 Workshop on …, 2024 - openreview.net
Graphical User Interface (GUI) automation holds significant promise for enhancing human
productivity by assisting with digital tasks. While recent Large Language Models (LLMs) and …

OS-Atlas: A foundation action model for generalist GUI agents

Z Wu, Z Wu, F Xu, Y Wang, Q Sun, C Jia… - arxiv preprint arxiv …, 2024 - arxiv.org
Existing efforts in building GUI agents heavily rely on the availability of robust commercial
Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are …

Aligning to thousands of preferences via system message generalization

S Lee, SH Park, S Kim, M Seo - arxiv preprint arxiv:2405.17977, 2024 - arxiv.org
Although humans inherently have diverse values, current large language model (LLM)
alignment methods often assume that aligning LLMs with the general public's preferences is …
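The core idea named in the title is conditioning a single model on diverse user preferences through the system message rather than through per-user fine-tuning. Below is a small sketch of that interface; the preference fields and the template wording are illustrative, not the paper's prompt format.

    # Sketch: render a user's preference profile into a system message so one
    # model can be steered per user. Field names and template are illustrative.

    def preferences_to_system_message(prefs: dict) -> str:
        lines = ["You are an assistant. Follow the user's stated preferences:"]
        for key, value in prefs.items():
            lines.append(f"- {key}: {value}")
        return "\n".join(lines)

    if __name__ == "__main__":
        prefs = {
            "tone": "concise and formal",
            "values": "prioritize safety over speed",
            "format": "answer in bullet points",
        }
        messages = [
            {"role": "system", "content": preferences_to_system_message(prefs)},
            {"role": "user", "content": "How should I roll out the database migration?"},
        ]
        for m in messages:
            print(m["role"].upper(), "\n", m["content"], "\n", sep="")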

OSCAR: Operating system control via state-aware reasoning and re-planning

X Wang, B Liu - arxiv preprint arxiv:2410.18963, 2024 - arxiv.org
Large language models (LLMs) and large multimodal models (LMMs) have shown great
potential in automating complex tasks like web browsing and gaming. However, their ability …
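The title names state-aware reasoning and re-planning; a generic observe-act-verify loop of the kind such agents use is sketched below. The plan, the simulated state, and the verification rule are placeholders meant only to show the control flow, not OSCAR's actual components.

    # Generic observe -> act -> verify loop with re-planning on failure.

    def make_plan(goal: str) -> list:
        return ["open terminal", f"run command for: {goal}", "check output"]

    def execute(step: str, state: dict) -> dict:
        state = dict(state)
        state["last_step"] = step
        state["ok"] = "flaky" not in step  # placeholder success check
        return state

    def run_agent(goal: str, max_replans: int = 2) -> bool:
        plan, state = make_plan(goal), {"ok": True}
        for _attempt in range(max_replans + 1):
            for step in plan:
                state = execute(step, state)
                if not state["ok"]:          # state-aware check after each action
                    plan = make_plan(goal)   # re-plan from the observed state
                    break
            else:
                return True                  # all steps verified
        return False

    if __name__ == "__main__":
        print(run_agent("install the text editor"))  # True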

Re-Invoke: Tool invocation rewriting for zero-shot tool retrieval

Y Chen, J Yoon, DS Sachan, Q Wang… - arxiv preprint arxiv …, 2024 - arxiv.org
Recent advances in large language models (LLMs) have enabled autonomous agents with
complex reasoning and task-fulfillment capabilities using a wide range of tools. However …
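This entry is about retrieving the right tool for a query without task-specific training. A generic zero-shot tool-retrieval sketch follows: a stub rewriter reduces the query to a short intent, then tools are ranked by token overlap with their descriptions. Both pieces are stand-ins, not Re-Invoke's actual rewriting or retrieval components.

    # Generic zero-shot tool retrieval: reduce the user query to an intent
    # (a trivial keyword filter standing in for an LLM rewriter), then rank
    # tool descriptions by token overlap. A sketch only.

    TOOLS = {
        "get_weather": "Return the current weather forecast for a given city.",
        "send_email": "Send an email message to a recipient with a subject and body.",
        "search_flights": "Search available flights between two airports on a date.",
    }

    STOPWORDS = {"i", "need", "to", "the", "a", "for", "please", "can", "you"}

    def rewrite(query: str) -> set:
        # Stand-in for an LLM rewriting the query into a tool-oriented intent.
        return {t for t in query.lower().split() if t not in STOPWORDS}

    def rank_tools(query: str):
        intent = rewrite(query)
        scores = {name: len(intent & set(desc.lower().split()))
                  for name, desc in TOOLS.items()}
        return sorted(scores, key=scores.get, reverse=True)

    if __name__ == "__main__":
        print(rank_tools("I need to send an email to the finance team"))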

Learning to ask: When LLMs meet unclear instruction

W Wang, J Shi, C Wang, C Lee, Y Yuan… - arxiv preprint arxiv …, 2024 - arxiv.org
Equipped with the capability to call functions, modern large language models (LLMs) can
leverage external tools for addressing a range of tasks unattainable through language skills …
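The snippet describes LLM function calling; the paper's point is deciding when an instruction is too unclear to call a tool and asking the user instead. A small heuristic sketch of that decision is shown below (missing required arguments trigger a clarifying question); it illustrates the behavior, not the paper's learned method, and the schema and call are made-up examples.

    # Heuristic sketch: before executing a proposed tool call, check it against
    # the tool's schema; if required arguments are missing, ask a clarifying
    # question instead of calling the tool.

    TOOL_SCHEMA = {
        "name": "book_meeting_room",
        "required": ["room", "date", "start_time", "duration_minutes"],
    }

    def dispatch(proposed_call: dict) -> dict:
        missing = [p for p in TOOL_SCHEMA["required"]
                   if p not in proposed_call["arguments"]]
        if missing:
            return {"action": "ask_user",
                    "question": f"Could you provide: {', '.join(missing)}?"}
        return {"action": "call_tool", "call": proposed_call}

    if __name__ == "__main__":
        # The model parsed "book a room tomorrow", but the instruction was vague.
        vague = {"name": "book_meeting_room", "arguments": {"date": "tomorrow"}}
        print(dispatch(vague))  # asks for room, start_time, duration_minutes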

MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding

F Zhu, Z Liu, XY Ng, H Wu, W Wang, F Feng… - arxiv preprint arxiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) have achieved remarkable performance in many
vision-language tasks, yet their capabilities in fine-grained visual understanding remain …

AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants

PJ Sager, B Meyer, P Yan… - arxiv preprint arxiv …, 2025 - arxiv.org
Instruction-based computer control agents (CCAs) execute complex action sequences on
personal computers or mobile devices to fulfill tasks using the same graphical user …