Navigating the digital world as humans do: Universal visual grounding for GUI agents
Multimodal large language models (MLLMs) are transforming the capabilities of graphical
user interface (GUI) agents, facilitating their transition from controlled simulations to …
VisRAG: Vision-based retrieval-augmented generation on multi-modality documents
Retrieval-augmented generation (RAG) is an effective technique that enables large
language models (LLMs) to utilize external knowledge sources for generation. However …
ShowUI: One vision-language-action model for generalist GUI agent
Graphical User Interface (GUI) automation holds significant promise for enhancing human
productivity by assisting with digital tasks. While recent Large Language Models (LLMs) and …
OS-Atlas: A foundation action model for generalist GUI agents
Existing efforts in building GUI agents heavily rely on the availability of robust commercial
Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are …
Aligning to thousands of preferences via system message generalization
Although humans inherently have diverse values, current large language model (LLM)
alignment methods often assume that aligning LLMs with the general public's preferences is …
OSCAR: Operating system control via state-aware reasoning and re-planning
Large language models (LLMs) and large multimodal models (LMMs) have shown great
potential in automating complex tasks like web browsing and gaming. However, their ability …
Re-Invoke: Tool invocation rewriting for zero-shot tool retrieval
Recent advances in large language models (LLMs) have enabled autonomous agents with
complex reasoning and task-fulfillment capabilities using a wide range of tools. However …
Learning to ask: When LLMs meet unclear instruction
Equipped with the capability to call functions, modern large language models (LLMs) can
leverage external tools for addressing a range of tasks unattainable through language skills …
MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding
Large Vision-Language Models (LVLMs) have achieved remarkable performance in many
vision-language tasks, yet their capabilities in fine-grained visual understanding remain …
AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants
Instruction-based computer control agents (CCAs) execute complex action sequences on
personal computers or mobile devices to fulfill tasks using the same graphical user …