Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction (2412.04454v1)

Published 5 Dec 2024 in cs.CL

Abstract: Graphical User Interfaces (GUIs) are critical to human-computer interaction, yet automating GUI tasks remains challenging due to the complexity and variability of visual environments. Existing approaches often rely on textual representations of GUIs, which introduce limitations in generalization, efficiency, and scalability. In this paper, we introduce Aguvis, a unified pure vision-based framework for autonomous GUI agents that operates across various platforms. Our approach leverages image-based observations, and grounding instructions in natural language to visual elements, and employs a consistent action space to ensure cross-platform generalization. To address the limitations of previous work, we integrate explicit planning and reasoning within the model, enhancing its ability to autonomously navigate and interact with complex digital environments. We construct a large-scale dataset of GUI agent trajectories, incorporating multimodal reasoning and grounding, and employ a two-stage training pipeline that first focuses on general GUI grounding, followed by planning and reasoning. Through comprehensive experiments, we demonstrate that Aguvis surpasses previous state-of-the-art methods in both offline and real-world online scenarios, achieving, to our knowledge, the first fully autonomous pure vision GUI agent capable of performing tasks independently without collaboration with external closed-source models. We open-sourced all datasets, models, and training recipes to facilitate future research at https://aguvis-project.github.io/.

Overview of Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

The paper presents Aguvis, a novel framework for building autonomous graphical user interface (GUI) agents using a unified, purely vision-based approach. Aguvis targets the automation of task execution across digital environments such as websites, desktops, and mobile devices, relying on a consistent action space and image-based observations. The framework addresses the limitations of contemporary methods that depend on textual GUI representations, namely their weaker generalization, efficiency, and scalability.

Core Competencies and Challenges

The authors outline three critical competencies that effective GUI agents must develop: understanding, grounding, and planning. Understanding high-resolution, complex, human-oriented interfaces gives Aguvis contextual awareness and supports its reasoning. Grounding maps natural language instructions to elements in the GUI observation, while planning synthesizes this information into actionable steps toward task completion.
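
To make the interplay of these competencies concrete, the following is a minimal, hypothetical sketch of a single observation-to-action step; `vlm_generate` and `execute` are assumed interfaces for illustration, not the released Aguvis code.

```python
# Hypothetical one-step agent loop: understanding (pixels in), planning
# (a reasoning step), and grounding (an executable, coordinate-level action).
def agent_step(vlm_generate, execute, screenshot, goal, history):
    """Run one observation-to-action step of a pure-vision GUI agent."""
    prompt = {
        "image": screenshot,   # understanding: raw pixels, no HTML or accessibility tree
        "instruction": goal,   # the user's natural-language task
        "history": history,    # earlier thoughts and actions for context
    }
    # The model emits a short plan (inner monologue) followed by a grounded
    # command such as "pyautogui.click(x=0.52, y=0.31)".
    thought, action = vlm_generate(prompt)
    execute(action)
    history.append({"thought": thought, "action": action})
    return history
```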

The paper identifies significant challenges in addressing grounding and reasoning for GUI tasks:

  1. Pure Vision Framework Enhancement: Traditional models often use textual representations like HTML or accessibility trees, which can be verbose, environment-specific, and hard to generalize. Aguvis utilizes image-based representations that provide a more uniform and efficient basis for GUI interpretation, aligning more closely with intuitive human cognition and lowering inference latency.
  2. Cross-Platform Action Space Unification: Variability across GUI-based interactive environments necessitates a unified action space for model generalization. Aguvis pairs vision-based grounding with a pyautogui-style command system that abstracts platform-specific differences into a standardized interface (see the sketch after this list).
  3. Integration of Planning and Grounding: Conventional methods usually rely on closed-source LLMs for reasoning or map observations directly to actions without explicit reasoning. Aguvis integrates planning and grounding within a single vision-language model (VLM) pipeline, removing the dependence on a separate reasoning model.
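
As a rough illustration of the unified action space, the snippet below wraps a few pyautogui calls behind platform-agnostic commands; the normalized-coordinate convention and the `mobile_swipe` plugin are assumptions for illustration, not the paper's exact command set.

```python
# Sketch of a unified, pyautogui-style action space (illustrative only).
import pyautogui


def click(x: float, y: float) -> None:
    """Click at normalized (0-1) screen coordinates, shared across platforms."""
    width, height = pyautogui.size()
    pyautogui.click(x=int(x * width), y=int(y * height))


def write(text: str) -> None:
    """Type text into the currently focused element."""
    pyautogui.write(text, interval=0.05)


def mobile_swipe(x1: float, y1: float, x2: float, y2: float) -> None:
    """Hypothetical plugin command; a device backend (e.g., ADB) would implement it."""
    raise NotImplementedError("Dispatch to a platform-specific backend.")


# A grounded model output like "click(x=0.52, y=0.31)" is resolved against this
# small whitelist of commands before execution.
ACTIONS = {"click": click, "write": write, "mobile_swipe": mobile_swipe}
```

Keeping the command surface small and consistent is what allows a single model to drive web, desktop, and mobile environments without environment-specific parsers.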

Methodological Innovation

Aguvis proposes an expansive data collection strategy, unifying existing GUI interaction datasets and augmenting them through systematic template-based expansion. It operationalizes a two-stage training pipeline: the model is first trained on intensive grounding data to establish GUI grounding, then on multi-step trajectory datasets to learn planning and reasoning. This dual-stage protocol lets the model master atomic visual grounding before tackling extended agent-trajectory reasoning, enabling it to handle increasingly complex tasks.
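
The two-stage recipe can be pictured as two sequential fine-tuning passes over different data mixtures. The configuration below is a minimal sketch; names such as `finetune`, `gui_grounding_corpus`, and `agent_trajectories`, as well as the hyperparameter values, are hypothetical placeholders rather than the released training code.

```python
# Illustrative two-stage curriculum: grounding first, then planning/reasoning.
# Hyperparameters are illustrative, not taken from the paper.
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    dataset: str          # data mixture used in this stage
    learning_rate: float  # typically lowered for the later stage
    epochs: int


STAGES = [
    # Stage 1: atomic grounding -- map language expressions to on-screen elements.
    Stage(name="grounding", dataset="gui_grounding_corpus", learning_rate=1e-5, epochs=1),
    # Stage 2: planning and reasoning -- multi-step trajectories with inner monologue.
    Stage(name="planning_reasoning", dataset="agent_trajectories", learning_rate=5e-6, epochs=1),
]


def run_curriculum(model, finetune):
    """Apply each stage in order so grounding is learned before trajectory reasoning."""
    for stage in STAGES:
        model = finetune(model, stage)
    return model
```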

Results from extensive experiments demonstrate that Aguvis outperforms prior state-of-the-art methods, achieving higher accuracy on ScreenSpot, Mind2Web, and AndroidControl and autonomously completing tasks in real-world online environments. The authors open-source the datasets, models, and training recipes to encourage collaborative advancement in this domain.

Future Implications

In terms of practical applications, the unified framework promises advances in automation across complex digital interfaces. On the theoretical side, integrating diverse GUI environments into a single converging framework pushes forward work on task planning, reasoning, and cross-platform competence.

This paper underscores a likely future direction for autonomous agents in AI: a shift toward entirely vision-oriented models with stronger reasoning and planning abilities that operate independently of external closed-source LLMs. The functionality and adaptability of such agents could play a crucial role in expanding AI applications and improving human-computer interaction.

In conclusion, Aguvis marks a significant step forward by combining vision-language models with pragmatic applications, offering an insightful perspective on advancing the automation capabilities of AI agents across diverse GUI environments.

Authors (9)
  1. Yiheng Xu (20 papers)
  2. Zekun Wang (50 papers)
  3. Junli Wang (18 papers)
  4. Dunjie Lu (2 papers)
  5. Tianbao Xie (22 papers)
  6. Amrita Saha (23 papers)
  7. Doyen Sahoo (47 papers)
  8. Tao Yu (282 papers)
  9. Caiming Xiong (337 papers)