Agent S: An Open Agentic Framework for Autonomous GUI Interaction
The paper introduces Agent S, an agentic framework that automates desktop tasks by operating Graphical User Interfaces (GUIs) with simulated keyboard and mouse input, just as a human would. Its core aim is to improve task automation by addressing three challenges: acquiring domain-specific knowledge, planning over long-horizon tasks, and coping with dynamic, non-uniform interfaces.
Key Features and Methodology
Agent S addresses these challenges with two complementary mechanisms:
- Experience-Augmented Hierarchical Planning:
- This approach merges external and internal knowledge to decompose a task into subtasks: Online Web Search supplies up-to-date domain knowledge, while Narrative Memory contributes high-level experience from similar past tasks (see the planning sketch after this list).
- Episodic Memory complements this at the step level, storing and retrieving fine-grained experiences that guide the execution of individual subtasks.
- Agent-Computer Interface (ACI):
- The ACI provides a language-centric abstraction layer between the model and the screen. Its dual-input observation combines a screenshot with an image-augmented accessibility tree, improving element grounding and action precision.
- The framework constrains the agent to a bounded, discrete action space of primitives, making actions easier for Multimodal Large Language Models (MLLMs) to generate reliably and to verify (a minimal interface sketch follows the planning example below).
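To make the planning loop concrete, here is a minimal Python sketch of experience-augmented hierarchical planning. The names (`Memory`, `web_search`, `llm`, `plan`, `run`) are illustrative stand-ins rather than the actual Agent S API, and the model and search calls are stubbed out.

```python
# Minimal sketch of experience-augmented hierarchical planning.
# Memory, web_search, llm, plan, and run are illustrative stand-ins,
# not the actual Agent S API; model and search calls are stubbed out.
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Keyword-indexed store for past task experiences."""
    entries: dict[str, str] = field(default_factory=dict)

    def retrieve(self, query: str) -> str:
        # A real system would use embedding similarity;
        # exact-match lookup keeps the sketch self-contained.
        return self.entries.get(query, "")

    def save(self, query: str, experience: str) -> None:
        self.entries[query] = experience


def web_search(query: str) -> str:
    """Stub for an online search call returning domain knowledge."""
    return f"(web knowledge for: {query})"


def llm(prompt: str) -> str:
    """Stub for an MLLM call; returns the model's text response."""
    return f"(model response to: {prompt[:40]}...)"


def plan(task: str, narrative: Memory) -> list[str]:
    """Fuse external (web) and internal (narrative) knowledge,
    then decompose the task into an ordered list of subtasks."""
    external = web_search(task)
    internal = narrative.retrieve(task)
    fused = llm(f"Fuse knowledge for '{task}':\n{external}\n{internal}")
    subtasks = llm(f"Decompose '{task}' into subtasks using:\n{fused}")
    return [line for line in subtasks.splitlines() if line.strip()]


def run(task: str, narrative: Memory, episodic: Memory) -> None:
    for subtask in plan(task, narrative):
        # Episodic memory supplies step-level experience for execution.
        hint = episodic.retrieve(subtask)
        trajectory = llm(f"Execute '{subtask}' with hint: {hint}")
        episodic.save(subtask, trajectory)  # learn from this attempt
    narrative.save(task, f"summary of run for '{task}'")  # high-level experience
```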
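Similarly, the following sketch illustrates the idea of a bounded action space behind an ACI: the model sees named elements from the accessibility tree and can only emit a fixed set of primitives grounded in element IDs. The primitives and the `UIElement` schema here are assumptions for illustration, not Agent S's actual interface.

```python
# Minimal sketch of a bounded Agent-Computer Interface (ACI).
# The primitives and the UIElement schema are assumptions for
# illustration; the real Agent S action set differs in detail.
from dataclasses import dataclass


@dataclass
class UIElement:
    """One node from the (image-augmented) accessibility tree."""
    element_id: int
    role: str                          # e.g. "button", "text_field"
    name: str                          # accessible label shown to the model
    bbox: tuple[int, int, int, int]    # left, top, right, bottom in pixels


class ACI:
    """Exposes a fixed, discrete action vocabulary so the MLLM grounds
    actions in element IDs rather than free-form pixel coordinates."""

    def __init__(self, elements: list[UIElement]):
        self.elements = {e.element_id: e for e in elements}

    def click(self, element_id: int) -> str:
        e = self.elements[element_id]            # grounding check
        x = (e.bbox[0] + e.bbox[2]) // 2
        y = (e.bbox[1] + e.bbox[3]) // 2
        return f"CLICK ({x},{y}) on {e.role} '{e.name}'"

    def type_text(self, element_id: int, text: str) -> str:
        e = self.elements[element_id]
        return f"TYPE '{text}' into {e.role} '{e.name}'"

    def hotkey(self, *keys: str) -> str:
        return "PRESS " + "+".join(keys)


# The model emits one primitive per step, for example:
aci = ACI([UIElement(0, "text_field", "Search", (100, 40, 400, 70))])
print(aci.click(0))                    # CLICK (250,55) on text_field 'Search'
print(aci.type_text(0, "quarterly report"))
```

Because every action must reference an element the agent can actually see, a hallucinated target becomes an explicit lookup failure rather than a silent misclick.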
Experimental Results
Evaluated on the OSWorld benchmark, Agent S improved the success rate by 9.37 percentage points, an 83.6% relative improvement over the previous best result. Its performance on the WindowsAgentArena benchmark further demonstrates that the framework generalizes across operating systems without OS-specific adaptation.
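(For context, a 9.37-point absolute gain amounting to an 83.6% relative improvement implies a prior baseline success rate of roughly 9.37 / 0.836 ≈ 11.2%; this baseline is inferred from the reported figures rather than stated here.)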
Implications and Future Work
The contributions are manifold:
- Practical Implications: By achieving substantial improvements in automating desktop tasks, Agent S enhances productivity and accessibility, potentially transforming user interaction paradigms.
- Theoretical Implications: The proposed fusion of experience-augmented planning and MLLM interfaces opens avenues for refining cognitive architectures in agent frameworks.
Speculatively, future developments could see these frameworks adapted to smaller, open-source MLLMs, extending access to this capability while preserving performance on complex tasks. Examining trade-offs in execution speed and resource consumption could further refine agent deployment in practical applications.
Conclusion
Agent S demonstrates how AI can be adapted for complex task automation in real-world GUI environments. Its combination of hierarchical planning, experiential learning, and an abstracted interaction layer marks a significant step toward human-like computer control by autonomous agents. This work lays a foundation for further exploration of zero-shot, agentic methods in the evolving landscape of AI-driven automation.