AutoGLM: Autonomous Foundation Agents for GUIs (2411.00820v1)

Published 28 Oct 2024 in cs.HC, cs.AI, cs.CL, and cs.LG

Abstract: We present AutoGLM, a new series in the ChatGLM family, designed to serve as foundation agents for autonomous control of digital devices through Graphical User Interfaces (GUIs). While foundation models excel at acquiring human knowledge, they often struggle with decision-making in dynamic real-world environments, limiting their progress toward artificial general intelligence. This limitation underscores the importance of developing foundation agents capable of learning through autonomous environmental interactions by reinforcing existing models. Focusing on Web Browser and Phone as representative GUI scenarios, we have developed AutoGLM as a practical foundation agent system for real-world GUI interactions. Our approach integrates a comprehensive suite of techniques and infrastructures to create deployable agent systems suitable for user delivery. Through this development, we have derived two key insights: First, the design of an appropriate "intermediate interface" for GUI control is crucial, enabling the separation of planning and grounding behaviors, which require distinct optimization for flexibility and accuracy respectively. Second, we have developed a novel progressive training framework that enables self-evolving online curriculum reinforcement learning for AutoGLM. Our evaluations demonstrate AutoGLM's effectiveness across multiple domains. For web browsing, AutoGLM achieves a 55.2% success rate on VAB-WebArena-Lite (improving to 59.1% with a second attempt) and 96.2% on OpenTable evaluation tasks. In Android device control, AutoGLM attains a 36.2% success rate on AndroidLab (VAB-Mobile) and 89.7% on common tasks in popular Chinese APPs.


Summary

  • The paper presents AutoGLM, an extension of the ChatGLM model designed as an autonomous foundation agent capable of interacting with graphical user interfaces like web browsers and Android devices.
  • AutoGLM introduces an intermediate interface to separate planning and grounding, alongside a progressive online reinforcement learning framework for self-evolving training and failure recovery.
  • Evaluations show AutoGLM achieves success rates up to 59.1% on VAB-WebArena-Lite and 89.7% on human-evaluated common Android tasks, demonstrating significant improvement over existing models like GPT-4o and Claude-3.5-Sonnet.

AutoGLM: Autonomous Foundation Agents for GUIs

The paper presents AutoGLM, a new extension of the ChatGLM model family aimed at building autonomous foundation agents for graphical user interfaces (GUIs). It addresses a significant shortcoming of contemporary foundation models: although they excel at acquiring human knowledge, they often struggle with decision-making in dynamic real-world environments. AutoGLM targets this gap by learning through autonomous interaction with GUIs, which the authors frame as a necessary step toward artificial general intelligence (AGI).

Key Concepts and Methodology

AutoGLM is introduced as a foundation agent system designed for real-world GUI interactions, with web browsers and Android devices as its primary scenarios. The paper describes two core methodological innovations:

  1. Intermediate Interface Design: The work disentangles planning and grounding behaviors in GUI control. By introducing an "intermediate interface" between the two, the planner can be optimized for flexibility while the grounder is optimized for accuracy, improving both error recovery and action precision (a minimal sketch of this separation follows the list).
  2. Progressive Training Framework: A self-evolving online curriculum reinforcement learning (RL) mechanism trains the agent on tasks of progressively increasing complexity and emphasizes learning from failures, which is difficult to achieve through offline training alone.
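
Below is a minimal, hypothetical Python sketch of such an intermediate interface: the planner emits a natural-language step, and a separate grounder resolves it into a concrete GUI operation. All class and function names here are illustrative assumptions, not AutoGLM's actual API.

```python
# Hypothetical sketch of an "intermediate interface" separating planning from
# grounding. Class and method names are illustrative, not AutoGLM's real API.
from dataclasses import dataclass

@dataclass
class NLAction:
    """Planner output: a natural-language description of the next step."""
    description: str            # e.g. "click the 'Book now' button"

@dataclass
class GUIAction:
    """Grounder output: a concrete, executable GUI operation."""
    kind: str                   # "click", "type", "scroll", ...
    x: int
    y: int
    text: str = ""

class Planner:
    """Large-model planner: reasons over the task and screen, emits an NLAction."""
    def next_action(self, task: str, screen_text: str) -> NLAction:
        # In the real system this would be an LLM call; stubbed out here.
        return NLAction(description=f"click the element most relevant to: {task}")

class Grounder:
    """Grounding model: maps an NLAction onto concrete screen coordinates."""
    def ground(self, action: NLAction, screenshot) -> GUIAction:
        # In the real system this would be a visual grounding model; stubbed out here.
        return GUIAction(kind="click", x=512, y=384)

def step(task: str, screen_text: str, screenshot) -> GUIAction:
    # The NLAction is the intermediate interface: the planner is tuned for
    # flexible reasoning, the grounder for pixel-level accuracy, and either
    # component can be retrained or replaced independently of the other.
    plan = Planner().next_action(task, screen_text)
    return Grounder().ground(plan, screenshot)
```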

The authors combine a broad suite of techniques for building capable foundation agents, including behavior cloning, pre-training enhancement, curriculum learning, and reinforcement learning. This comprehensive approach allows AutoGLM to tackle both the planning and grounding challenges commonly encountered in GUI contexts.
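
The loop below is a heavily simplified, hypothetical sketch of what a self-evolving online curriculum RL procedure can look like: tasks for the next round are derived from recent failures, rollouts are scored by an outcome evaluator, and the policy is updated on the collected batch. All names are illustrative; the paper's actual pipeline (in the spirit of WebRL) is more elaborate.

```python
# Heavily simplified, hypothetical sketch of self-evolving online curriculum RL:
# new tasks are proposed from recent failures, rollouts are scored by an outcome
# evaluator, and the policy is updated on the collected batch.
import random

def evaluate(trajectory, task) -> bool:
    """Outcome reward: did the trajectory complete the task? (stubbed)"""
    return random.random() < 0.3

def rollout(policy, task):
    """Run the agent on one task; return the trajectory and a success flag."""
    trajectory = [policy(task)]              # in reality: a multi-step GUI episode
    return trajectory, evaluate(trajectory, task)

def propose_tasks(failed_tasks, seed_tasks, n=4):
    """Self-evolving curriculum: derive the next batch from recent failures.
    (The actual system has a model rewrite failed instructions into new variants.)"""
    pool = failed_tasks or seed_tasks
    return [random.choice(pool) for _ in range(n)]

def train(policy, update_fn, seed_tasks, num_rounds=5):
    failed = []
    for _ in range(num_rounds):
        tasks = propose_tasks(failed, seed_tasks)        # curriculum tracks current ability
        batch = [(t, *rollout(policy, t)) for t in tasks]
        update_fn(policy, batch)                         # e.g. filtered BC / policy gradient
        failed = [t for t, _, ok in batch if not ok]     # failures seed the next round

# Usage with trivial stubs:
train(policy=lambda task: ("observation", "click"),
      update_fn=lambda p, b: None,
      seed_tasks=["book a table for two on OpenTable"])
```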

Results and Evaluation

The research findings demonstrate AutoGLM's effectiveness across multiple domains of GUI interactions:

  • Web Browsing: AutoGLM achieved a task success rate (SR) of 55.2% on the VAB-WebArena-Lite benchmark, with potential improvement to 59.1% upon a second attempt. Notably, it obtained a 96.2% SR on OpenTable real-world evaluation tasks.
  • Android Device Control: AutoGLM showed a 36.2% SR on AndroidLab (previously VAB-Mobile) and 89.7% SR in human evaluations of common tasks within popular Chinese apps.

These results indicate significant performance improvements over existing models such as GPT-4o and Claude-3.5-Sonnet. The AutoGLM system is also made accessible through the Qingyan Browser Plugin for web applications and an Android AccessibilityService interface for device control testing.
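
For context on how grounded actions reach a device, the snippet below shows one common way to drive click/type primitives on an Android device from a host machine via adb. This is a stand-in for illustration only; the paper's own delivery path is an on-device AccessibilityService, not adb.

```python
# Illustrative stand-in only: driving grounded click/type actions on an Android
# device over adb from a host machine. The paper's own delivery path is an
# on-device AccessibilityService, not adb.
import subprocess
from typing import Optional

def adb_tap(x: int, y: int, serial: Optional[str] = None) -> None:
    """Send a tap at screen coordinates (x, y) via `adb shell input tap`."""
    cmd = ["adb"] + (["-s", serial] if serial else []) + ["shell", "input", "tap", str(x), str(y)]
    subprocess.run(cmd, check=True)

def adb_type(text: str, serial: Optional[str] = None) -> None:
    """Type text into the focused field via `adb shell input text` (spaces become %s)."""
    cmd = ["adb"] + (["-s", serial] if serial else []) + ["shell", "input", "text", text.replace(" ", "%s")]
    subprocess.run(cmd, check=True)

# Example: execute a grounded "click" action.
# adb_tap(512, 384)
```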

Implications and Future Directions

The implications of this research are multifaceted. Practically, AutoGLM represents substantial progress toward deployable, user-facing intelligent agents capable of automating complex tasks across widely used digital interfaces. Theoretically, this work contributes to the broader understanding of integrating decision-making capabilities into AI, moving closer to realizing AGI aspirations.

Future developments, as suggested by the authors, will focus on further refining the interface design to enhance modularity and scalability, and on improving reinforcement learning techniques to boost agent adaptability and learning efficiency in diverse environments. Additional experiments with larger and more diverse datasets could further improve AutoGLM's proficiency, helping it move from a prototype toward robust, real-world deployment.

In summary, the paper demonstrates the potential of combining advanced machine learning methods with robust system design to achieve functional interactive agents, marking a noteworthy advancement in GUI-based automation and interaction.