Overview of ZeroGUI: Automating Online GUI Learning at Zero Human Cost
The paper "ZeroGUI: Automating Online GUI Learning at Zero Human Cost" presents an innovative framework for training Graphical User Interface (GUI) agents using Vision-LLMs (VLMs). It addresses significant limitations inherent in existing methods for training GUI agents, particularly their reliance on offline learning paradigms, which require extensive manual annotation and are constrained in dynamic environments. The proposed ZeroGUI framework automates the training process for GUI agents at zero human cost by leveraging advanced capabilities of VLMs, thus enabling scalable and adaptive learning in real-time environments.
Core Contributions
The paper introduces several key components designed to overcome the limitations of traditional GUI agent training approaches:
- Automatic Task Generation: ZeroGUI prompts a VLM with the agent's current environment observations (e.g., screenshots) to automatically generate diverse, scalable sets of training tasks. This removes the need for manually designed task suites while still covering a wide range of operational scenarios (a code sketch of this VLM call, together with reward estimation, follows this list).
- Automatic Reward Estimation: The framework also uses a VLM to judge whether a task was completed, producing reward signals without predefined, script-based verifiers. This gives the agent reliable feedback even on novel tasks for which no hand-crafted evaluation function exists.
- Two-Stage Online Reinforcement Learning: ZeroGUI first trains the agent with online reinforcement learning on the automatically generated tasks to build foundational capabilities, and then performs test-time adaptation on the target tasks using the same VLM-based reward estimation, letting the agent continue to refine its policy in the deployment environment (a schematic training loop follows the code sketch below).
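To make the two VLM-driven components concrete, below is a minimal sketch of how task generation and reward estimation might be wired up against an OpenAI-compatible chat endpoint. The model name, prompt wording, and the `screenshot_to_data_url` helper are illustrative assumptions, not the paper's actual implementation.

```python
import base64
import json

from openai import OpenAI  # assumes an OpenAI-compatible endpoint serving a VLM

client = OpenAI()  # endpoint and model below are placeholders, not from the paper


def screenshot_to_data_url(png_bytes: bytes) -> str:
    """Encode a raw PNG screenshot as a data URL accepted by the chat API."""
    return "data:image/png;base64," + base64.b64encode(png_bytes).decode()


def generate_tasks(png_bytes: bytes, n_tasks: int = 5) -> list[str]:
    """Ask the VLM to propose candidate GUI tasks grounded in the current screen."""
    prompt = (
        f"You are looking at a desktop screenshot. Propose {n_tasks} concrete, "
        "verifiable GUI tasks a user could perform from this state. "
        "Return only a JSON list of task instruction strings."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any sufficiently capable VLM would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": screenshot_to_data_url(png_bytes)}},
            ],
        }],
    )
    return json.loads(resp.choices[0].message.content)


def estimate_reward(task: str, final_screens: list[bytes]) -> float:
    """Ask the VLM to judge task completion from final screenshots (binary reward)."""
    content = [{
        "type": "text",
        "text": (
            f"Task: {task}\n"
            "Below are the final screenshots of an agent attempting this task. "
            "Answer with a single word: SUCCESS or FAILURE."
        ),
    }]
    for png in final_screens:
        content.append({"type": "image_url",
                        "image_url": {"url": screenshot_to_data_url(png)}})
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": content}],
    )
    verdict = resp.choices[0].message.content.strip().upper()
    return 1.0 if "SUCCESS" in verdict else 0.0
```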
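The two-stage schedule described above can then be expressed as a short training loop that reuses the functions from the previous sketch. The `policy` and `env` interfaces (`act`, `reset`, `step`, `update`, `screenshot`) are hypothetical placeholders, and the actual policy-gradient update is abstracted away; this is a schematic of the control flow under those assumptions, not the paper's training code.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    task: str
    trajectory: list   # (observation, action) pairs collected in the environment
    reward: float      # VLM-estimated binary reward


def collect_rollout(policy, env, task: str) -> Rollout:
    """Run the agent on one task, then score the outcome with the VLM reward judge."""
    obs = env.reset(task)
    trajectory, done = [], False
    while not done:
        action = policy.act(obs, task)
        obs, done = env.step(action)
        trajectory.append((obs, action))
    reward = estimate_reward(task, [env.screenshot()])  # from the previous sketch
    return Rollout(task, trajectory, reward)


def train_zero_gui(policy, env, rounds: int = 3, test_tasks: list[str] | None = None):
    """Two-stage schedule: train on generated tasks, then adapt at test time."""
    # Stage 1: online RL on automatically generated tasks.
    for _ in range(rounds):
        tasks = generate_tasks(env.screenshot())           # VLM proposes tasks
        rollouts = [collect_rollout(policy, env, t) for t in tasks]
        policy.update(rollouts)                            # hypothetical RL update step
    # Stage 2: test-time adaptation on target tasks, still using VLM-estimated rewards.
    for task in (test_tasks or []):
        rollouts = [collect_rollout(policy, env, task) for _ in range(4)]
        policy.update(rollouts)
    return policy
```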
Experimental Validation
The framework is validated through comprehensive experiments using OSWorld and AndroidLab environments. The results are noteworthy:
- ZeroGUI delivered a 14% relative improvement in task success rate over the UI-TARS-7B-DPO baseline and a 63% relative improvement over Aguvis-7B, showing that the online reinforcement learning setup scales across different base agents.
- The framework significantly enhances the agent's adaptability to changes and dynamic elements within GUI environments, reducing overfitting to static task sets.
- The adoption of VLM-based components for task generation and reward estimation effectively eliminates the need for human intervention in the training process, providing a cost-effective and scalable solution.
Theoretical and Practical Implications
Theoretically, ZeroGUI extends the reach of reinforcement learning in UI interaction by showing that VLMs can supply both the tasks and the reward signals needed to adapt agents autonomously and at scale. Practically, this automation removes most of the human cost and effort from training, giving organizations and developers a robust tool for intelligent UI automation.
Future Work and Developments
The paper speculates that this approach could pave the way for more sophisticated and automated training methods in various AI applications, particularly those requiring dynamic interaction with real-world environments. Future developments may include further exploration into optimizing reward estimation processes and enhancing scalability across broader application domains, such as web interfaces and mobile platforms.
Overall, the ZeroGUI framework represents a significant advance in automated GUI agent training, facilitating the deployment of efficient, adaptable, and cost-effective AI-driven interaction models in practical settings.