- The paper introduces a novel reverse task synthesis method that automates and diversifies GUI agent trajectory construction without relying on pre-defined tasks.
- The methodology pairs interaction-driven exploration with a trajectory reward model to ensure coherent, high-quality data for agent training.
- Experimental results on benchmarks like AndroidWorld show a significant boost in task success rates, underscoring the pipeline's scalability and practical benefits.
Overview of OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
The paper "OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis" presents an innovative approach to a critical challenge in training Graphical User Interface (GUI) agents. These agents, powered by vision-language models (VLMs), have shown promising capabilities in automating computer tasks. However, the research identifies a significant bottleneck: collecting the high-quality trajectory data needed to train these agents effectively.
Key Contributions
The OS-Genesis system introduces a novel pipeline to synthesize high-quality and diverse GUI agent trajectories. Unlike traditional methods that rely on pre-defined tasks, which often limit scalability and data diversity, OS-Genesis employs an interaction-driven approach. This method reverses the conventional trajectory collection process: agents first perceive environments and engage in step-wise interactions, and tasks are derived retrospectively from what was observed. By grounding synthesized tasks in functionality the agent has actually encountered, reverse task synthesis yields trajectories that reflect a detailed understanding of GUI interactions.
Methodology
- Interaction-Driven Functional Discovery: The system begins with an agent-driven exploration of GUI environments, such as web browsers and Android emulators, to discover interactive elements. This process generates large datasets of ⟨pre-action state, action, post-action state⟩ triplets that record each interaction and its effect.
- Reverse Task Synthesis: Leveraging the interaction data, OS-Genesis retroactively creates low- and high-level instructions grounded in the observed functionalities of the GUI elements. This is achieved through an annotation model that infers meaningful tasks from the state transitions each interaction causes.
- Trajectory Reward Model (TRM): To ensure data quality, a reward model scores each generated trajectory for coherence and completeness. This graded scoring is particularly important because, rather than discarding incomplete trajectories outright, it lets partially successful trajectories, which still contain valuable interaction data, contribute to training with appropriately reduced weight.
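The three stages above can be sketched in Python. This is an illustrative skeleton, not the paper's implementation: the names (`Triplet`, `synthesize_task`, `trm_weighted_sample`) are invented for this sketch, and a string template stands in for the VLM-based annotation model and reward model the paper actually uses.

```python
import random
from dataclasses import dataclass


@dataclass
class Triplet:
    """One exploration step: state before an action, the action, state after."""
    pre_state: str
    action: str
    post_state: str


@dataclass
class Trajectory:
    """A candidate training trajectory: a synthesized instruction plus its steps."""
    instruction: str
    steps: list
    reward: float = 0.0  # TRM score; higher = more coherent/complete


def synthesize_task(triplet: Triplet) -> str:
    """Stand-in for reverse task synthesis: the paper infers an instruction
    from the observed state transition with an annotation model; here a
    simple template plays that role."""
    return f"Perform {triplet.action!r} so that the screen shows: {triplet.post_state}"


def trm_weighted_sample(trajectories: list, k: int, rng=random) -> list:
    """Sample k trajectories with probability proportional to TRM reward,
    so low-scoring (e.g. incomplete) trajectories are down-weighted
    rather than discarded outright."""
    weights = [t.reward for t in trajectories]
    return rng.choices(trajectories, weights=weights, k=k)
```

The key design point this sketch captures is that quality control happens by reward-proportional weighting at training time, not by hard filtering at collection time.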
Experimental Evaluation
The evaluation of OS-Genesis was conducted on challenging benchmarks like AndroidWorld and WebArena. The results show substantial improvements in the performance of GUI agents trained with OS-Genesis data compared to those trained on task-driven baselines. For instance, OS-Genesis improved the task success rate from 9.82% to 17.41% on AndroidWorld, demonstrating its efficacy in generating high-quality training data that significantly enhances agent capabilities.
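The reported AndroidWorld numbers imply a sizeable relative gain; a quick check of the arithmetic:

```python
# AndroidWorld task success rates (%) as reported in the paper
baseline, with_os_genesis = 9.82, 17.41

absolute_gain = with_os_genesis - baseline  # in percentage points
relative_gain = absolute_gain / baseline    # fraction of the baseline

print(f"+{absolute_gain:.2f} pp absolute, {relative_gain:.0%} relative")
# prints "+7.59 pp absolute, 77% relative"
```

In other words, OS-Genesis data yields roughly a 1.8x improvement over the task-driven baseline on this benchmark.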
Implications and Future Perspectives
OS-Genesis sets a new standard in the field of GUI agent training by eliminating the dependency on resource-intensive human supervision and pre-defined tasks. The pipeline not only enhances the training process but also aligns the synthesized data more closely with real-world scenarios, thus increasing the practical applicability of GUI agents.
This work opens new avenues for expanding AI's role in digital automation by leveraging VLMs and interaction data to create more flexible and powerful agents. Future developments could focus on refining the TRM, exploring open-source annotations to replace proprietary models, and extending the OS-Genesis approach to more diverse environments and applications.
In conclusion, OS-Genesis significantly advances the state of GUI agent training by providing a scalable, efficient, and diverse data synthesis approach. This work lays a solid foundation for future exploration in autonomous digital interfaces and reinforces the importance of innovative data collection methods in AI research.