AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents (2506.14205v1)

Published 17 Jun 2025 in cs.CL

Abstract: We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents. Leveraging information asymmetry, AgentSynth constructs subtasks that are simple during generation but significantly more challenging when composed into long-horizon tasks, enabling the creation of over 6,000 diverse and realistic tasks. Our pipeline begins with an LLM-based task proposer guided by a persona, followed by an execution agent that completes the task and logs the trajectory. This process is repeated iteratively to form a sequence of subtasks, which are then summarized by a separate agent into a composite task of controllable difficulty. A key strength of AgentSynth is its ability to precisely modulate task complexity by varying the number of subtasks. Empirical evaluations show that state-of-the-art LLM agents suffer a steep performance drop, from 18% success at difficulty level 1 to just 4% at level 6, highlighting the benchmark's difficulty and discriminative power. Moreover, our pipeline achieves a low average cost of $0.60 per trajectory, orders of magnitude cheaper than human annotations. Our code and data are publicly available at https://github.com/sunblaze-ucb/AgentSynth

Summary

  • The paper presents AgentSynth, a pipeline that automatically generates scalable and diverse computer-use tasks by decomposing complex interactions into manageable subtasks.
  • The methodology employs LLMs for task proposing, execution, verification, revision, and iterative summarization, achieving low annotation costs and high task fidelity.
  • Empirical results reveal that agent performance drops sharply as task complexity increases, while each generated trajectory costs only $0.60, underscoring the system's efficiency.

AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents

The paper introduces AgentSynth, a novel pipeline designed to synthesize scalable and diverse datasets for training generalist computer-use agents. The framework uses LLMs to construct computer-use tasks, an increasingly important need given the complex, multi-step interactions required in general computing environments. The paper addresses existing challenges in dataset creation by proposing a system that efficiently generates high-quality, complex tasks tailored to evaluating the generalist capabilities of LLM agents.

Methodology Overview

AgentSynth operates by combining LLM-based agents with a systematic use of information asymmetry: subtasks that are easy to generate individually become substantially harder once composed into a long-horizon task. The approach breaks complex tasks down into manageable subtasks, then recomposes them into more comprehensive tasks, allowing the pipeline to control task difficulty precisely by varying the length of the subtask chain (see the sketch after this list):

  • Task Proposer: An initial task is generated based on a given persona, allowing for task diversity and realism.
  • Task Executor: Utilizes both GPT-4.1 and computer-use-preview models to plan actions and ground them with pixel-level precision.
  • Task Verifier: Evaluates task completion, ensuring reliability in task execution and providing a completion status.
  • Task Reviser: Adjusts task descriptions for incomplete executions, maintaining alignment with execution reality.
  • Follow-up Task Proposer and Task Summarizer: Iteratively build upon previous tasks, eventually generating higher-difficulty composite tasks.
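To make the control flow concrete, the following is a minimal Python sketch of the loop these five components describe. It is an illustration under stated assumptions, not the authors' released code: the helper names (`call_llm`, `run_executor`) are hypothetical placeholders, and difficulty is modeled simply as the number of chained subtasks, as described in the paper.

```python
# A minimal sketch of the AgentSynth generation loop. The helpers below
# (call_llm, run_executor) are hypothetical placeholders, not the authors'
# released API; difficulty is modeled as the number of chained subtasks.

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion call (e.g., to GPT-4.1)."""
    raise NotImplementedError("plug in an LLM client")

def run_executor(task: str) -> str:
    """Placeholder: the execution agent acts in a desktop/browser
    environment and returns a logged trajectory."""
    raise NotImplementedError("plug in the execution environment")

def synthesize(persona: str, difficulty: int) -> dict:
    subtasks, trajectories = [], []
    # Task proposer: seed the chain from a persona.
    task = call_llm(f"Propose a realistic computer-use task for persona: {persona}")
    for _ in range(difficulty):
        trajectory = run_executor(task)              # task executor
        verdict = call_llm(                          # task verifier
            f"Was this task completed? Answer yes or no.\n"
            f"Task: {task}\nTrajectory: {trajectory}"
        )
        if verdict.strip().lower().startswith("no"):
            # Task reviser: align the description with what actually happened.
            task = call_llm(
                f"Rewrite the task to match the executed trajectory.\n"
                f"Task: {task}\nTrajectory: {trajectory}"
            )
        subtasks.append(task)
        trajectories.append(trajectory)
        # Follow-up proposer: extend the chain with the next subtask.
        task = call_llm(f"Propose a follow-up subtask continuing from: {subtasks}")
    # Task summarizer: collapse the chain into one long-horizon task whose
    # difficulty scales with the number of composed subtasks.
    composite = call_llm(f"Summarize these subtasks as a single task: {subtasks}")
    return {"task": composite, "subtasks": subtasks, "trajectories": trajectories}
```

In the full pipeline, `run_executor` would drive a real desktop or browser environment, with GPT-4.1 planning actions and a computer-use model grounding them at the pixel level, and each `call_llm` prompt would be far more elaborate than these one-liners.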

By automating the entire process, the authors achieve a cost-efficient pipeline that avoids the expense of human annotation for high-quality datasets. The generated tasks are diverse, spanning web navigation, office software use, coding, and more, demonstrating practical applicability across domains.

Results

Empirical evaluation highlights AgentSynth's efficacy in producing challenging tasks. State-of-the-art LLM agents exhibit a sharp decline in performance as task difficulty increases, from 18% success at level 1 to 4% at level 6. These results underscore the benchmark's discriminative power and reveal significant headroom for improvement in agent capabilities. The benchmark is further validated by its low cost: each trajectory costs only $0.60, orders of magnitude cheaper than human annotation.
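As a worked illustration of how such numbers are computed, the sketch below aggregates per-task outcomes into a success rate per difficulty level. The record fields (`difficulty`, `success`) are hypothetical names chosen for this example, not a schema from the paper's released data.

```python
from collections import defaultdict

def success_by_difficulty(results):
    """results: iterable of dicts with hypothetical keys
    'difficulty' (int, number of composed subtasks) and
    'success' (bool, whether the agent completed the task)."""
    totals, wins = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["difficulty"]] += 1
        wins[r["difficulty"]] += int(r["success"])
    return {d: wins[d] / totals[d] for d in sorted(totals)}

# Matching the reported trend, output would look like
# {1: 0.18, ..., 6: 0.04}.
```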

Implications and Future Directions

Practically, AgentSynth represents a significant stride in synthetic data generation, particularly for evaluating and training LLM-based agents in generalist settings. Theoretically, it highlights the utility of task decomposition and information asymmetry for generating complex yet tractable datasets. Future research can extend this framework by incorporating more sophisticated environments and enhancing agents' perceptual grounding capabilities. As high-quality datasets become increasingly critical, AgentSynth serves as a promising foundation for advancing generalist computer-use agents capable of performing complex, open-ended tasks in real-world settings.

In conclusion, AgentSynth provides a robust, scalable, and cost-efficient solution for generating diverse tasks that benchmark the capabilities of LLM agents. Through its innovative approach to task synthesis, it sets a standard for future developments in agent-environment interaction.
