AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials (2412.09605v2)

Published 12 Dec 2024 in cs.CL

Abstract: Graphical User Interface (GUI) agents can automate complex tasks across digital environments, but their development is hindered by the scarcity of high-quality trajectory data for training. Existing approaches rely on expensive human annotation, making them unsustainable at scale. We propose AgentTrek, a scalable data synthesis pipeline that generates web agent trajectories by leveraging publicly available tutorials. Our three-stage method: (1) automatically harvests and filters tutorial-like texts from the internet using a specialized classification model, (2) transforms these texts into structured task specifications with step-by-step instructions, and (3) employs a visual-LLM (VLM) agent to execute these instructions in real environments, while a VLM-based evaluator verifies trajectory correctness. The synthesized trajectories encompass multiple modalities, including text-based HTML observations with function-calling API actions, and vision-based screenshot observations with pixel-level actions. This multimodal data, enriched with chain-of-thought reasoning, enables agents to achieve state-of-the-art performance on both textual web browsing benchmarks (e.g., WebArena) and visual web grounding and browsing benchmarks (e.g., ScreenSpot Web and Multimodal Mind2Web). Furthermore, our fully automated approach significantly reduces data collection costs, achieving a cost of just $0.55 per high-quality trajectory without human annotators. Our work demonstrates that guided replay using web tutorials is a practical and scalable strategy for training advanced GUI agents, paving the way for more capable and autonomous digital assistants.

Summary

  • The paper presents a novel data synthesis methodology that transforms web tutorials into actionable GUI agent trajectories.
  • The synthesized trajectories enable agents to outperform traditional training methods in multi-step, context-rich tasks.
  • The approach significantly reduces annotation costs while delivering a multimodal dataset for scalable agent training.

Overview of AgentTrek: A Pipeline for Synthesizing Web Agent Trajectories

This essay discusses "AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials," a research paper that presents a methodology for generating trajectory data for Graphical User Interface (GUI) agents. The authors introduce a pipeline called AgentTrek, which addresses a significant limitation in the development of GUI agents: the lack of large, high-quality trajectory datasets needed for effective training.

AgentTrek is a scalable data synthesis pipeline that automates the generation of web agent trajectories, using web tutorials as its data source. The pipeline filters tutorial-like texts from the web, converts them into structured task goals with step-by-step instructions, and has a vision-language model (VLM) agent execute those instructions in real web environments. A VLM-based evaluator then checks each generated trajectory for correctness.
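The three stages described above can be sketched as a small control flow. This is only an illustrative outline, not the authors' implementation: the keyword filter stands in for the paper's classification model, the line-splitting converter stands in for the tutorial-to-task transformation, and `TaskSpec`, `Trajectory`, `synthesize`, `echo_agent`, and `length_evaluator` are hypothetical names introduced here. The agent and evaluator are passed in as plain callables so that real VLM-backed components could be substituted.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class TaskSpec:
    """Structured task derived from a tutorial (hypothetical schema)."""
    goal: str
    steps: List[str]


@dataclass
class Trajectory:
    """Actions recorded while replaying one task (hypothetical schema)."""
    task: TaskSpec
    actions: List[str] = field(default_factory=list)
    verified: bool = False


def looks_like_tutorial(text: str) -> bool:
    # Stage 1 stand-in: a crude keyword heuristic, NOT the paper's classifier.
    lowered = text.lower()
    return "how to" in lowered and "step" in lowered


def to_task_spec(text: str) -> TaskSpec:
    # Stage 2 stand-in: first non-empty line is the goal, the rest are steps.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return TaskSpec(goal=lines[0], steps=lines[1:])


def synthesize(
    texts: Iterable[str],
    agent: Callable[[TaskSpec], List[str]],
    evaluator: Callable[[Trajectory], bool],
) -> List[Trajectory]:
    """Filter texts, convert them to tasks, replay them, keep verified runs."""
    kept: List[Trajectory] = []
    for text in texts:
        if not looks_like_tutorial(text):
            continue
        task = to_task_spec(text)
        # Stage 3: the agent replays the task, then the evaluator judges it.
        trajectory = Trajectory(task=task, actions=agent(task))
        trajectory.verified = evaluator(trajectory)
        if trajectory.verified:
            kept.append(trajectory)
    return kept


def echo_agent(task: TaskSpec) -> List[str]:
    # Toy agent: emits one placeholder action per instruction step.
    return [f"do: {step}" for step in task.steps]


def length_evaluator(trajectory: Trajectory) -> bool:
    # Toy evaluator: accepts a run if every step produced an action.
    return len(trajectory.actions) == len(trajectory.task.steps)
```

Calling `synthesize(texts, echo_agent, length_evaluator)` on a list of raw strings returns only the trajectories that passed both the filter and the evaluator, which mirrors how the pipeline discards non-tutorial text and failed replays.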

Key Contributions and Findings

  1. Data Synthesis Methodology: AgentTrek harvests tutorial-like texts from the internet and automatically converts them into actionable tasks. A VLM agent executes these tasks in real web environments, and a VLM-based evaluator checks each resulting trajectory for correctness. This reduces reliance on costly human annotation and offers a practical route to training GUI agents at scale.
  2. Enhanced Agent Performance: The paper reports improved grounding and planning in GUI agents trained on AgentTrek trajectories. These agents outperformed those trained on existing datasets in multi-step, context-rich tasks, reaching state-of-the-art performance on textual web browsing benchmarks (e.g., WebArena) and on visual web grounding and browsing benchmarks (e.g., ScreenSpot Web and Multimodal Mind2Web).
  3. Cost-Efficiency: Because the pipeline is fully automated, it is markedly cheaper than human annotation. The paper reports a cost of $0.55 per high-quality trajectory with no human annotators involved, which makes large-scale data collection for GUI agent training practical.
  4. Multimodal Dataset Characteristics: The synthesized trajectories combine visual context, intermediate reasoning, and structured actions. They cover two modalities: text-based HTML observations paired with function-calling API actions, and screenshot observations paired with pixel-level actions. Both are enriched with chain-of-thought reasoning, making the dataset a rich multimodal resource for GUI agent development.
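To make the dataset description and the cost figure more concrete, the sketch below shows one possible shape for a single trajectory step and a simple budget calculation. The `Step` schema, its field names, and the example action strings are assumptions made for illustration and are not the paper's actual data format. The only value taken from the source is the reported cost of $0.55 per trajectory.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Per-trajectory cost reported in the paper's abstract.
COST_PER_TRAJECTORY_USD = 0.55


@dataclass
class Step:
    """One step of a trajectory (hypothetical schema, not the paper's format)."""
    reasoning: str                                        # chain-of-thought text
    html_observation: Optional[str] = None                # text modality input
    api_action: Optional[str] = None                      # function-call style action
    screenshot_path: Optional[str] = None                 # vision modality input
    pixel_action: Optional[Tuple[str, int, int]] = None   # (kind, x, y)

    def modality(self) -> str:
        """Report which observation type this step carries."""
        has_text = self.html_observation is not None
        has_vision = self.screenshot_path is not None
        if has_text and has_vision:
            return "both"
        if has_text:
            return "text"
        if has_vision:
            return "vision"
        return "none"


def synthesis_budget(n_trajectories: int) -> float:
    """Estimated spend in USD for n trajectories at the reported unit cost."""
    if n_trajectories < 0:
        raise ValueError("n_trajectories must be non-negative")
    return round(n_trajectories * COST_PER_TRAJECTORY_USD, 2)
```

Under this unit cost, a budget estimate scales linearly with the number of trajectories: `synthesis_budget(10_000)` gives 5500.0 USD. The estimate assumes the reported per-trajectory figure holds at that scale, which the source does not state explicitly.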

Implications and Future Directions

The research shows that guided replay with web tutorials is a practical strategy for large-scale GUI agent training. It suggests that future systems could rely increasingly on automated data generation, further reducing dependence on human-annotated datasets. The combination of vision and language models in the pipeline points toward agents capable of complex decision-making and autonomous task management.

Future work could build on this framework by integrating it with real-time learning systems and extending it to cross-domain applications. Potential uses include any setting that benefits from streamlined human-computer interaction, such as automated customer service and intelligent virtual assistants.

AgentTrek lays groundwork for further research on data synthesis for agents. Its central insight is that readily available web resources can be turned efficiently into synthetic training data, an approach that could shape future progress in agentic automation.
