AgentTrek: Automated Web GUI Trajectories
- AgentTrek is an automated framework synthesizing high-quality, multi-step GUI agent trajectories from web tutorials using a tri-stage pipeline.
- It leverages a combination of machine labeling, sequence-to-sequence task specification, and vision-language guided replay to ensure cost-effective, scalable data generation.
- By integrating chain-of-thought reasoning with multimodal observations, AgentTrek enhances grounding, planning, and overall digital agent performance.
AgentTrek is an automated data generation and training framework designed to synthesize high-quality, multi-step GUI agent trajectories at scale from publicly available web tutorials, structured parsing, and guided execution by vision-language model (VLM) agents. Its core methodology is a tri-stage pipeline: harvesting web content, generating structured task specifications, and performing VLM-guided replay within live web environments, followed by fully automated evaluation and curation. By decoupling data generation from expensive human annotation and integrating chain-of-thought reasoning with multimodal data streams, AgentTrek provides a data foundation for state-of-the-art digital agents that can generalize across web platforms and observation modalities (Xu et al., 2024).
1. Data Harvesting and Tutorial Identification
The first stage identifies tutorial-like documents in web-scale corpora. AgentTrek processes large text repositories (e.g., RedPajama, containing ≈20.8 billion URLs) using domain-specific prefiltering heuristics that scan for GUI action keywords (“click”, “type”, platform names) and enforce thresholds on keyword count and diversity. For example, a text is accepted if its keyword count is at least 20, with four or more distinct keywords and repetition of essential terms. This reduces the search space to ≈68.8 million candidate tutorial segments.
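The prefiltering rule above can be sketched as a simple keyword counter. This is a minimal illustration, not the authors' implementation: the keyword list and the function name `looks_like_tutorial` are assumptions, and the "repetition of essential terms" condition is not modeled.

```python
import re

# Hypothetical keyword list; the source only names "click", "type", and platform names.
GUI_KEYWORDS = {"click", "type", "select", "scroll", "macos", "windows", "chrome"}


def looks_like_tutorial(text, min_total=20, min_distinct=4):
    """Return True if the text passes the keyword-count and diversity thresholds."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = [tok for tok in tokens if tok in GUI_KEYWORDS]
    # Both conditions must hold: enough keyword occurrences and enough variety.
    return len(hits) >= min_total and len(set(hits)) >= min_distinct
```

A text that repeats a single keyword many times fails the diversity check, which is the point of combining the two thresholds.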
To achieve high quality, AgentTrek employs a two-step machine labeling cascade:
- LLM Labeler: A small, manually-labeled seed set is expanded by prompting GPT-4o mini to provide binary labels (“step-by-step GUI instructions”: yes/no), yielding F1 ≈ 0.89 on held-out validation.
- FastText Classifier: A binary classifier is trained on 90,000 LLM- and human-labeled samples. On test data, it achieves precision=0.895, recall=0.895, F1=0.89, filtering to ≈18.8 million tutorial-verified texts (Xu et al., 2024).
This semi-supervised filtration ensures broad recall (92.7% on positive samples) while driving data collection costs to $0.89 per 1,000 segments.
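As a rough sketch of how the labeled samples might be prepared for the FastText stage, the helper below writes each sample in fastText's supervised-training layout (a `__label__` prefix followed by the text on one line). The label names and the function name are assumptions made for illustration.

```python
def to_fasttext_line(text, is_tutorial):
    """Format one labeled sample in fastText's supervised-training layout."""
    # Hypothetical label names; fastText only requires the "__label__" prefix.
    label = "__label__tutorial" if is_tutorial else "__label__other"
    # fastText reads one sample per line, so collapse newlines and extra spaces.
    return f"{label} {' '.join(text.split())}"


samples = [
    ("Click File,\nthen choose Save As.", True),
    ("Quarterly earnings rose.", False),
]
lines = [to_fasttext_line(text, label) for text, label in samples]
```

A file of such lines could then be passed to `fasttext.train_supervised(input=path)`; the hyperparameters used in the source are not stated here.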
2. Task Specification Generation
Each filtered text is parsed into a structured, machine-consumable task specification via sequence-to-sequence language modeling. The output schema encompasses:
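- Task description ($d$)
- Prerequisites ($P$)
- Step-by-step instructions ($I = \langle s_1, ..., s_n \rangle$, with $n \geq 2$)
- Expected outcome ($o^*$)
Each tutorial $t$ is thus mapped to $Spec(t) = (d_t, P_t, I_t, o^*_t)$. Tagging and paraphrasing together cost about \$0.89 per 1,000 tutorials.
3. VLM-Guided Replay
- Observation Space: At each step $t$, the agent observes a screenshot ($img_t$) and the page's accessibility tree (AXTree), with the latter providing node-level semantic identifiers (bids).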
- Action Space: The agent issues discrete GUI actions (e.g., click, type, moveTo, scroll) mapped automatically from high-level task instructions to pyautogui/Playwright commands.
- Agent Model: Qwen2-VL serves as the backbone VLM, with a NaViT image encoder and cross-modal transformer layers. At each step, the agent constructs a context containing the task description, current and previous observations, action and reasoning history, and remaining instructions. The output consists of a chain-of-thought reasoning string and an action, both generated via greedy decoding from the model’s output head. This organization allows multimodal conditioning and context reuse (≈8,000 tokens per step).
This replay-through-execution produces detailed, multimodal trajectories, including both the chain-of-thought and action sequences.
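The guided-replay loop can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: the `TaskSpec` field names, the `replay` function, the `"stop"` action, and the stub policy and environment are all hypothetical, and a real run would call a VLM and a browser environment instead of the stubs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskSpec:
    description: str
    prerequisites: str
    instructions: List[str]
    expected_outcome: str


@dataclass
class Step:
    observation: str
    thought: str
    action: str


def replay(spec, policy, env_step, initial_obs, max_steps=30):
    """Roll out a policy on one task, recording (observation, thought, action)."""
    trajectory = []
    obs = initial_obs
    for _ in range(max_steps):
        # The policy sees the spec, the current observation, and the history so far.
        thought, action = policy(spec, obs, trajectory)
        trajectory.append(Step(obs, thought, action))
        if action == "stop":
            break
        obs = env_step(action)
    return trajectory


# Stub policy: follows the tutorial instructions in order, then stops.
def stub_policy(spec, obs, history):
    i = len(history)
    if i >= len(spec.instructions):
        return "All instructions are done.", "stop"
    return f"Next instruction: {spec.instructions[i]}", f"do:{spec.instructions[i]}"


# Stub environment: returns a textual description of the resulting page.
def stub_env_step(action):
    return f"page after {action}"


spec = TaskSpec("Save a file", "Editor is open", ["click File", "click Save"], "File saved")
trajectory = replay(spec, stub_policy, stub_env_step, "start page")
```

Keeping the thought next to each action in the recorded step is what lets the same trajectory serve both as action supervision and as chain-of-thought training data.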
4. Evaluation, Data Quality, and Representation
AgentTrek integrates an autonomous VLM-based evaluation module (GPT-4o) to curate trajectory quality. The evaluator is given the task description together with the full sequence of chain-of-thought steps and executed actions, and it outputs a binary verdict indicating whether the trajectory succeeded. A trajectory is accepted only if the evaluator's output passes a confidence threshold. Empirically, this yields 84.0% accuracy against human annotators, competitive with other VLM-based evaluators on benchmarks (80–82% range).
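The acceptance rule can be illustrated with a small filter. The tuple layout, the function name `curate`, and the default threshold of 0.5 are assumptions for illustration only; the threshold actually used is not recoverable from this text.

```python
def curate(verdicts, threshold=0.5):
    """Keep ids of trajectories judged successful with sufficient confidence.

    `verdicts` is a list of (trajectory_id, success, confidence) tuples,
    a hypothetical format for the evaluator's output.
    """
    return [tid for tid, success, conf in verdicts if success and conf >= threshold]


accepted = curate([("a", True, 0.9), ("b", True, 0.3), ("c", False, 0.95)])
```

Here only trajectory `a` survives: `b` is a low-confidence success and `c` is a confident failure.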
Accepted trajectories include:
- Task metadata (JSON spec)
- Full textual record (description, observations, thoughts, actions)
- Visual assets (screenshots, video, final HTML, AXTree, Playwright DOM/network traces)
All multimodal streams are preserved, allowing training with text-only, vision-only, or hybrid observation spaces.
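Because every stream is stored, a training pipeline can choose the observation space at load time. The sketch below assumes a hypothetical record layout with `axtree` and `screenshot` keys; the actual storage schema is not specified here.

```python
def build_observation(record, mode):
    """Select observation fields for a text-only, vision-only, or hybrid setup."""
    if mode == "text":
        return {"axtree": record["axtree"]}
    if mode == "vision":
        return {"screenshot": record["screenshot"]}
    if mode == "hybrid":
        return {"axtree": record["axtree"], "screenshot": record["screenshot"]}
    raise ValueError(f"unknown mode: {mode}")


record = {"axtree": "[12] button 'Save'", "screenshot": "step_003.png"}
text_obs = build_observation(record, "text")
hybrid_obs = build_observation(record, "hybrid")
```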
5. Experimental Results and Benchmark Performance
AgentTrek-trained agents demonstrate significant improvements across standard benchmarks using both text-based and vision-based input modalities.
WebArena (Text-based)
- Qwen2.5-7B: 10.46% success (vs. Llama3-chat-8B: 3.32%)
- Qwen2.5-32B: 16.26% success
- GPT-4 (API): 14.41%
- AutoWebGLM: 18.20%
ScreenSpot Web (Grounding Accuracy)
- Qwen2-VL-7B baseline: 30.7% average
- SeeClick: 44.7%
- GPT-4+OmniParser: 67.0%
- Qwen2-VL-7B with AgentTrek tuning: 67.4% (matching GPT-4+OmniParser)
Multimodal Mind2Web (Image-Only)
- Qwen2-VL-7B with AgentTrek: Cross-Domain step success 42.1%
- Combined with Mind2Web: Cross-Domain step success 52.6%
Training with AgentTrek not only boosts grounding and planning accuracy but also enables cost-effective scaling, with the cost per high-quality, verified trajectory at \$0.551, compared with the \$8–\$20 per trajectory typical of human-annotated pipelines (Xu et al., 2024).
6. Cost Model, Scale, and Broader Implications
The cost of the AgentTrek synthesis pipeline breaks down by stage as follows:
| Cost Source | Cost per 1,000 | Details |
|---|---|---|
| Tagging/paraphrase | \$0.886 | LLM and classifier for tutorial extraction |
| Replay (guided) | \$215.36 | VLM execution in BrowserGym |
| VLM evaluation | \$3.104 | Automated GPT-4o-based filtering |
For web-related tutorials, the three stages sum to \$219.35 per 1,000 attempted trajectories; after accounting for the replay success rate, the overall cost is \$0.551 per verified trajectory.
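The arithmetic behind these figures can be checked directly from the table. The pass rate computed below is an inference, not a number stated in this text: it takes the per-verified-trajectory figure as \$0.551 and assumes that cost per verified trajectory equals cost per attempt divided by the fraction of attempts that pass verification.

```python
# Per-1,000 costs in USD, copied from the table above.
cost_per_1000 = {
    "tagging_paraphrase": 0.886,
    "guided_replay": 215.36,
    "vlm_evaluation": 3.104,
}

total_per_1000 = sum(cost_per_1000.values())   # about 219.35
cost_per_attempt = total_per_1000 / 1000       # about 0.219

# Assumption: verified cost = attempt cost / pass rate, so the rate is implied.
reported_cost_per_verified = 0.551
implied_pass_rate = cost_per_attempt / reported_cost_per_verified  # about 0.398
```

Under that assumption, roughly two in five replayed trajectories would need to pass verification for the reported figure to hold, and replay dominates the budget at about 98% of the total.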
The finalized dataset contains 10,398 multistep web trajectories, with an average 12.1 steps per trajectory, spanning 127 domains and 11 categories. Each includes full multimodal context, enabling generalization to both textual and vision-intensive agent settings.
Broader implications include the removal of manual annotation as a bottleneck in GUI agent development, enabling rapid scaling to millions of tasks across diverse platforms and observation modalities. A plausible implication is the feasibility of developing robust, chain-of-thought-capable agents that maintain high performance in planning and grounding benchmarks while incurring negligible marginal annotation cost.
7. Comparison to Related Approaches and Significance
AgentTrek distinguishes itself from previous approaches by:
- Full automation of the tutorial-to-dataset pipeline (harvesting, parsing, execution, evaluation)
- Multimodal coverage, encoding both vision-based and structure-based web observations alongside discrete GUI actions
- Automated, VLM-based trajectory success evaluation competitive with human verification
- Data cost reduction by more than one order of magnitude compared to human annotation workflows
When evaluated against leading baselines, the AgentTrek data pipeline enabled Qwen2-VL and Qwen2.5 models to match or surpass proprietary GPT-4-level performance in both grounding and planning domains. The integration of chain-of-thought reasoning into each trajectory further provides a foundation for the development of more advanced, interpretable digital assistants (Xu et al., 2024).