AgentTrek: Automated Web GUI Trajectories

Updated 6 May 2026

AgentTrek is an automated framework that synthesizes high-quality, multi-step GUI agent trajectories from web tutorials using a tri-stage pipeline.
It leverages a combination of machine labeling, sequence-to-sequence task specification, and vision-language guided replay to ensure cost-effective, scalable data generation.
By integrating chain-of-thought reasoning with multimodal observations, AgentTrek enhances grounding, planning, and overall digital agent performance.

AgentTrek is an automated data generation and training framework designed to synthesize high-quality, multi-step GUI agent trajectories at scale from publicly available web tutorials, using structured parsing and guided execution by vision-language model (VLM) agents. Its core methodology is a tri-stage pipeline: harvesting web content, generating structured task specifications, and performing VLM-guided replay within live web environments, followed by fully automated evaluation and curation. By decoupling data generation from expensive human annotation and integrating chain-of-thought reasoning with multimodal data streams, AgentTrek provides a data foundation for state-of-the-art digital agents that can generalize across web platforms and observation modalities (Xu et al., 2024).

1. Data Harvesting and Tutorial Identification

The first stage identifies tutorial-like documents in web-scale corpora. AgentTrek processes large text repositories (e.g., RedPajama, containing ≈20.8 billion URLs) using domain-specific prefiltering heuristics that scan for GUI action keywords (“click”, “type”, platform names) and enforce thresholds on keyword count and diversity. For example, a text is accepted if its keyword count is at least 20, with four or more distinct keywords and repetition of essential terms. This reduces the search space to ≈68.8 million candidate tutorial segments.
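The prefilter described above can be sketched as a simple keyword check. This is a minimal illustration, not the authors' implementation: the keyword set is hypothetical (the text names only “click”, “type”, and platform names), and reading "repetition of essential terms" as "some keyword occurs at least twice" is an assumption.

```python
import re
from collections import Counter

# Hypothetical keyword list for illustration only.
GUI_KEYWORDS = {"click", "type", "select", "scroll", "button", "menu", "chrome", "windows"}


def passes_prefilter(text, min_total=20, min_distinct=4, min_repeat=2):
    """Return True if the text meets the keyword-count thresholds.

    min_total and min_distinct follow the thresholds quoted in the text;
    min_repeat is an assumed reading of "repetition of essential terms".
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tok for tok in tokens if tok in GUI_KEYWORDS)
    total = sum(counts.values())          # total keyword occurrences
    distinct = len(counts)                # number of different keywords seen
    repeated = any(c >= min_repeat for c in counts.values())
    return total >= min_total and distinct >= min_distinct and repeated


tutorial_like = "Click the menu button, then type your name and scroll down. " * 5
plain_prose = "The weather was pleasant and we walked to the river."
single_keyword = "click " * 30
```

A cheap lexical check of this kind only narrows the candidate pool; the labeling cascade that follows is what decides whether a text is actually a tutorial.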

To achieve high quality, AgentTrek employs a two-step machine labeling cascade:

LLM Labeler: A small, manually labeled seed set is expanded by prompting GPT-4o mini to provide binary labels (“step-by-step GUI instructions”: yes/no), yielding F1 ≈ 0.89 on held-out validation.
FastText Classifier: A binary classifier is trained on 90,000 LLM- and human-labeled samples. On test data, it achieves precision = 0.895, recall = 0.895, and F1 = 0.89, filtering the pool to ≈18.8 million tutorial-verified texts (Xu et al., 2024).

This semi-supervised filtration ensures broad recall (92.7% on positive samples) while driving data collection costs to $0.89 per 1,000 segments.
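As a quick arithmetic check on the figures quoted in this section, the snippet below recomputes F1 from the reported precision and recall and derives the retention rate of each filtering stage. The retention ratios are only rough, since the text counts URLs at the first stage and segments afterwards; the variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported FastText classifier metrics.
f1 = f1_score(0.895, 0.895)

# Funnel sizes quoted in the text.
urls_scanned = 20.8e9      # RedPajama URLs
prefiltered = 68.8e6       # candidates after keyword prefiltering
classifier_kept = 18.8e6   # tutorial-verified texts after classification

prefilter_rate = prefiltered / urls_scanned       # about 0.33%
classifier_rate = classifier_kept / prefiltered   # about 27.3%
```

With equal precision and recall, F1 equals that shared value (0.895), which is consistent with the two-decimal figure of 0.89 quoted above.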

2. Task Specification Generation

Each filtered text is parsed into a structured, machine-consumable task specification via sequence-to-sequence language modeling. The output schema encompasses:

Task description ($d$)
List of prerequisites ($P$)
Ordered step-by-step instructions ($I = \langle s_1, ..., s_n \rangle$)
Target platform, object type, web URL, and expected result ($o^*$)

For every tutorial $t$, this yields $Spec(t) = (d_t, P_t, I_t, o^*_t)$. Specifications are retained only when the step count $n \geq 2$, which automatically selects substantive multi-step processes. GPT-4o mini is used to generate these specifications in JSON format, keeping annotation cost to \$0.89 per 1,000 entries.

3. Guided Replay and Trajectory Synthesis

Given a structured specification, AgentTrek employs a vision-language agent to execute the task within a live Chromium environment using BrowserGym. Each execution comprises:

Observations: For each timestep $t$, the agent receives a screenshot ($img_t$) and an accessibility tree (AXTree), with the latter providing node-level semantic identifiers (bids).
Action Space: The agent issues discrete GUI actions (e.g., click, type, moveTo, scroll) mapped automatically from high-level task instructions to pyautogui/Playwright commands.
Agent Model: Qwen2-VL serves as the backbone VLM, with a NaViT image encoder and cross-modal transformer layers. At each step, the agent constructs a context containing the task description, current and previous observations, action and reasoning history, and remaining instructions. The output consists of a chain-of-thought reasoning string and an action, both generated via greedy decoding from the model’s output head. This organization allows multimodal conditioning and context reuse (≈8,000 tokens per step).
This replay-through-execution produces detailed, multimodal trajectories, including both the chain-of-thought and action sequences.
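The specification and replay loop can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the source's pseudocode: `TaskSpec`, `Step`, `ToyEnv`, `guided_replay`, and `scripted_policy` are hypothetical names, `ToyEnv` is a stand-in that does not reproduce the BrowserGym API, and the scripted policy takes the place of the VLM.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskSpec:
    """Structured specification Spec(t) = (d, P, I, o*)."""
    description: str          # d
    prerequisites: List[str]  # P
    instructions: List[str]   # I = <s_1, ..., s_n>
    expected_result: str      # o*

    def is_multistep(self) -> bool:
        # Specifications are retained only when n >= 2.
        return len(self.instructions) >= 2


@dataclass
class Step:
    observation: str
    thought: str   # chain-of-thought text
    action: str


class ToyEnv:
    """Stand-in for a browser environment (hypothetical, not BrowserGym)."""

    def __init__(self):
        self.log = []

    def observe(self) -> str:
        return f"page after {len(self.log)} actions"

    def step(self, action: str) -> None:
        self.log.append(action)


def scripted_policy(spec, observation, history):
    """Toy policy: follow the instructions in order, then stop."""
    i = len(history)
    if i >= len(spec.instructions):
        return "All instructions done.", "stop"
    return f"Next instruction: {spec.instructions[i]}", f"do({i})"


def guided_replay(spec, env, policy, max_steps=30):
    """Observe, reason, act; record every step until the policy stops."""
    history = []
    for _ in range(max_steps):
        observation = env.observe()
        thought, action = policy(spec, observation, history)
        history.append(Step(observation, thought, action))
        if action == "stop":
            break
        env.step(action)
    return history


spec = TaskSpec("Rename a file", ["Logged in"], ["Open menu", "Click rename"], "File renamed")
env = ToyEnv()
trajectory = guided_replay(spec, env, scripted_policy)
```

Recording the thought alongside each action is what lets the resulting trajectory serve as chain-of-thought training data rather than a bare action log.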

4. Evaluation, Data Quality, and Representation

AgentTrek integrates an autonomous VLM-based evaluation module (GPT-4o) to curate trajectory quality. The evaluator is given the full sequence of task description, chain-of-thought steps, and executed actions; it outputs a binary verdict indicating trajectory success. Trajectories are accepted only if the evaluator’s output passes a confidence threshold. Empirically, this yields 84.0% accuracy against human annotators, competitive with other VLM-based evaluators on benchmarks (80–82% range).
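The acceptance rule and the agreement measurement can be illustrated with a short sketch. The representation of a verdict as a `(success, confidence)` pair and the threshold values used in the examples are assumptions for illustration; the text does not state the threshold.

```python
from typing import List, Tuple

Verdict = Tuple[bool, float]  # (success flag, confidence in [0, 1]); assumed shape


def accept(verdict: Verdict, threshold: float) -> bool:
    """Keep a trajectory only if judged successful with enough confidence."""
    success, confidence = verdict
    return success and confidence >= threshold


def agreement(predicted: List[bool], human: List[bool]) -> float:
    """Fraction of trajectories where the evaluator matches the human label."""
    if len(predicted) != len(human) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(p == h for p, h in zip(predicted, human)) / len(human)


# Toy labels, not data from the paper.
toy_agreement = agreement([True, False, True, True], [True, False, False, True])
```

The 84.0% figure reported above is an agreement rate of this kind, measured against human annotators.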

Accepted trajectories include:

Task metadata (JSON spec)
Full textual record (description, observations, thoughts, actions)
Visual assets (screenshots, video, final HTML, AXTree, Playwright DOM/network traces)

All multimodal streams are preserved, allowing training with text-only, vision-only, or hybrid observation spaces.
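Because each step keeps both the screenshot and the accessibility tree, a training pipeline can choose the observation space at load time. The sketch below shows one way this could look; the `StepRecord` fields and the mode names are hypothetical and do not describe the released dataset format.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class StepRecord:
    screenshot_path: str  # path to the saved screenshot
    axtree: str           # serialized accessibility tree
    thought: str
    action: str


def build_observation(step: StepRecord, mode: str) -> Dict[str, str]:
    """Select which observation streams a model sees for this step."""
    if mode == "text":
        return {"axtree": step.axtree}
    if mode == "vision":
        return {"screenshot": step.screenshot_path}
    if mode == "hybrid":
        return {"axtree": step.axtree, "screenshot": step.screenshot_path}
    raise ValueError(f"unknown observation mode: {mode}")


record = StepRecord("step_000.png", "[12] button 'Save'", "Save the form.", "click('12')")
```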

5. Experimental Results and Benchmark Performance

AgentTrek-trained agents demonstrate significant improvements across standard benchmarks using both text-based and vision-based input modalities.

WebArena (Text-based)

Qwen2.5-7B: 10.46% success (vs. Llama3-chat-8B: 3.32%)
Qwen2.5-32B: 16.26% success
GPT-4 (API): 14.41%
AutoWebGLM: 18.20%

ScreenSpot Web (Grounding Accuracy)

Qwen2-VL-7B baseline: 30.7% average
SeeClick: 44.7%
GPT-4+OmniParser: 67.0%
Qwen2-VL-7B with AgentTrek tuning: 67.4% (matching GPT-4+OmniParser)

Multimodal Mind2Web (Image-Only)

Qwen2-VL-7B with AgentTrek: Cross-Domain step success 42.1%
Combined with Mind2Web: Cross-Domain step success 52.6%

Training with AgentTrek not only boosts grounding and planning accuracy but also enables cost-effective scaling, with the cost per high-quality, verified trajectory at \$0.551, compared with the \$8–\$20 per trajectory typical of human-annotated pipelines (Xu et al., 2024).

6. Cost Model, Scale, and Broader Implications

The cost-effectiveness and scalability of the AgentTrek synthesis pipeline follow from its per-stage costs:

Cost Source	Cost per 1,000	Details
Tagging/paraphrase	\$0.886	LLM and classifier for tutorial extraction
Replay (guided)	\$215.36	VLM execution in BrowserGym
VLM evaluation	\$3.104	Automated GPT-4o-based filtering

For web-related tutorials, after accounting for the replay success rate, the overall cost is approximately \$0.551 per verified trajectory.
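The per-trajectory figure can be related to the table above with a small calculation. The formula used here is an assumption, namely that all three stage costs are paid per attempted item and that only a fraction of attempts yields a verified trajectory; the source's exact expression and rates are not reproduced, so the implied success rate is a back-of-envelope estimate only.

```python
# Per-1,000 costs in USD, taken from the table.
COST_PER_1000 = {
    "tagging_paraphrase": 0.886,
    "guided_replay": 215.36,
    "vlm_evaluation": 3.104,
}

total_per_1000 = sum(COST_PER_1000.values())   # 219.35 USD per 1,000 attempts
cost_per_attempt = total_per_1000 / 1000


def cost_per_verified(cost_per_attempt, success_rate):
    """Assumed model: attempt cost spread over the fraction that is verified."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Success rate that would reproduce the quoted 0.551 USD under this assumption.
implied_success_rate = cost_per_attempt / 0.551
replay_share = COST_PER_1000["guided_replay"] / total_per_1000
```

Under this reading, guided replay accounts for roughly 98% of the per-attempt cost, and the quoted \$0.551 would correspond to about 40% of attempts ending in a verified trajectory.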

The finalized dataset contains 10,398 multi-step web trajectories, averaging 12.1 steps per trajectory and spanning 127 domains and 11 categories. Each trajectory includes full multimodal context, enabling generalization to both textual and vision-intensive agent settings.

Broader implications include the removal of manual annotation as a bottleneck in GUI agent development, enabling rapid scaling to millions of tasks across diverse platforms and observation modalities. A plausible implication is the feasibility of developing robust, chain-of-thought-capable agents that maintain high performance in planning and grounding benchmarks while incurring negligible marginal annotation cost.

AgentTrek distinguishes itself from previous approaches by:

Full automation of the tutorial-to-dataset pipeline (harvesting, parsing, execution, evaluation)
Multimodal coverage, encoding both vision-based and structure-based web observations alongside discrete GUI actions
Automated, VLM-based trajectory success evaluation competitive with human verification
Data cost reduction by more than one order of magnitude compared to human annotation workflows

When evaluated against leading baselines, the AgentTrek data pipeline enabled Qwen2-VL and Qwen2.5 models to match or surpass proprietary GPT-4-level performance in both grounding and planning domains. The integration of chain-of-thought reasoning into each trajectory further provides a foundation for the development of more advanced, interpretable digital assistants (Xu et al., 2024).

References

Xu et al. (2024). AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials.
