Structured Distillation of Web Agent Capabilities Enables Generalization

Published 9 Apr 2026 in cs.LG | (2604.07776v1)

Abstract: Frontier LLMs can navigate complex websites, but their cost and reliance on third-party APIs make local deployment impractical. We introduce Agent-as-Annotators, a framework that structures synthetic trajectory generation for web agents by analogy to human annotation roles, replacing the Task Designer, Annotator, and Supervisor with modular LLM components. Using Gemini 3 Pro as teacher, we generate 3,000 trajectories across six web environments and fine-tune a 9B-parameter student with pure supervised learning on the 2,322 that pass quality filtering. The resulting model achieves 41.5% on WebArena, surpassing closed-source models such as Claude 3.5 Sonnet (36.0%) and GPT-4o (31.5%) under the same evaluation protocol, and nearly doubling the previous best open-weight result (Go-Browse, 21.7%). Capabilities transfer to unseen environments, with an 18.2 percentage point gain on WorkArena L1 (an enterprise platform never seen during training) and consistent improvements across three additional benchmarks. Ablations confirm that each pipeline component contributes meaningfully, with Judge filtering, evaluation hints, and reasoning traces each accounting for measurable gains. These results demonstrate that structured trajectory synthesis from a single frontier teacher is sufficient to produce competitive, locally deployable web agents. Project page: https://agent-as-annotators.github.io

Authors (2)

Summary

Structured Distillation for Web Agents: Framework, Implementation, and Empirical Outcomes

Framework and Motivation

The paper "Structured Distillation of Web Agent Capabilities Enables Generalization" [2604.07776] introduces the Agent-as-Annotators framework for systematically structuring large language model (LLM)-based synthetic trajectory generation in web agent tasks. The framework draws an analogy between traditional human annotation roles—Task Designer, Annotator, and Supervisor—and modular LLM components: Persona Generator & Task Generator (Task Designer), Agent (Annotator), and Judge (Supervisor). By replacing each human annotation function with an LLM-driven module, the framework enforces modularity and explicit quality control in synthetic trajectory pipelines.
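The role mapping can be pictured as three swappable callables feeding a filtered dataset. The sketch below is illustrative only; the class and function names are assumptions, not identifiers from the paper's released code.

```python
# Hypothetical sketch of the Agent-as-Annotators role mapping:
# Task Designer -> generate_task, Annotator -> act, Supervisor -> judge.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    persona: str          # synthesized user persona
    intent: str           # natural-language task goal
    hints: list[str]      # evaluation hints consumed by the Judge

@dataclass
class Trajectory:
    task: Task
    steps: list[tuple[str, str]]  # (observation, action) pairs

def run_pipeline(
    generate_task: Callable[[], Task],   # Task Designer role
    act: Callable[[Task], Trajectory],   # Annotator role
    judge: Callable[[Trajectory], bool], # Supervisor role
    n_tasks: int,
) -> list[Trajectory]:
    """Generate tasks, execute them, keep only Judge-approved trajectories."""
    kept = []
    for _ in range(n_tasks):
        trajectory = act(generate_task())
        if judge(trajectory):
            kept.append(trajectory)
    return kept
```

Because each role is just a callable, swapping the teacher model, persona pool, or judging protocol touches one component, which is what makes the framework's ablations and cross-method comparisons straightforward.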

Figure 1: The Agent-as-Annotators pipeline replaces three human annotation roles with LLM modules: synthesized personas and task intents with evaluation hints, LLM-based agent for execution, and Judge for trajectory validation.

Prior approaches (e.g., InSTA, NNetNav, Go-Browse) are recast as partial instantiations within this scheme, enabling principled comparison and ablation. Unlike retroactive task labeling approaches, Agent-as-Annotators provides grounded task generation with explicit evaluation hints, affording higher reliability in both training and evaluation signals. This modularity enables data quality control and systematic distillation from frontier models.

Pipeline Implementation

The implementation targets six complex self-hosted web environments (WebArena), leveraging Gemini 3 Pro as the teacher LLM for exploration, task synthesis, agent trajectories, and judging. Persona diversity is enforced by assigning 250 distinct LLM-synthesized user personas across all environments. The pipeline yields 3,000 synthesized tasks, each paired with structured evaluation hints. Only trajectories deemed successful by the Judge are retained, resulting in 2,322 task-complete trajectories (16,353 observation-action pairs).
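A quick back-of-envelope check on the reported counts gives the Judge retention rate and the average trajectory length:

```python
# Derived directly from the counts reported in the paper.
n_generated, n_kept = 3000, 2322
retention = n_kept / n_generated          # fraction passing Judge filtering
avg_steps = 16353 / n_kept                # observation-action pairs per kept trajectory
print(f"Judge retention: {retention:.1%}, ~{avg_steps:.1f} steps/trajectory")
# Judge retention: 77.4%, ~7.0 steps/trajectory
```

So roughly three in four teacher trajectories survive filtering, and the kept trajectories average about seven steps each.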

The student model, Qwen3.5-9B (multimodal), is fine-tuned via standard supervised fine-tuning (SFT) for two epochs on these supervised examples. Each step is formatted as an observation input (accessibility tree, screenshot, goal) and an action output, augmented with the teacher's explicit reasoning block.
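One step of this formatting might look as follows. The prompt template and the `<think>` tag are illustrative assumptions, not the paper's released serialization.

```python
# Hypothetical serialization of one trajectory step into an SFT example:
# observation goes into the prompt, teacher reasoning plus action into the target.
def format_step(goal: str, axtree: str, reasoning: str, action: str):
    """Return one (prompt, target) supervised pair for a single step."""
    prompt = (
        f"Goal: {goal}\n"
        f"Accessibility tree:\n{axtree}\n"
        "What is the next action?"
    )
    # The teacher's reasoning block is kept intact in the target, since the
    # paper's ablation finds truncated traces especially harmful.
    target = f"<think>{reasoning}</think>\n{action}"
    return prompt, target
```

A screenshot would accompany the prompt in the multimodal setting; it is omitted here for brevity.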

Empirical Results: Generalization and Capability Transfer

A3-Qwen3.5-9B achieves a 41.5% test success rate on WebArena, exceeding GPT-4o (31.5%) and Claude 3.5 Sonnet (36.0%) in the BrowserGym harness, and nearly doubling the previous best open-weight result (Go-Browse, 21.7%). The distilled student matches the performance of the 3× larger Qwen3.5-27B baseline, validating the efficacy of structured distillation. Importantly, cross-domain transfer is observed: on WorkArena L1 (enterprise ServiceNow, never seen in training), the model achieves an 18.2 percentage point (pp) gain, indicating that structured trajectory synthesis fosters generalizable web-agent primitives.

Figure 4: Cross-benchmark success rates for Qwen3.5-9B before and after fine-tuning; largest generalization gain is +18.2pp in a never-seen enterprise interface.

The distilled agent exhibits improved sample efficiency and compositional behavior across atomic web tasks (MiniWoB, +5.8pp) and visually grounded environments (VisualWebArena, +7.5pp). Qualitative comparisons demonstrate more efficient task completion with fewer steps and improved goal targeting.

Figure 2: Comparison of base vs. fine-tuned model behavior on a Shopping Admin task: the fine-tuned agent efficiently completes navigation and action in 2 steps, versus 10 confused actions.

Figure 5: On WorkArena L1, the fine-tuned model configures and submits an order in 5 actions, compared to the base model's failure over 15 actions.

The Role of Teacher Quality and Reasoning Budget

Teacher selection is paramount: Gemini 3 Pro (reduced thinking budget) outperforms newer variants and Flash configurations on all six environments. Training a student on high-quality teacher trajectories yields substantial performance improvements, whereas larger training quantities from lower-quality teachers do not compensate for decreased trajectory quality.

Notably, the paper reports a counterintuitive result: reducing the teacher's reasoning budget (yielding concise reasoning traces) increases both trajectory success rate and downstream student performance, running counter to the common expectation that longer reasoning traces provide a stronger distillation signal. The reasoning-trace ablation further confirms that intact, coherent teacher reasoning is essential: truncating or removing traces degrades student learning, with abrupt truncation being especially harmful.
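The three training-target conditions in this ablation can be sketched as follows; the `<think>` tag and the quarter-length truncation point are illustrative assumptions, not details from the paper.

```python
# Sketch of the three training-target conditions in the reasoning-trace
# ablation: intact (best student), removed, and truncated (worst student).
def make_target(reasoning: str, action: str, condition: str) -> str:
    """Build one SFT target under a given reasoning-trace condition."""
    if condition == "intact":      # full teacher trace kept
        return f"<think>{reasoning}</think>\n{action}"
    if condition == "removed":     # action only, no reasoning
        return action
    if condition == "truncated":   # abrupt mid-trace cut
        return f"<think>{reasoning[: len(reasoning) // 4]}</think>\n{action}"
    raise ValueError(f"unknown condition: {condition}")
```

The intuition behind the result is that a truncated trace teaches the student to imitate incomplete reasoning, whereas removal at least leaves the action supervision uncontaminated.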

Figure 3: Teacher quality on A3-Synth (x-axis) directly predicts student performance on WebArena (y-axis); Gemini 3 Pro (reduced thinking) yields superior outcomes.

Figure 9: Success rate increases with training data up to a plateau; abrupt truncation of reasoning traces is more damaging than their removal, underscoring the importance of reasoning-trace integrity.

Pipeline Ablations and Data Scaling

Comprehensive ablations show that each pipeline component contributes substantially: removing Judge filtering costs 4.5 pp, removing evaluation hints costs 2.4 pp, and removing reasoning traces costs 7.9 pp. Performance increases smoothly with training-data scale, but diminishing returns appear beyond ~1,430 trajectories, suggesting that further scaling within a fixed set of environments yields only marginal gains.

Qualitative Analysis: Compositionality and Skill Acquisition

Across benchmarks, the A3 agent demonstrates emergent behavior: efficient pathfinding, correct form input, and robust compositional reasoning in unseen environments. Comparative figures illustrate failure cases in the base model—such as aimless navigation, incorrect submission, or ineffective input formatting—contrasted with the distilled agent's direct task completion and goal alignment.

Figure 6: Fine-tuned agent successfully navigates ServiceNow to retrieve warranty expiration, while the base model fails after extensive navigation.

Figure 7: In visually grounded Reddit tasks, the fine-tuned agent completes comment submission in 2 actions, avoiding login traps encountered by the base model.

Figure 8: MiniWoB atomic task: fine-tuned agent submits correctly formatted time entry in 3 actions, versus base model's repetitive and incorrect input attempts.

Practical and Theoretical Implications

The research demonstrates that structured distillation from a single frontier LLM suffices to bridge the capability gap for small open-weight web agents, matching or surpassing closed-source models several times larger. The findings reinforce the dominance of data quality over data quantity and underscore the critical value of intact teacher reasoning, modular judging, and explicit evaluation hints. The modularity of the Agent-as-Annotators framework simplifies adaptation and benchmarking of synthetic trajectory pipelines and allows direct comparison between approaches.

Practically, the framework enables locally deployable agent solutions that do not rely on expensive or privacy-sensitive third-party API access. The results suggest that modular pipelines with high-quality teacher trajectories and rigorous filtering are more impactful than indiscriminate scaling. The empirical transfer across compositional, enterprise, visual, and atomic web environments suggests that the distilled skills are not environment-specific shortcuts but general web-interaction primitives.

Theoretically, the observations about reasoning budget and trajectory quality invite further investigation into optimal teacher signal synthesis and its interaction with student model capacity. Additionally, the modular design facilitates integration with reinforcement learning refinement, iterative self-improvement, and multi-agent systems, potentially compounding generalization gains.

Future Directions

Key avenues for future research include scaling environment diversity, applying iterative self-distillation with student-generated traces, integrating RL refinement, and validating judge performance against human annotations. The modularity of the framework provides clear pathways for swapping teacher models, expanding persona pools, or integrating alternative evaluation protocols. Extending compositional task sequencing and cycle-based self-improvement strategies can potentially further amplify agent capabilities.

Conclusion

"Structured Distillation of Web Agent Capabilities Enables Generalization" establishes a modular, quality-first pipeline for LLM-driven web agent training. Empirical validation confirms that small open-weight models acquire generalist web skills via structured distillation, achieving performance competitive with frontier closed-source models and demonstrating strong transfer in out-of-distribution settings. The explicit ablation and cross-benchmark analysis position Agent-as-Annotators as a robust baseline for web agent synthetic data pipelines. The release of full datasets, pipeline code, and checkpoints is set to accelerate reproducible research and broad accessibility in autonomous web agent deployment.
