WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

Published 13 Apr 2026 in cs.AI and cs.CV | (2604.10988v1)

Abstract: Existing browser agent benchmarks face a fundamental trilemma: real-website benchmarks lack reproducibility due to content drift, controlled environments sacrifice realism by omitting real-web noise, and both require costly manual curation that limits scalability. We present WebForge, the first fully automated framework that resolves this trilemma through a four-agent pipeline -- Plan, Generate, Refine, and Validate -- that produces interactive, self-contained web environments end-to-end without human annotation. A seven-dimensional difficulty control framework structures task design along navigation depth, visual complexity, reasoning difficulty, and more, enabling systematic capability profiling beyond single aggregate scores. Using WebForge, we construct WebForge-Bench, a benchmark of 934 tasks spanning 7 domains and 3 difficulty levels. Multi-model experiments show that difficulty stratification effectively differentiates model capabilities, while cross-domain analysis exposes capability biases invisible to aggregate metrics. Together, these results confirm that multi-dimensional evaluation reveals distinct capability profiles that a single aggregate score cannot capture. Code and benchmark are publicly available at https://github.com/yuandaxia2001/WebForge.


Authors (4)

Summary

The paper introduces an automated LLM-driven pipeline that constructs realistic, reproducible, and scalable benchmarks for browser agents.
It employs a four-stage agent framework—plan, generation, refinement, and validation—to generate interactive web tasks with controlled multi-dimensional difficulty.
Empirical evaluations reveal that the benchmarks effectively differentiate agent performance, exposing modality impacts and domain-specific challenges.

WebForge: Automated Multi-Dimensional Benchmark Generation for Browser Agents

Motivation and Positioning

WebForge introduces a fully automated pipeline for constructing browser agent benchmarks that robustly address the entrenched realism–reproducibility–scalability trilemma present in prior art. Previous benchmarks using real websites achieve ecological validity but rapidly decay due to content drift, while controlled environments ensure reproducibility at the expense of stimulus realism and require unsustainable manual curation. Existing automated generation strategies are limited to non-interactive or static tasks, lacking support for multi-dimensional, interactive, and noise-robust web environments. WebForge provides a comprehensive solution by leveraging multi-agent LLM pipelines and principled difficulty control, demonstrated to produce challenging, realistically noisy, scalable, and human-annotation-free testbeds for browser-based autonomous agents.

Automated Pipeline Architecture

The WebForge pipeline consists of four sequentially orchestrated LLM-driven agents: Plan Agent, Generation Agent, Refinement Agent, and Validation Agent. Each agent has a well-defined interface and operational semantics, enabling fully reproducible and interpretable benchmark construction.

Figure 1: The four-stage pipeline for automated web environment and task generation, encapsulating plan design, website synthesis, environment refinement, and browser-level validation.

Plan Agent leverages a dual-stage LLM process. An initial high-temperature draft produces diverse task blueprints, while a subsequent low-temperature refinement stage ensures logical soundness, compliance with a seven-dimensional, three-level difficulty vector, and appropriate domain calibration.
Figure 2: The Plan Agent’s dual-LLM workflow supporting both creative generation and rigorous constraint enforcement for blueprint synthesis.
Generation Agent executes the plan by programmatic website construction, using real-world design/data priors, and instituting anti-cheating and answer obfuscation mechanisms via encrypted, code-mediated solution checking.
Refinement Agent systematically evaluates and upgrades the output, injecting real-world noise (pop-ups, cookie dialogs, stochastic events, network delays) and fixing all dead links and agent-unfriendly UI idioms (e.g., blocking alert dialogs).
Validation Agent guarantees actual solvability by executing the annotated solution path inside a Chromium instance, replaying browser actions, and performing strict programmatic comparisons of agent output against ground truth. Tasks failing validation are repaired or discarded, guaranteeing all benchmark items are truly actionable by browser agents.
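
The control flow of the four stages can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `Task` fields, the stub stage functions, and the `max_repairs` retry budget are all hypothetical, and in the real system each stage would be an LLM-driven agent rather than a plain function. The source says only that failing tasks are "repaired or discarded"; the repair loop below is one assumed way to realize that.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Task:
    """Hypothetical container for one benchmark item as it moves through the stages."""
    blueprint: str
    site_files: Dict[str, str] = field(default_factory=dict)
    noise: List[str] = field(default_factory=list)
    validated: bool = False


def run_pipeline(
    seed: str,
    plan: Callable[[str], str],
    generate: Callable[[str], Task],
    refine: Callable[[Task], Task],
    validate: Callable[[Task], bool],
    max_repairs: int = 2,  # assumed retry budget, not taken from the paper
) -> Optional[Task]:
    """Plan -> Generate -> (Refine -> Validate) with a bounded repair loop."""
    task = generate(plan(seed))
    for _ in range(max_repairs + 1):
        task = refine(task)
        if validate(task):
            task.validated = True
            return task
    return None  # discarded: never passed validation


# Stub stages standing in for the LLM-driven agents (illustrative only).
def stub_plan(seed: str) -> str:
    return f"blueprint: {seed}"


def stub_generate(blueprint: str) -> Task:
    return Task(blueprint=blueprint, site_files={"index.html": "<html></html>"})


def stub_refine(task: Task) -> Task:
    if "cookie_dialog" not in task.noise:
        task.noise.append("cookie_dialog")
    return task


def stub_validate(task: Task) -> bool:
    return "index.html" in task.site_files and bool(task.noise)


result = run_pipeline("order lookup", stub_plan, stub_generate, stub_refine, stub_validate)
```

Keeping validation as a hard gate at the end of the loop means that only tasks with a replayable solution path are ever returned, which mirrors the role the text assigns to the Validation Agent.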

Seven-Dimensional Difficulty Control and Benchmark Structure

WebForge imposes a seven-dimensional task difficulty schema: Jump Depth, Jump Breadth, Page Interaction complexity, Visual Complexity, Information Complexity, Reasoning/Calculation, and Risk Factor (irreversibility). Each axis is discretized into three levels, yielding combinatorially diverse scenario configurations. Aggregate difficulty is regulated by compositional constraints (e.g., Level-3 tasks require multiple axes at their hardest settings), producing benchmarks with strong stratification properties both within and across domains.
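
To make the schema concrete, the sketch below encodes the seven axes as a validated record and derives an overall tier from them. The field names follow the axes listed above, but the tiering rule is an assumption made for illustration: the source states only that Level-3 tasks require multiple axes at their hardest setting, so the threshold `min_hard_axes=2` and the Level-1 rule are hypothetical.

```python
from dataclasses import dataclass, astuple
from itertools import product

LEVELS = (1, 2, 3)


@dataclass(frozen=True)
class DifficultyVector:
    jump_depth: int
    jump_breadth: int
    page_interaction: int
    visual_complexity: int
    information_complexity: int
    reasoning_calculation: int
    risk_factor: int

    def __post_init__(self) -> None:
        # Every axis is discretized into exactly three levels.
        for value in astuple(self):
            if value not in LEVELS:
                raise ValueError(f"axis level must be one of {LEVELS}, got {value}")


def overall_level(vec: DifficultyVector, min_hard_axes: int = 2) -> int:
    """Illustrative compositional rule (thresholds are assumptions, not the paper's)."""
    axes = astuple(vec)
    if sum(1 for a in axes if a == 3) >= min_hard_axes:
        return 3  # several axes at their hardest setting
    if max(axes) == 1:
        return 1  # every axis at its easiest setting
    return 2


# Three levels on seven axes give 3**7 possible configurations.
all_vectors = [DifficultyVector(*combo) for combo in product(LEVELS, repeat=7)]
print(len(all_vectors))  # 2187
```

Under any such rule the overall tier is a function of the full vector, so two tasks in the same tier can still stress different axes, which is what permits per-axis capability profiling.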

The pipeline constructs WebForge-Bench, a corpus of 934 tasks over 7 web domains and 3 global difficulty tiers, with final composition reflecting realistic distributions of risk, visual, and navigation complexity.

Empirical Evaluation and Capability Profiling

Experiments on WebForge-Bench cover a range of state-of-the-art closed-source and open-source models, in both text-only and multimodal settings. Three findings stand out:

Difficulty Stratification: All models, including frontier multimodal LLMs (e.g., Gemini-3-Pro, Claude-4.5-Sonnet), show steep accuracy decay as difficulty increases. Level-1 tasks yield over 73% accuracy for top models; Level-3 produces strong separation (e.g., Gemini-3-Pro 58.0% vs. Qwen3-Omni-30B 2.4%). This validates the discriminative efficacy of the multi-dimensional framework.
Domain Sensitivity: Cross-domain analysis reveals substantial and consistent performance shifts: info retrieval tasks are universally easier, while consumer transaction and content moderation domains manifest distinct failure modes, especially in irreversible or nuanced policy judgment scenarios. These distinctions are not visible under conventional aggregate scoring, highlighting the necessity of structured multi-axis benchmarks.
Modality Impact: Removal of visual input consistently decreases accuracy by 14–16 points for multimodal models, and the effect amplifies with increasing task difficulty.
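
The stratified view behind these findings amounts to grouping per-task outcomes by a chosen attribute before averaging. The sketch below shows that computation on toy records; the record layout and the four example outcomes are invented for illustration and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def stratified_accuracy(results: List[dict], key: str) -> Dict[Hashable, float]:
    """Accuracy per group, where `key` names the grouping field (e.g. level or domain)."""
    totals: Dict[Hashable, int] = defaultdict(int)
    correct: Dict[Hashable, int] = defaultdict(int)
    for record in results:
        totals[record[key]] += 1
        correct[record[key]] += int(record["correct"])
    return {group: correct[group] / totals[group] for group in totals}


# Toy outcomes (fabricated purely to exercise the function).
toy_results = [
    {"level": 1, "domain": "info_retrieval", "correct": True},
    {"level": 1, "domain": "info_retrieval", "correct": True},
    {"level": 3, "domain": "consumer_transaction", "correct": False},
    {"level": 3, "domain": "consumer_transaction", "correct": True},
]

aggregate = sum(r["correct"] for r in toy_results) / len(toy_results)
by_level = stratified_accuracy(toy_results, "level")
by_domain = stratified_accuracy(toy_results, "domain")

print(aggregate)  # 0.75
print(by_level)   # {1: 1.0, 3: 0.5}
```

In the toy data the aggregate of 0.75 conceals a drop from 1.0 to 0.5 between levels, the same kind of gap that a single headline score would hide.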

Ablation studies confirm each pipeline component’s critical role: omitting plan refinement or post-generation refinement leads to a substantial decline in the validation pass rate (from 74.1% to 51.4%).

Robust Anti-Cheating Mechanisms

WebForge enforces a final-state evaluation paradigm: agents’ outputs are judged only on end-state correctness, not trajectory, and sensitive solution data is encrypted or gated by operational code. The presence of adversarial answer codes ensures that partial mistake patterns are mapped to plausible, but strictly incorrect, outputs—eliminating source-level answer leakage and facilitating precise diagnostic error analysis. This obviates the need for ad hoc, potentially subjective semantic reward functions that plague prior work.
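
One way to picture code-mediated answer checking is a checker that stores only digests of the answer codes, so the plaintext answer never appears in the page source. The sketch below uses salted SHA-256 hashing as a stand-in; the source mentions encryption and code gating without specifying the scheme, so the hashing approach, the function names, and the example codes are all assumptions.

```python
import hashlib
from typing import Callable, Dict


def digest(code: str, salt: str) -> str:
    """Salted SHA-256 digest of an answer code (a stand-in for the real scheme)."""
    return hashlib.sha256((salt + code).encode("utf-8")).hexdigest()


def build_checker(correct_code: str, decoys: Dict[str, str], salt: str) -> Callable[[str], str]:
    """Return a checker that holds only digests, never the plaintext codes.

    `decoys` maps each adversarial answer code to the mistake it diagnoses.
    """
    correct_digest = digest(correct_code, salt)
    decoy_digests = {digest(code, salt): label for code, label in decoys.items()}

    def check(submitted: str) -> str:
        d = digest(submitted.strip(), salt)
        if d == correct_digest:
            return "correct"
        # A known wrong code pinpoints which mistake the agent made.
        return decoy_digests.get(d, "incorrect: unrecognized")

    return check


# Hypothetical task: the codes and labels below are invented for illustration.
checker = build_checker(
    correct_code="ORD-7731",
    decoys={"ORD-7713": "incorrect: skipped the discount step"},
    salt="task-001",
)

print(checker("ORD-7731"))  # correct
print(checker("ORD-7713"))  # incorrect: skipped the discount step
```

Because the verdict depends only on the submitted final code, the check is independent of the trajectory the agent took, and a matched decoy doubles as a diagnostic label for error analysis.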

Practical and Theoretical Implications

WebForge demonstrates that realistic, reproducible, and scalable browser agent benchmarks can be constructed fully automatically. By escaping the realism–reproducibility–scalability trilemma, the methodology enables:

Creation of continuously updating testbeds immune to the decay and maintenance costs associated with manual curation and content drift.
Systematic, interpretable diagnosis of agent weaknesses along multiple cognitive-action axes, spurring targeted algorithmic improvements.
Generation of arbitrary quantities of labeled, browser-executable training data as a future extension, addressing the paucity of high-quality RL trajectories for web agents.
A robust platform for research into open questions on multi-modal grounding (text/image), sim-to-real transfer, and agentic error recovery under realistically noisy and diverse web scenarios.

The pipeline’s compositional, modular nature also makes it adaptable for future inclusion of simulated backends, collaborative workflows, and additional forms of visual or interaction noise, which are necessary as agent benchmarks are pushed toward more challenging operational frontiers.

Conclusion

WebForge provides a methodology and implementation for multi-dimensional, fully automated browser agent benchmarking. Unlike previous methods, it supplies task generation, interactive web environment synthesis, real-web disturbance injection, and machine-verified solvability checks, all without human annotation. The dataset and empirical results support the effectiveness and necessity of multi-axis difficulty control for comprehensive agent evaluation. The pipeline could serve as a foundation for future RL/LLM research on practical and diverse browser automation agents, and extending it to automated training corpora is a natural next step.
