
WebChoreArena: LLM Web Agent Benchmark

Updated 22 November 2025
  • WebChoreArena is a reproducible benchmark designed to evaluate LLM-driven web agents on complex, multi-step web chores.
  • It tests capabilities in memory management, arithmetic reasoning, and rule persistence through 532 hand-curated tasks.
  • The benchmark employs dockerized simulations of Shopping, Admin, Reddit, and GitLab to ensure controlled, precise evaluations.

WebChoreArena is a fully reproducible benchmark specifically designed to evaluate the capabilities of LLM-driven web agents on realistic, labor-intensive web chores that go beyond the scope of general browsing tasks. Launched as a multi-domain extension of the established WebArena suite, WebChoreArena comprises 532 hand-curated tasks that systematically test agents’ abilities in memory management, arithmetic reasoning, and rule persistence across extended multi-page workflows. The benchmark is constructed atop four simulated, dockerized web environments: e-commerce (Shopping), store administration (Admin), forum (Reddit), and code repository management (GitLab). WebChoreArena provides a rigorously controlled framework for quantifying advances and limitations in next-generation web agents, exposing challenges largely unaddressed by previous benchmarks (Miyai et al., 2 Jun 2025, Yang et al., 8 Oct 2025).

1. Motivation and Conceptual Foundation

As LLM-based web agents demonstrate increasing fluency in navigation and single-step interactions, once-challenging benchmarks such as WebArena no longer discriminate effectively between state-of-the-art systems. WebChoreArena was devised to stress-test agent capabilities on tasks that require: (i) the retrieval and retention of large information sets; (ii) multi-step arithmetic over aggregated data; and (iii) persistence of complex rules or goals across dozens of sequential navigation hops. The explicit objective is to bridge the gap between simulated agent success and the types of information-heavy “chores” users routinely avoid—such as tallying transaction data, extracting and aggregating review scores, and applying context-dependent rules over prolonged action sequences (Miyai et al., 2 Jun 2025).

2. Task Categories and Design

WebChoreArena distributes its 532 new tasks across the four canonical domains plus 65 cross-site tasks that span multiple environments, with each task assigned to one of four “chore” archetypes. The three principal challenge categories are:

  • Massive Memory Tasks: Require agents to extract, store, and operate on large lists (e.g., all product SKUs with review scores). No single value suffices; agents must effectively chunk, serialize, and later recall all relevant observations (a minimal scratchpad sketch follows this list).
  • Calculation Tasks: Involve arithmetic over previously retrieved data, including summing, averaging, and threshold evaluations (e.g., summing the comment counts of the top 40 Reddit posts). Precision is mandatory, as computational errors yield outright failure.
  • Long-Term Memory Tasks: Require persistent retention and later application of rules or parameters acquired at the start of a workflow—often after traversing multiple unrelated pages (e.g., recall discount eligibility from a policies page to optimize check-out choices).
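
The sketch below makes the chunk-serialize-recall pattern concrete as a minimal external scratchpad an agent might maintain across pages; the class, field names, and example values are illustrative assumptions, not benchmark code or the design of any evaluated agent.

```python
# Minimal sketch of an external scratchpad for Massive Memory chores: the agent
# appends structured observations page by page, then aggregates them at the end.
# The class and field names are illustrative assumptions, not benchmark code.
from dataclasses import dataclass, field

@dataclass
class Scratchpad:
    rows: list[dict] = field(default_factory=list)

    def record(self, **observation) -> None:
        """Store one structured observation (e.g., a product and its review score)."""
        self.rows.append(observation)

    def aggregate(self, key: str) -> float:
        """Recall everything recorded so far and reduce it to a single number."""
        values = [row[key] for row in self.rows if key in row]
        return sum(values) / len(values) if values else float("nan")

pad = Scratchpad()
pad.record(sku="B-1021", review_score=4.5)   # gathered on an early page
pad.record(sku="B-1088", review_score=3.0)   # gathered many hops later
print(pad.aggregate("review_score"))         # 3.75
```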

Each task is delivered via templated, unambiguous instructions to reduce annotation noise and isolate agent model limitations rather than human-generated ambiguities. Templates standardize interaction styles across settings, and reproducibility is strictly enforced through Docker-based instances, fixed seeds, and the same action schemas and observation types as WebArena (Miyai et al., 2 Jun 2025, Yang et al., 8 Oct 2025).
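
To make the templating concrete, the sketch below shows how a single task instance might be specified; the field names (intent_template, eval, reference_answers, and the like) are assumptions loosely modeled on WebArena-style task configurations, not WebChoreArena's actual schema.

```python
# Hypothetical WebChoreArena-style task instance. Field names and values are
# assumptions loosely modeled on WebArena-style task configs, not the actual schema.
task = {
    "task_id": 101,                      # illustrative identifier
    "sites": ["shopping_admin"],         # dockerized environment(s) the task runs in
    "chore_type": "calculation",         # one of the chore archetypes
    "intent_template": "What is the total refunded amount for orders placed in {{month}} {{year}}?",
    "intent": "What is the total refunded amount for orders placed in January 2023?",
    "eval": {
        "eval_types": ["string_match"],
        "reference_answers": {"must_include": ["1,234.56"]},   # made-up ground truth
    },
}
```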

3. Simulation Environments and Evaluation Framework

WebChoreArena inherits its simulated environments from WebArena, encompassing:

  • Shopping: E-commerce storefront supporting product discovery, filtering, and transactional operations.
  • Admin: Store backend with reporting, inventory, and metadata management.
  • Reddit: Forum simulation with posts, comments, and moderation workflows.
  • GitLab: Repository management including issues, merge requests, and activity summaries.

Tasks across these environments are instantiated deterministically for complete reproducibility. The same agent interfaces and scaffolds (BrowserGym, AgentOccam) work with both benchmarks interchangeably, permitting direct attribution of performance deviations to task complexity rather than environmental drift (Miyai et al., 2 Jun 2025).
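
The sketch below illustrates what driving one task through a Gymnasium-style interface (as BrowserGym exposes for WebArena-family environments) might look like; the environment naming, seeding, and reward convention are assumptions for illustration, not documented WebChoreArena identifiers.

```python
# Minimal sketch of running one benchmark task through a Gymnasium-style interface.
# Environment naming, observation handling, and the reward convention are
# assumptions for illustration, not documented WebChoreArena identifiers.
import gymnasium as gym
import browsergym.webarena  # assumed to register browsergym/webarena.* task environments

def run_task(task_name: str, agent, max_steps: int = 50) -> bool:
    env = gym.make(task_name)            # e.g. "browsergym/webarena.42" (assumed naming)
    obs, info = env.reset(seed=0)        # fixed seed for reproducibility
    done = False
    for _ in range(max_steps):
        action = agent.act(obs)          # agent maps observation -> action string
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            done = True
            break
    env.close()
    return done and reward > 0           # positive reward on ground-truth match (assumed)
```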

Evaluation Metrics

Three core metrics quantify performance (a toy computation of the first two is sketched after this list):

  • Accuracy(A): $\mathrm{Accuracy}(A) = \frac{1}{T} \sum_{i=1}^{T} \mathrm{Pass}_i(A)$, where $T$ is the task count and $\mathrm{Pass}_i(A) = 1$ iff the agent’s response matches ground truth under the category-appropriate matching protocol.
    • string_match: exact, must_include, or fuzzy (semantic equality per GPT-4o).
    • url_match: endpoint URL equivalence.
    • program_html: post-interaction HTML values (queried via CSS/XPath).
  • Mean Steps(A): Average step count across all completed tasks.
  • Success Rate (alternative nomenclature): Proportion of chores completed with exact ground-truth outputs, as formalized in (Yang et al., 8 Oct 2025).
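
The toy computation below makes the Accuracy and Mean Steps definitions concrete; the record fields and the exact/must_include matchers are simplified stand-ins for the benchmark's actual string_match, url_match, and program_html evaluators.

```python
# Toy evaluation sketch: the record fields and matchers are simplified stand-ins
# for the benchmark's string_match / url_match / program_html evaluators, shown
# only to make the Accuracy and Mean Steps definitions concrete.
from dataclasses import dataclass

@dataclass
class TaskRecord:
    answer: str        # agent's final response
    reference: str     # ground-truth answer
    match: str         # "exact" or "must_include" (fuzzy matching omitted)
    steps: int         # number of actions the agent took

def passed(rec: TaskRecord) -> bool:
    if rec.match == "exact":
        return rec.answer.strip() == rec.reference.strip()
    return rec.reference.lower() in rec.answer.lower()    # must_include

def accuracy(records: list[TaskRecord]) -> float:
    # Accuracy(A) = (1/T) * sum_i Pass_i(A)
    return sum(passed(r) for r in records) / len(records)

def mean_steps(records: list[TaskRecord]) -> float:
    return sum(r.steps for r in records) / len(records)

records = [
    TaskRecord("The total is 1530", "1530", "must_include", steps=34),
    TaskRecord("42", "41", "exact", steps=27),
]
print(accuracy(records), mean_steps(records))   # 0.5 30.5
```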

4. Comparative Agent Performance

WebChoreArena exposes a significant difficulty gradient compared to WebArena. The following accuracy measurements (AgentOccam, representative runs) evidence this contrast (Miyai et al., 2 Jun 2025):

| Benchmark / LLM | Shopping | Admin | Reddit | GitLab | Cross | Overall | Δ vs. WebArena |
|---|---|---|---|---|---|---|---|
| WebArena | | | | | | | |
| GPT-4o | 44.0 | 66.0 | 38.9 | 10.3 | 37.4 | 42.8 | |
| Claude 3.7 | 49.5 | 74.5 | 50.0 | 13.8 | 49.7 | 52.0 | |
| Gemini 2.5 Pro | 53.3 | 75.5 | 51.7 | 10.3 | 54.5 | 54.8 | |
| WebChoreArena | | | | | | | |
| GPT-4o | 4.5 | 9.9 | 7.1 | 0.0 | 10.3 | 6.8 | -36.0 |
| Claude 3.7 | 28.8 | 23.1 | 22.8 | 7.7 | 27.4 | 23.5 | -28.5 |
| Gemini 2.5 Pro | 42.4 | 44.0 | 38.6 | 10.8 | 41.9 | 37.8 | -17.0 |

Results on WebChoreArena show pronounced declines in accuracy (up to 36.0 points for GPT-4o), illustrating the benchmark's escalated demands on information retention, precise arithmetic, and context carryover. Memory management dominates success in Massive Memory tasks, while calculation accuracy degrades as the operand count increases (N > 15) (Miyai et al., 2 Jun 2025). In more recent efforts, decomposition and re-planning strategies (WebDART) increase success rates (e.g., AgentOccam: 21.6% baseline → 31.2% with WebDART on a GPT-5 backbone) and substantially reduce navigation step counts (e.g., Shopping: 32.9 → 18.2 average steps) (Yang et al., 8 Oct 2025).
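
As a quick check on the Δ column, the snippet below re-derives each model's drop directly from the two Overall values reported above.

```python
# Derive the Δ vs. WebArena column from the Overall scores in the table above:
# Δ = WebChoreArena overall accuracy - WebArena overall accuracy.
overall = {
    "GPT-4o":         {"webarena": 42.8, "webchorearena": 6.8},
    "Claude 3.7":     {"webarena": 52.0, "webchorearena": 23.5},
    "Gemini 2.5 Pro": {"webarena": 54.8, "webchorearena": 37.8},
}
for model, scores in overall.items():
    delta = scores["webchorearena"] - scores["webarena"]
    print(f"{model}: Δ = {delta:+.1f} points")   # -36.0, -28.5, -17.0
```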

5. Distinguishing Features and Benchmark Advances

WebChoreArena departs from general web agent benchmarks in both task structure and evaluation philosophy:

  • Task Length and Complexity: Chores require cross-page retrievals, data filtering/sorting, and multi-step reasoning, as opposed to WebArena’s predominantly single-action goals.
  • Evaluation Focus: Direct measurement of reasoning, memory, and execution skills rather than simple navigation fluency.
  • Templates and Annotation Quality: Rigid, reproducible templates eliminate ambiguity, ensuring that failure modes reflect intrinsic agent deficits.

The authors explicitly recommend pursuing external memory strategies, improved tool integration (e.g., calculator invocation in arithmetic-heavy steps), and visual-textual fusion methods to address hallucination during screenshot-based exploration. They also call for extending evaluation to real-world web instances, leveraging deterministic recording/replay for reproducibility outside simulation (Miyai et al., 2 Jun 2025).
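
To illustrate the tool-integration suggestion, a minimal calculator tool that an agent loop could invoke on arithmetic-heavy steps is sketched below; the interface and its wiring into an agent are assumptions, not part of the benchmark or of the cited agents.

```python
# Minimal sketch of a calculator tool an agent could call instead of doing
# arithmetic "in its head"; the interface is an assumption for illustration
# and is not part of WebChoreArena or any evaluated agent.
import ast
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
}

def calculator(expression: str) -> float:
    """Safely evaluate a basic arithmetic expression such as '12.5 + 7 * 3'."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")
    return float(_eval(ast.parse(expression, mode="eval").body))

# e.g. summing comment counts gathered from the top posts of a forum page:
comment_counts = [118, 92, 75, 64, 51]
print(calculator(" + ".join(map(str, comment_counts))))   # 400.0
```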

6. Impact, Limitations, and Future Directions

WebChoreArena is positioned as a high-fidelity standard for measuring the advancement of LLM-powered web agents toward robust, real-world automation. Its sharper separation of agent capabilities—compared to the saturation trends observed in earlier benchmarks—enables clear attribution of technological progress. Nonetheless, even the most recent top-performing systems (Gemini 2.5 Pro, GPT-5 with WebDART) remain well below “human-like” accuracy, with significant challenges persisting on Massive Memory and Long-Term Memory chores.

Anticipated directions include real-web instantiation of the benchmark, innovations in memory-augmented agent design, and hierarchically compositional planning frameworks—such as dynamic objective decomposition and adaptive re-planning—as demonstrated by the gains of WebDART on the benchmark (Yang et al., 8 Oct 2025).

By foregrounding long-horizon information processing and reasoning, WebChoreArena provides a rigorous and reproducible platform for the controlled study and improvement of web agent architectures. Its adoption is likely to catalyze advances that address the current state-of-the-art's memory, calculation, and multi-step execution bottlenecks.
