CUAVerifierBench: CUA Trajectory Benchmark

Updated 11 April 2026
  • CUAVerifierBench is a standardized benchmark that evaluates computer-use agent trajectory verification using natural-language goals, action histories, and high-frequency screenshots.
  • It splits data into Internal and Browserbase sets and employs dual-layer human annotation with outcome and process metrics such as Cohen’s κ and false positive rates.
  • The benchmark’s pipeline leverages rubric generation, top-k screenshot context selection, and two-pass multimodal scoring to enable reproducible, high-fidelity evaluations.

CUAVerifierBench is a standardized, human-labeled benchmark for the evaluation of computer-use agent (CUA) trajectory verification systems. It is designed to enable rigorous, reproducible comparison of verifier accuracy in assessing both process and outcome success for web-based task executions, where trajectories comprise natural-language goals, action histories, and high-frequency screenshots. The benchmark underpins developments in scalable, high-fidelity verification and has established new evaluation standards in the agent alignment and web-automation communities (Rosset et al., 5 Apr 2026).

1. Dataset Structure and Task Domains

CUAVerifierBench consists of two principal splits:

  • Internal split: 140 trajectories (used for ablations and system development).
  • Browserbase OM2W: 106 trajectories independently annotated by two external raters.

Each trajectory τ entails:

  • A natural-language goal g (e.g., “book the cheapest 3-night stay...”).
  • A stepwise sequence of user or agent actions a_1, ..., a_T.
  • A multimodal history of M screenshots, on average M ≈ 47 per trajectory.

Task domains are broadly sampled and include web search, information retrieval (e.g., LinkedIn job listings), e-commerce (Amazon, AutoZone), event discovery (Eventbrite, Spotify), travel booking (Booking.com, AirAsia), geospatial/map queries, and online form completion (OpenTable, Resy). Data is not subdivided into train/val/test, but rather organized by internal “dev/ablation” (Internal) and external “test” (Browserbase OM2W) splits (Rosset et al., 5 Apr 2026).
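The trajectory structure above can be captured in a minimal data structure. This is an illustrative sketch only; the field names and the `split` values are assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One CUA trajectory: a goal g, actions a_1..a_T, and M screenshots."""
    goal: str                 # natural-language goal g
    actions: list[str]        # stepwise action history a_1, ..., a_T
    screenshots: list[bytes]  # M screenshots (M ~ 47 per trajectory on average)
    split: str = "internal"   # assumed label: "internal" or "browserbase_om2w"


traj = Trajectory(
    goal="book the cheapest 3-night stay...",
    actions=["click('search')", "type('dates')"],
    screenshots=[b"", b""],
)
print(len(traj.actions))
```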

2. Human Annotation and Rubric Methodology

The annotation pipeline employs a dual-layer label structure:

  • Outcome label (r_out):
    • Binary, indicating task completion as judged by “a reasonable user.”
  • Process label (r_proc):

    • A normalized continuous score in [0, 1], defined as:

    r_proc = (Σ_{i ∈ A} earned_i) / (Σ_{i ∈ A} maxPts_i)

    where A is the subset of rubric criteria applicable in the given context.
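The process score above can be computed directly once each criterion carries earned and maximum points plus an applicability flag. A minimal sketch, assuming a simple dict representation for criteria (the keys are illustrative):

```python
def process_score(criteria):
    """r_proc = sum(earned_i) / sum(maxPts_i) over applicable criteria A.

    `criteria` is a list of dicts with keys: applicable (bool),
    earned (float), max_pts (float). Non-applicable criteria
    (e.g., conditional criteria whose antecedent does not hold)
    are excluded from both sums.
    """
    applicable = [c for c in criteria if c["applicable"]]
    total_max = sum(c["max_pts"] for c in applicable)
    if total_max == 0:
        return 0.0  # convention for "no applicable criteria"; not from the paper
    return sum(c["earned"] for c in applicable) / total_max


score = process_score([
    {"applicable": True,  "earned": 2.0, "max_pts": 2.0},
    {"applicable": True,  "earned": 1.0, "max_pts": 3.0},
    {"applicable": False, "earned": 0.0, "max_pts": 5.0},  # antecedent unmet
])
print(score)  # 3.0 / 5.0
```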

Human annotators are supplied with the goal g, the complete trajectory, and the (unscored) rubric criteria, and initially provide “UV-blind” (verifier-agnostic) judgments. After being informed of the Universal Verifier’s verdicts, they re-evaluate for consensus scoring. Each trajectory in the Browserbase split is labeled by two calibrated raters.

Rubric principles include:

  1. Construction from non-overlapping, specific criteria to avoid “phantom requirements.”
  2. Generation from the goal g alone, with scoring performed against the full trajectory.
  3. Handling of conditional requirements—criteria are excluded if the antecedent does not hold.
  4. Two-pass scoring (actions alone and full-screenshot context) to reveal agent hallucinations or omissions.
  5. Segregated “side-effect” evaluation for penalizing undesired actions such as unsolicited cart additions.
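Principle 4 (two-pass scoring) amounts to comparing an action-only pass against a full-context pass and flagging disagreements. A minimal sketch, in which the two scoring functions are hypothetical stand-ins for the verifier's model calls:

```python
def two_pass_check(criteria, score_actions_only, score_full_context):
    """Flag criteria where the action log claims success but the screenshot
    evidence disagrees (hallucination), or vice versa (omission)."""
    flags = {}
    for c in criteria:
        a = score_actions_only(c)   # pass 1: actions alone
        f = score_full_context(c)   # pass 2: actions + screenshot context
        if a and not f:
            flags[c] = "hallucination"  # claimed in actions, unseen on screen
        elif f and not a:
            flags[c] = "omission"       # visible on screen, missing from log
    return flags


# Toy example with hard-coded judgments standing in for model outputs.
flags = two_pass_check(
    ["cart_updated", "dates_selected"],
    score_actions_only=lambda c: c == "cart_updated",
    score_full_context=lambda c: c == "dates_selected",
)
print(flags)
```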

3. Evaluation Metrics and Agreement

CUAVerifierBench employs established classification metrics and inter-annotator agreement statistics:

  • Standard metrics:

    • Accuracy, Precision, Recall, and F1.
    • False Positive Rate (FPR): FPR = FP / (FP + TN).
    • False Negative Rate (FNR): FNR = FN / (FN + TP).
    • Human–verifier agreement: the fraction of trajectories on which the verifier’s verdict matches the human label.

  • Cohen’s κ for inter-annotator reliability:

    κ = (p_o − p_e) / (1 − p_e)

    where p_o is the observed agreement and p_e the agreement expected by chance.
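These metrics follow their standard definitions, so they can be recomputed from paired binary labels. A self-contained sketch with toy data:

```python
def binary_metrics(y_true, y_pred):
    """Return (FPR, FNR) for binary labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum(not t and not p for t, p in zip(y_true, y_pred))
    fp = sum(not t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    fpr = fp / (fp + tn) if fp + tn else 0.0
    fnr = fn / (fn + tp) if fn + tp else 0.0
    return fpr, fnr


def cohens_kappa(a, b):
    """kappa = (p_o - p_e) / (1 - p_e) for two binary raters."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    p_a1 = sum(a) / n                             # rater A's positive rate
    p_b1 = sum(b) / n                             # rater B's positive rate
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)   # chance agreement
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0


# Toy human vs. verifier labels (not benchmark data).
human = [1, 1, 0, 0, 1, 0]
verifier = [1, 0, 0, 0, 1, 1]
print(binary_metrics(human, verifier))
print(round(cohens_kappa(human, verifier), 3))
```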

  • Outcome κ is reported for both the UV-blind and UV-informed annotation rounds.
  • Process κ is reported on the binary labels and on the continuous score binarized at a fixed threshold.
  • Universal Verifier–human agreement matches human–human levels on both outcome and process (Rosset et al., 5 Apr 2026).

4. Benchmark Results and Comparative Analysis

CUAVerifierBench enables fine-grained comparison among verifiers including the Universal Verifier (UV), WebVoyager, and WebJudge. Metrics are reported separately for outcome and process scores across both splits. The following table summarizes outcome and process agreement as Cohen’s κ alongside FPR:

| Split | Verifier | κ (outcome) | FPR (outcome) | κ (process) | FPR (process) |
|---|---|---|---|---|---|
| Internal | WebVoyager@4o | 0.31 | 0.45 | 0.17 | 0.52 |
| Internal | WebJudge@o4-mini | 0.44 | 0.22 | 0.32 | 0.25 |
| Internal | UV (GPT-5.2) | 0.64 | 0.01 | 0.59 | 0.04 |
| Browserbase | WebVoyager@4o | 0.13 | 0.60 | 0.22 | 0.56 |
| Browserbase | WebJudge@o4-mini | 0.26 | 0.40 | 0.34 | 0.38 |
| Browserbase | UV (GPT-5.2) | 0.58 | 0.08 | 0.43 | 0.20 |

Results indicate that the Universal Verifier, leveraging a dedicated rubric-driven multimodal pipeline and GPT-5.2, achieves human-level κ and minimizes false positives compared to WebVoyager or WebJudge. Notably, backbone upgrades in comparator verifiers reduce FPR but increase false negatives, confirming that performance gains arise from architectural innovations rather than model scaling alone.

Ablation studies further demonstrate that rubric-scoring separation and systematic context management are critical for high-fidelity process and outcome verification, rather than backbone LLM choice in isolation (Rosset et al., 5 Apr 2026).

5. Rubric Generation, Multimodal Scoring, and Pipeline

The Universal Verifier’s pipeline—operating on CUAVerifierBench—proceeds as follows:

  1. Rubric Generation: Extract disjoint criteria from the natural-language goal g.
  2. Screenshot Relevance Matrix: Score each screenshot against each criterion, forming a relevance matrix.
  3. Top-k Context Selection: Identify the top-k screenshots per criterion to balance thoroughness and efficiency.
  4. Two-Pass Scoring: First pass evaluates action-only context; the second incorporates full multimodal input.
  5. Side-Effect Detection: Explicit pass penalizes unrequested side effects.
  6. Outcome Verification: Outputs binary verdict of task completion.
  7. Failure Diagnosis: Assigns a failure code from a 24-category taxonomy for systematic error analysis.
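Steps 2–3 above (relevance scoring and top-k selection) can be sketched as follows. The relevance matrix here is toy data standing in for the output of the verifier's multimodal scorer:

```python
def top_k_screenshots(relevance, k):
    """Given a relevance matrix relevance[criterion][screenshot] (list of
    lists of scores), return the indices of the k most relevant
    screenshots for each criterion."""
    selected = {}
    for ci, row in enumerate(relevance):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        selected[ci] = ranked[:k]
    return selected


# Toy relevance matrix: 2 criteria x 4 screenshots.
R = [
    [0.1, 0.9, 0.3, 0.7],
    [0.8, 0.2, 0.6, 0.1],
]
print(top_k_screenshots(R, k=2))
```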

This pipeline architecture is open-sourced in the UniversalVerifier class (Algorithm 1), and implemented to facilitate external benchmarking and research reproducibility. Data and code are available at https://github.com/microsoft/fara (Rosset et al., 5 Apr 2026).

6. Usage, Accessibility, and Benchmarking Practices

CUAVerifierBench is provided as a Python-accessible dataset and pipeline, with accompanying rubric templates, scoring instructions, and worked examples. The repository comprises both splits, full annotation metadata, prompt assets, and automated evaluation scripts. Researchers load trajectories, run a verifier over each, and compare predictions against the human labels using the bundled evaluation scripts.
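A hypothetical usage sketch of that workflow follows. The class, method, and field names are illustrative assumptions, not the repository's actual API:

```python
# Illustrative only: load trajectories, run a verifier, score against labels.
# The real interface lives in the microsoft/fara repository; names are assumed.

class DummyVerifier:
    """Stand-in verifier: predicts success iff the action log is non-empty."""
    def verify(self, trajectory):
        return len(trajectory["actions"]) > 0


dataset = [
    {"goal": "book the cheapest 3-night stay...",
     "actions": ["click", "type"], "human_outcome": True},
    {"goal": "find LinkedIn job listings...",
     "actions": [], "human_outcome": True},
]

verifier = DummyVerifier()
preds = [verifier.verify(t) for t in dataset]
agreement = sum(p == t["human_outcome"]
                for p, t in zip(preds, dataset)) / len(dataset)
print(agreement)
```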

This enables reproducible scoring, diagnostic reporting, and detailed quantitative analyses. A plausible implication is that this infrastructure streamlines both system development and comparative studies across agent alignment and multimodal verification research.

7. Impact, Limitations, and Future Directions

CUAVerifierBench establishes the first process-and-outcome human-labeled standard for CUA trajectory verification at relevant scale. Its design incentivizes verifiers to address both the correctness of final outcomes and the faithfulness of agent processes, filling gaps left by prior task success benchmarks that focused solely on outcomes or lacked formal human agreement studies.

Benchmark results show that architectural pipeline design—specifically rubric generation, context selection, and multimodal scoring—is the primary determinant of verifier success, rather than LLM backbone scaling alone. This suggests further improvements may derive from advances in rubric learning and multimodal interaction, rather than solely from larger LLMs.

As the benchmark is open to the research community, extensions are anticipated in trajectory complexity, domain coverage, and formalization of rubric construction. The split between process and outcome reward additionally points to new analyses in agent error modes and the challenge of aligning agents toward not just final but also intermediate behaviors (Rosset et al., 5 Apr 2026).

References (1)
