CUAVerifierBench: CUA Trajectory Benchmark
- CUAVerifierBench is a standardized benchmark that evaluates computer-use agent trajectory verification using natural-language goals, action histories, and high-frequency screenshots.
- It splits data into Internal and Browserbase sets and employs dual-layer human annotation with outcome and process metrics such as Cohen’s κ and false positive rates.
- The benchmark’s pipeline leverages rubric generation, top-k screenshot context selection, and two-pass multimodal scoring to enable reproducible, high-fidelity evaluations.
CUAVerifierBench is a standardized, human-labeled benchmark for the evaluation of computer-use agent (CUA) trajectory verification systems. It is designed to enable rigorous, reproducible comparison of verifier accuracy in assessing both process and outcome success for web-based task executions, where trajectories comprise natural-language goals, action histories, and high-frequency screenshots. The benchmark underpins developments in scalable, high-fidelity verification and has established new evaluation standards in the agent alignment and web-automation communities (Rosset et al., 5 Apr 2026).
1. Dataset Structure and Task Domains
CUAVerifierBench consists of two principal splits:
- Internal split: 140 trajectories (used for ablations and system development).
- Browserbase OM2W: 106 trajectories independently annotated by two external raters.
Each trajectory entails:
- A natural-language goal (e.g., “book the cheapest 3-night stay...”).
- A stepwise sequence of user or agent actions.
- A multimodal history of high-frequency screenshots captured over the course of execution.
Task domains are broadly sampled and include web search, information retrieval (e.g., LinkedIn job listings), e-commerce (Amazon, AutoZone), event discovery (Eventbrite, Spotify), travel booking (Booking.com, AirAsia), geospatial/map queries, and online form completion (OpenTable, Resy). Data is not subdivided into train/val/test, but rather organized by internal “dev/ablation” (Internal) and external “test” (Browserbase OM2W) splits (Rosset et al., 5 Apr 2026).
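To make the trajectory format concrete, the following is a minimal sketch of how a single benchmark trajectory could be represented in Python. The field names (`goal`, `actions`, `screenshots`, `split`) are illustrative assumptions, not the benchmark's released schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Action:
    """One step taken by the agent (illustrative fields)."""
    index: int          # position in the action history
    description: str    # e.g., "click 'Search'", "type 'Seattle' into destination"

@dataclass
class Trajectory:
    """A single CUA trajectory as described in the benchmark (assumed schema)."""
    goal: str                                              # natural-language goal
    actions: List[Action]                                  # stepwise action history
    screenshots: List[str] = field(default_factory=list)   # paths to screenshot images
    split: str = "internal"                                # "internal" or "browserbase"

# Example instance for a travel-booking task
example = Trajectory(
    goal="Book the cheapest 3-night stay in Seattle for next weekend.",
    actions=[Action(0, "open booking.com"), Action(1, "search 'Seattle'")],
    screenshots=["step_000.png", "step_001.png"],
)
```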
2. Human Annotation and Rubric Methodology
The annotation pipeline employs a dual-layer label structure:
- Outcome label ($y_{\text{out}}$):
  - Binary, indicating task completion as judged by "a reasonable user."
- Process label ($s_{\text{proc}}$):
  - A normalized continuous score in $[0, 1]$, defined as the fraction of applicable rubric criteria that are satisfied:

$$s_{\text{proc}} = \frac{|\{c \in \mathcal{C}_{\text{app}} : c \text{ is satisfied}\}|}{|\mathcal{C}_{\text{app}}|},$$

where $\mathcal{C}_{\text{app}}$ is the subset of applicable rubric criteria in the given context.
Human annotators are supplied with the goal, the complete trajectory, and the (unscored) rubric criteria, and initially provide "UV-blind" (verifier-agnostic) judgments. After being informed of the Universal Verifier's verdicts, they re-evaluate for consensus scoring. Each trajectory in the Browserbase split is labeled by two calibrated raters.
Rubric principles include:
- Construction from non-overlapping, specific criteria to avoid “phantom requirements.”
- Generation from the natural-language goal alone, with scoring performed against the trajectory evidence.
- Handling of conditional requirements—criteria are excluded if the antecedent does not hold.
- Two-pass scoring (actions alone and full-screenshot context) to reveal agent hallucinations or omissions.
- Segregated “side-effect” evaluation for penalizing undesired actions such as unsolicited cart additions.
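As an illustration of how a rubric-based process score might be computed under these principles (conditional-criterion exclusion, then averaging over the applicable criteria), here is a minimal sketch; the `Criterion` structure and the averaging rule are assumptions based on the definition above, not the benchmark's released implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Criterion:
    """One rubric criterion (assumed structure)."""
    text: str
    satisfied: bool
    applicable: bool = True   # conditional criteria are marked inapplicable
                              # when their antecedent does not hold

def process_score(criteria: List[Criterion]) -> Optional[float]:
    """Normalized process label: fraction of applicable criteria satisfied."""
    applicable = [c for c in criteria if c.applicable]
    if not applicable:
        return None  # no applicable criteria; the score is undefined
    return sum(c.satisfied for c in applicable) / len(applicable)

# Example: a conditional criterion excluded because its antecedent did not hold
rubric = [
    Criterion("Searched for the requested destination", satisfied=True),
    Criterion("Selected a 3-night date range", satisfied=True),
    Criterion("If login was required, logged in", satisfied=False, applicable=False),
    Criterion("Completed checkout of the cheapest option", satisfied=False),
]
print(process_score(rubric))  # 2/3 ≈ 0.67
```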
3. Evaluation Metrics and Agreement
CUAVerifierBench employs established classification metrics and inter-annotator agreement statistics:
- Standard metrics:
  - Accuracy, Precision, Recall, and F1, defined in the usual way from the confusion counts:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2\,\text{Precision}\cdot\text{Recall}}{\text{Precision} + \text{Recall}}.$$

  - False Positive Rate (FPR): $\mathrm{FPR} = FP / (FP + TN)$, the fraction of failed trajectories the verifier wrongly credits.
  - False Negative Rate (FNR): $\mathrm{FNR} = FN / (FN + TP)$, the fraction of successful trajectories the verifier wrongly rejects.
  - Human–verifier agreement: the fraction of trajectories on which the verifier's verdict matches the human label.
- Cohen's κ for inter-annotator reliability:

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance. Inter-annotator κ is reported separately for outcome labels (UV-blind and UV-informed) and for process labels (both the binary form and the continuous score binarized at a fixed threshold).
- Universal Verifier–human agreement is reported for both outcome and process labels and matches human–human agreement levels (Rosset et al., 5 Apr 2026).
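These quantities can be computed directly from paired binary labels. The sketch below uses scikit-learn's standard implementations and assumes verifier and human verdicts are available as 0/1 arrays; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score, confusion_matrix)

# Illustrative labels: 1 = trajectory judged successful, 0 = failed
human_labels = np.array([1, 0, 1, 1, 0, 1, 0, 0])
verifier_labels = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(human_labels, verifier_labels).ravel()

metrics = {
    "accuracy": accuracy_score(human_labels, verifier_labels),
    "precision": precision_score(human_labels, verifier_labels),
    "recall": recall_score(human_labels, verifier_labels),
    "f1": f1_score(human_labels, verifier_labels),
    "fpr": fp / (fp + tn),   # failed trajectories wrongly credited
    "fnr": fn / (fn + tp),   # successful trajectories wrongly rejected
    "kappa": cohen_kappa_score(human_labels, verifier_labels),
}
print(metrics)
```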
4. Benchmark Results and Comparative Analysis
CUAVerifierBench enables fine-grained comparison among verifiers including the Universal Verifier (UV), WebVoyager, and WebJudge. Metrics are reported separately for outcome and process scores across both splits. The following table summarizes outcome and process agreement as Cohen's κ and FPR:
| Split | Verifier | κ (outcome) | FPR (outcome) | κ (process) | FPR (process) |
|---|---|---|---|---|---|
| Internal | WebVoyager@4o | 0.31 | 0.45 | 0.17 | 0.52 |
| Internal | WebJudge@o4-mini | 0.44 | 0.22 | 0.32 | 0.25 |
| Internal | UV (GPT-5.2) | 0.64 | 0.01 | 0.59 | 0.04 |
| Browserbase | WebVoyager@4o | 0.13 | 0.60 | 0.22 | 0.56 |
| Browserbase | WebJudge@o4-mini | 0.26 | 0.40 | 0.34 | 0.38 |
| Browserbase | UV (GPT-5.2) | 0.58 | 0.08 | 0.43 | 0.20 |
Results indicate that the Universal Verifier, leveraging a dedicated rubric-driven multimodal pipeline and GPT-5.2, achieves human-level κ and substantially lower false positive rates than WebVoyager or WebJudge. Notably, backbone upgrades in comparator verifiers reduce FPR but increase false negatives, confirming that performance gains arise from architectural innovations rather than model scaling alone.
Ablation studies further demonstrate that rubric-scoring separation and systematic context management are critical for high-fidelity process and outcome verification, rather than backbone LLM choice in isolation (Rosset et al., 5 Apr 2026).
5. Rubric Generation, Multimodal Scoring, and Pipeline
The Universal Verifier’s pipeline—operating on CUAVerifierBench—proceeds as follows:
- Rubric Generation: Extract a set of disjoint criteria from the natural-language goal.
- Screenshot Relevance Matrix: Score each screenshot against each criterion, forming a criterion-by-screenshot relevance matrix.
- Top-k Context Selection: Identify the top-k most relevant screenshots per criterion to balance thoroughness and efficiency.
- Two-Pass Scoring: First pass evaluates action-only context; the second incorporates full multimodal input.
- Side-Effect Detection: Explicit pass penalizes unrequested side effects.
- Outcome Verification: Outputs binary verdict of task completion.
- Failure Diagnosis: Assigns a failure code from a 24-category taxonomy for systematic error analysis.
This pipeline architecture is open-sourced in the UniversalVerifier class (Algorithm 1), and implemented to facilitate external benchmarking and research reproducibility. Data and code are available at https://github.com/microsoft/fara (Rosset et al., 5 Apr 2026).
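To illustrate the context-selection step of this pipeline, here is a minimal sketch of top-k screenshot selection from a criterion-by-screenshot relevance matrix. It is a simplified stand-in for the open-sourced UniversalVerifier implementation, and the matrix values and function name are illustrative assumptions.

```python
import numpy as np

def select_topk_screenshots(relevance: np.ndarray, k: int = 3) -> dict:
    """For each rubric criterion (row), pick the indices of the k most relevant
    screenshots (columns). This bounds the multimodal context passed to the
    scoring model while keeping the evidence most likely to decide each criterion."""
    selections = {}
    for i, row in enumerate(relevance):
        top = np.argsort(row)[::-1][:k]        # highest-scoring screenshots first
        selections[i] = sorted(top.tolist())   # restore chronological order for scoring
    return selections

# Illustrative relevance matrix: 4 criteria x 6 screenshots, scores in [0, 1]
rng = np.random.default_rng(0)
relevance_matrix = rng.random((4, 6))
print(select_topk_screenshots(relevance_matrix, k=2))
```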
6. Usage, Accessibility, and Benchmarking Practices
CUAVerifierBench is provided as a Python-accessible dataset and pipeline, with accompanying rubric templates, scoring instructions, and worked examples. The repository comprises both splits, full annotation metadata, prompt assets, and automated evaluation scripts. Researchers load trajectories, run a verifier over them, and score the resulting verdicts against the human labels through this interface.
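A minimal sketch of what such an interaction could look like is given below; the module name `cuaverifierbench` and the `load_split` / `evaluate` helpers are hypothetical placeholders, not the actual API of the microsoft/fara repository.

```python
# Hypothetical usage sketch -- names below are illustrative, not the released API.
import cuaverifierbench as cvb          # assumed package name
from my_verifier import MyVerifier      # any verifier exposing a .verify(trajectory) method

trajectories = cvb.load_split("browserbase")   # or "internal"
verifier = MyVerifier()

predictions = [verifier.verify(t) for t in trajectories]   # binary outcome verdicts
report = cvb.evaluate(predictions, trajectories)           # κ, FPR, FNR, F1 vs. human labels
print(report)
```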
This enables reproducible scoring, diagnostic reporting, and detailed quantitative analyses. A plausible implication is that this infrastructure streamlines both system development and comparative studies across agent alignment and multimodal verification research.
7. Impact, Limitations, and Future Directions
CUAVerifierBench establishes the first process-and-outcome human-labeled standard for CUA trajectory verification at relevant scale. Its design incentivizes verifiers to address both the correctness of final outcomes and the faithfulness of agent processes, filling gaps left by prior task success benchmarks that focused solely on outcomes or lacked formal human agreement studies.
Benchmark results show that architectural pipeline design—specifically rubric generation, context selection, and multimodal scoring—is the primary determinant of verifier success, rather than LLM backbone scaling alone. This suggests further improvements may derive from advances in rubric learning and multimodal interaction, rather than solely from larger LLMs.
As the benchmark is open to the research community, extensions are anticipated in trajectory complexity, domain coverage, and formalization of rubric construction. The split between process and outcome reward additionally points to new analyses in agent error modes and the challenge of aligning agents toward not just final but also intermediate behaviors (Rosset et al., 5 Apr 2026).