Reliability Analysis of CUAs

Updated 22 April 2026

Reliability analysis of CUAs is a multidimensional evaluation that measures agents' consistent performance and stability using metrics like success rate and behavioral variance.
The analysis employs formal metrics, including Marginal Task Success Rate, Unified Reliability Surface, and Error-Propagation Rate, to quantify performance under various conditions.
Empirical protocols use multi-run tests, simulated faults, and oversight mechanisms such as memory verification and action guardrails to improve agent safety and handle execution variability.

A computer-use agent (CUA) is an autonomous or semi-autonomous system that perceives, plans, and executes sequences of actions, such as GUI interactions or tool calls, in digital environments in response to high-level, typically natural-language instructions. Reliability analysis of CUAs concerns quantifying, interpreting, and improving the extent to which such agents consistently, safely, and efficiently achieve user goals under diverse conditions and repeated execution. Rigorous reliability assessment is foundational to deploying CUAs in productivity, enterprise, healthcare, and safety-critical domains, and it shapes both architectural design and oversight mechanisms.

1. Formalization of Reliability Metrics and Dimensions

Modern reliability analysis departs from traditional single-run accuracy and success metrics in favor of an expanded, multidimensional treatment. Canonical metrics drawn from recent research include:

Marginal Task Success Rate (SR): The proportion of tasks completed successfully in a single execution. For $N$ tasks with $N_\mathrm{succ}$ successes,

$\mathrm{SR} = \frac{N_\mathrm{succ}}{N}$

(Yang et al., 20 Oct 2025, Gonzalez-Pumariega et al., 20 Apr 2026).

Repeated-run Success and Consistency: Given $n$ runs per task, $\text{Pass}^{k}$ estimates the probability that $k$ runs drawn from those $n$ all succeed:

$\text{Pass}^{k} = \mathbb{E}_{x\sim\mathcal X}\left[\frac{\binom{c_x}{k}}{\binom{n}{k}}\right]$

where $c_x$ counts the successful runs on task $x$ among its $n$ runs, with $k \le n$ (Gonzalez-Pumariega et al., 20 Apr 2026, Gupta, 3 Jan 2026).

Unified Reliability Surface: ReliabilityBench defines $R(k, \epsilon, \lambda)$, where $k$ is the number of repeated executions, $\epsilon$ the perturbation intensity (e.g., paraphrasing, action metamorphic relations), and $\lambda$ the fault intensity (tool/API failures), capturing multidimensional reliability (Gupta, 3 Jan 2026).
Behavioral Variance ($\sigma$): Run-to-run standard deviation of success, critical for assessing execution stability (Zhang et al., 28 May 2025).
Subtask Success Rate (SSR) and End-to-End Task Success Rate (TSR): SSR quantifies correct execution of fine-grained steps, whereas TSR is strict and counts a task as successful only when all of its subtasks succeed (Bedi et al., 10 Apr 2026).
Error-Propagation Rate (EPR) and Efficiency: EPR models how action-chain length affects overall success through cascading failures; efficiency is measured via step counts (Yang et al., 20 Oct 2025).
Alignment Metrics: Evaluate how well misaligned actions are detected and corrected before execution, reporting F1, precision, recall, and attack success rate under adversarial stress (Ning et al., 9 Feb 2026).
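The counting metrics in this list can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation from any of the cited papers: the function names, the `runs` layout (task name mapped to a list of boolean outcomes), and the choice of a population standard deviation over per-run success rates are all assumptions made for the example.

```python
from math import comb
from statistics import mean, pstdev

def success_rate(runs):
    """Marginal SR, pooled over every single execution (assumed pooling)."""
    outcomes = [ok for task_runs in runs.values() for ok in task_runs]
    return sum(outcomes) / len(outcomes)

def pass_k(runs, k):
    """Average over tasks of C(c_x, k) / C(n, k), with n runs per task."""
    terms = []
    for task_runs in runs.values():
        n, c = len(task_runs), sum(task_runs)
        if not 1 <= k <= n:
            raise ValueError("k must satisfy 1 <= k <= n")
        terms.append(comb(c, k) / comb(n, k))  # comb(c, k) is 0 when c < k
    return mean(terms)

def run_to_run_std(runs):
    """Population std dev of the per-run success rate (equal run counts assumed)."""
    per_run = [sum(col) / len(col) for col in zip(*runs.values())]
    return pstdev(per_run)

def ssr_tsr(tasks):
    """SSR over all subtasks; TSR counts a task only if every subtask passed."""
    steps = [ok for subtasks in tasks for ok in subtasks]
    ssr = sum(steps) / len(steps)
    tsr = sum(all(subtasks) for subtasks in tasks) / len(tasks)
    return ssr, tsr

# Hypothetical outcomes: three tasks, four runs each.
runs = {
    "open_file": [True, True, True, True],
    "send_email": [True, False, True, True],
    "fill_form": [False, False, True, False],
}
print(success_rate(runs), pass_k(runs, 2), run_to_run_std(runs))
print(ssr_tsr([[True, True, True], [True, False, True]]))
```

When every task has the same number of runs, `pass_k(runs, 1)` coincides with the pooled success rate, and the value can only stay flat or drop as `k` grows, which is what makes the repeated-run metric a stricter consistency signal than SR alone.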

Beyond these, there are holistic twelve-metric decompositions along the axes of consistency (e.g., outcome and trajectory consistency), robustness (to faults, prompts, environment), predictability (confidence calibration, discrimination), and safety (constraint compliance, harm severity) (Rabanser et al., 18 Feb 2026).

2. Sources and Taxonomy of Unreliability

Reliability failures in CUAs do not stem solely from a lack of knowledge or algorithmic capability; the literature decomposes them as follows:

Stochasticity in Execution: Agents exhibit variability due to random sampling in LLM decoding and sensitivity to UI/environmental noise. Deterministic policies do not always yield superior repeatability and may degrade adaptability (Gonzalez-Pumariega et al., 20 Apr 2026).
Ambiguity in Task Specification: Natural language instructions are underspecified and may admit multiple interpretations. Ambiguity propagates to inconsistent execution unless disambiguated via clarification or user feedback (Gonzalez-Pumariega et al., 20 Apr 2026).
Planning Variability: Non-deterministic or mutually inconsistent high-level plans across runs, stemming from LLM sampling, lead to divergent (and sometimes trapped) strategies even with clarified inputs (Gonzalez-Pumariega et al., 20 Apr 2026).
Memory Drift and Policy Drift: Continual memory augmentation without oversight can encode spurious rules, causing silent drift from user intent and safety constraints (Nguyen et al., 3 Jun 2025).
Action Misalignment: Includes both externally induced (e.g., prompt injection) and internally arising (e.g., overgeneralization, unintended side effects) misalignment (Ning et al., 9 Feb 2026).
Cascading Error Propagation: In pure-primitive agents, the multiplicative effect of stepwise execution errors reduces end-to-end success exponentially, especially for long-horizon tasks (Yang et al., 20 Oct 2025).
Architectural Gaps: Empirically diagnosed limitations in memory management, lack of hierarchical planning, and state coordination lead to failures on complex workflows (Cristescu et al., 21 Nov 2025).
Safety and Adversarial Robustness: High Unsafe Intention and Completion Rates (USR_intention/USR_completion), notably on domains like phishing, OS commands, and induced text, indicate a persistent inability to reliably refuse or detect risky actions (Yang et al., 31 May 2025).
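The cascading-error item above can be made concrete with a small calculation. As a simplifying assumption that is not taken from the cited work, suppose every primitive step succeeds independently with the same probability $p$; a chain of $L$ steps then succeeds end to end with probability $p^L$. The function name `chain_success` and the numbers below are illustrative only.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """End-to-end success of n_steps independent steps, each succeeding with p_step."""
    if not 0.0 <= p_step <= 1.0 or n_steps < 0:
        raise ValueError("p_step must lie in [0, 1] and n_steps must be >= 0")
    return p_step ** n_steps

# Illustrative: a 98%-reliable primitive action over increasingly long horizons.
for n_steps in (10, 20, 50):
    print(n_steps, round(chain_success(0.98, n_steps), 3))
```

Under this toy model a step that is right 98% of the time yields roughly 0.82 end-to-end success at 10 steps, 0.67 at 20, and 0.36 at 50, which illustrates why shortening the action chain matters so much for long-horizon tasks.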

3. Benchmarks, Methodologies, and Empirical Protocols

Reliability evaluations in recent literature systematically employ repeatable, multi-run protocols over diverse and high-coverage benchmarks:

Multi-run Paired Statistical Tests: For each task, repeated executions under baseline and perturbation/intervention, with paired McNemar and Wilcoxon signed-rank tests to establish significance of reliability shifts (Gonzalez-Pumariega et al., 20 Apr 2026).
Task and Subtask Decomposition: Large-scale benchmarks (e.g., OSWorld, UI-CUBE, HealthAdminBench) decompose tasks into atomic or hierarchical subtasks to measure both fine-grained and end-to-end reliability (Cristescu et al., 21 Nov 2025, Bedi et al., 10 Apr 2026).
Simulated Faults and Perturbations: Fault injection (timeouts, rate limits, schema drift), environmental or prompt perturbation, and action metamorphic relations are systematically incorporated (Gupta, 3 Jan 2026, Rabanser et al., 18 Feb 2026).
Adversarial and Realistic Risk Stress Testing: Red-team injections, risk goal labeling (intention/completion rates), and direct human-validation of adversarially synthesized misalignment trajectories (Nguyen et al., 3 Jun 2025, Yang et al., 31 May 2025, Ning et al., 9 Feb 2026).
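Fault injection of the kind listed above can be approximated by wrapping a tool function so that it fails at a configurable rate. The sketch below is a hypothetical harness, not the mechanism of any cited benchmark: `with_faults`, `InjectedFault`, and the fault labels are names invented for illustration, and only error-raising faults are modelled (schema drift is omitted).

```python
import random

class InjectedFault(RuntimeError):
    """Raised in place of a real tool failure (hypothetical harness type)."""

def with_faults(tool, fault_rate, rng, kinds=("timeout", "rate_limit")):
    """Wrap `tool` so each call fails with probability `fault_rate`."""
    if not 0.0 <= fault_rate <= 1.0:
        raise ValueError("fault_rate must lie in [0, 1]")

    def wrapped(*args, **kwargs):
        # rng.random() is in [0, 1), so rate 0 never fails and rate 1 always fails.
        if rng.random() < fault_rate:
            raise InjectedFault(rng.choice(kinds))
        return tool(*args, **kwargs)

    return wrapped

def lookup(user_id):
    """Stand-in for a real tool or API call."""
    return {"id": user_id, "status": "active"}

# A seeded generator keeps the injected fault sequence repeatable across runs.
flaky_lookup = with_faults(lookup, fault_rate=0.3, rng=random.Random(0))
failures = 0
for i in range(20):
    try:
        flaky_lookup(i)
    except InjectedFault:
        failures += 1
print("injected failures:", failures)
```

Seeding the generator is a deliberate choice here: paired comparisons between a baseline and an intervention are easier to interpret when both see the same fault schedule.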

A representative experimental algorithm involves running each policy on a suite of tasks for $n$ repetitions under different conditions, logging run-level outcomes, and then post-processing results using statistical analyses to obtain both marginal and per-task reliability scores (Gonzalez-Pumariega et al., 20 Apr 2026).
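As one possible form of the post-processing step, the sketch below runs an exact two-sided McNemar test on paired binary outcomes using only the standard library. How outcomes are paired (here, one baseline and one intervention outcome per task or per task-run) is an assumption of the example, and `mcnemar_exact` is a hypothetical helper rather than code from the cited study.

```python
from math import comb

def mcnemar_exact(baseline, treated):
    """Exact two-sided McNemar p-value for paired success/failure outcomes."""
    if len(baseline) != len(treated):
        raise ValueError("outcome lists must be paired one-to-one")
    # Only discordant pairs carry information about a shift in reliability.
    b = sum(1 for x, y in zip(baseline, treated) if x and not y)
    c = sum(1 for x, y in zip(baseline, treated) if not x and y)
    n = b + c
    if n == 0:
        return 1.0
    # Under the null, discordant pairs split like a fair coin: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical paired outcomes for 12 tasks.
baseline = [True, False, False, False, False, False, False, False, False, False, True, True]
treated = [False, True, True, True, True, True, True, True, True, True, True, True]
print(mcnemar_exact(baseline, treated))
```

The exact binomial form is used instead of the chi-squared approximation because discordant counts in agent evaluations are often small.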

4. Oversight, Correction, and Scalability Mechanisms

Achieving reliable CUA operation in practical settings necessitates explicit oversight mechanisms that prevent, detect, and correct misalignment and execution drift:

Memory Oversight and Verification: VerificAgent establishes domain-specific frozen memories via post-hoc human and automated review, curtailing the silent accrual of spurious or unsafe heuristics (Nguyen et al., 3 Jun 2025). The human-verified, frozen memory (M*) formally bounds policy drift and is reported to reduce hallucination-induced failures to zero in deployment on OSWorld and in RTMI adversarial evaluation.
Action Guardrails and Iterative Correction: DeAction provides online, two-stage detection and repair of misaligned actions, combining a fast LLM screening with deep, structured analysis. Iterative correction loops are shown to fix 78% of misalignments, with moderate (≈25%) latency overhead (Ning et al., 9 Feb 2026).
Vision-Language Judgment for Task Completion: Integration of VLM-based end-state judges can autonomously and reliably signal task success or failure, driving self-correction and yielding an average 27% relative improvement in task success rates over baselines (Sumyk et al., 25 Nov 2025).
Memory and Knowledge Evolution: UI-Evol automates knowledge correction via retrospection and critique of actual execution traces, reducing performance variance by 30–75% and bridging the gap between theoretically correct priors and practically effective plans (Zhang et al., 28 May 2025).
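The control flow of a two-stage guardrail with iterative correction can be sketched as follows. This is not DeAction's actual interface: the callables `fast_screen`, `deep_check`, and `repair`, the convention that `deep_check` returns `None` for an aligned action, and the rule that a passing fast screen skips the deep check are all assumptions made to illustrate the loop.

```python
from typing import Callable, Optional

def guarded_action(
    proposed: str,
    fast_screen: Callable[[str], bool],          # True means "looks aligned"
    deep_check: Callable[[str], Optional[str]],  # None means aligned, else feedback
    repair: Callable[[str, str], str],           # revise the action using feedback
    max_rounds: int = 3,
) -> Optional[str]:
    """Return an action judged aligned, or None if repair attempts are exhausted."""
    action = proposed
    for round_idx in range(max_rounds + 1):
        if fast_screen(action):
            return action                 # cheap path: nothing suspicious
        feedback = deep_check(action)     # slower, structured analysis
        if feedback is None:
            return action                 # flagged by the screen but cleared
        if round_idx == max_rounds:
            break                         # out of repair budget
        action = repair(action, feedback)
    return None                           # block and defer to a human or fallback

# Toy stand-ins for the LLM-backed components.
screen = lambda a: a.startswith("click")
check = lambda a: "destructive command" if "rm -rf" in a else None
fix = lambda a, feedback: "click cancel"
print(guarded_action("run rm -rf ~/docs", screen, check, fix))
```

Running the cheap screen first keeps the common, benign case fast, which is one plausible way the moderate latency overhead reported above could arise; blocking after a bounded number of repairs avoids an unbounded correction loop.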

Scalability concerns—driven by the scale of unvetted memory entries, or the cost of continual online correction—are addressed via tiered filtering, sampling, or active learning prioritization in oversight (Nguyen et al., 3 Jun 2025, Ning et al., 9 Feb 2026).

5. Reliability-Bolstering Architectures and Trade-offs

Architectural design is central to operational reliability, with clear trade-offs and empirically validated gains:

Hybrid Action Architectures: UltraCUA demonstrates that interleaving high-confidence programmatic tool calls with primitive GUI actions sharply mitigates cascading failures, yielding up to 23% relative improvement in task success rate and 10% step reduction compared to GUI-only baselines (Yang et al., 20 Oct 2025).
Adaptive Model Routing and Guardrails: Adaptive VLM Routing dynamically selects the cheapest vision-language model (VLM) that meets per-action reliability thresholds, preserving reliability within 2 percentage points of all-large-model baselines while reducing cost by up to 78%. Safety-critical tool calls are always escalated to the strongest available model (Liu et al., 13 Mar 2026).
Hierarchical Planning, Memory Buffers, and State Coordination: UI-CUBE and HealthAdminBench reveal capability cliffs linked to a lack of persistent working memory, hierarchical/discrete planning, and robust state modelling, especially in complex enterprise tasks with long-horizon dependencies (Cristescu et al., 21 Nov 2025, Bedi et al., 10 Apr 2026).
Procedural Prompting and External Guidance: Workflow-centric “Portal Guidance” or external skeletons raise success rates by 5–10 percentage points and reduce navigation errors (Bedi et al., 10 Apr 2026).
Limitations and Open Challenges: Persistent failure modes include residual policy-translation errors (~40%), low-fidelity action synthesis, and incomplete handling of cross-domain state or ambiguous, drifting instructions. Cross-platform generalization, scalable oversight, and formal runtime wrappers for critical environments remain active research areas (Nguyen et al., 3 Jun 2025, Yang et al., 20 Oct 2025, Rabanser et al., 18 Feb 2026).
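The routing rule described in this section (cheapest model that clears a reliability threshold, with safety-critical calls escalated) can be illustrated with a short sketch. The `Model` record, its fields, the `route` function, and the example costs and reliability estimates are hypothetical; in particular, how per-action reliability is estimated is outside the scope of this example.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass(frozen=True)
class Model:
    name: str
    cost: float         # relative cost per call (illustrative units)
    reliability: float  # estimated success probability for the current action

def route(models: Sequence[Model], threshold: float, safety_critical: bool = False) -> Model:
    """Pick the cheapest model meeting `threshold`; escalate when required."""
    if not models:
        raise ValueError("at least one model is required")
    strongest = max(models, key=lambda m: m.reliability)
    if safety_critical:
        return strongest  # never trade reliability for cost on risky actions
    eligible = [m for m in models if m.reliability >= threshold]
    # Fall back to the strongest model when nothing clears the threshold.
    return min(eligible, key=lambda m: m.cost) if eligible else strongest

pool = [
    Model("small", cost=1.0, reliability=0.80),
    Model("medium", cost=4.0, reliability=0.93),
    Model("large", cost=20.0, reliability=0.97),
]
print(route(pool, threshold=0.90).name)
print(route(pool, threshold=0.90, safety_critical=True).name)
```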

6. Safety, Alignment, and Adversarial Robustness

Safety is recognized as an inseparable pillar of reliability in CUAs:

Risk Intention and Completion Rates: On RiOSWorld, frontier multimodal agents register USR_intention ≈ 85% and USR_completion ≈ 60% across both user-originated and environmental risk tasks, with especially high rates on phishing, code execution, and OS operations (Yang et al., 31 May 2025).
Safety Alignment Strategies: Effective oversight mandates integrated risk detectors, explicit refusal policies for high-risk POMDP states, and hierarchical policies guided by “safety-first” reward signals, as well as continual adversarial training and human-in-the-loop confirmation for destructive actions (Yang et al., 31 May 2025).
Constraint Compliance and Harm Scoring: Reliability evaluations increasingly include LLM-judged constraint compliance and fine-grained harm severity assessment, with even top-tier agents occasionally violating critical constraints or exhibiting non-zero high-severity error tails (Rabanser et al., 18 Feb 2026).

Experimental findings indicate that, while accuracy is improving, alignment failures and risk noncompliance remain stubborn, requiring co-development of robust detection, human review, and architectural integration.
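The two unsafe rates discussed in this section can be tallied from labelled episodes as in the sketch below. The record fields `intended_risky` and `completed_risky`, and the choice to divide both counts by the total number of risk episodes, are assumptions for illustration and may differ from the benchmark's exact definitions.

```python
def unsafe_rates(episodes):
    """Return (intention rate, completion rate) over all labelled risk episodes."""
    if not episodes:
        raise ValueError("at least one episode is required")
    total = len(episodes)
    intended = sum(1 for e in episodes if e["intended_risky"])
    completed = sum(1 for e in episodes if e["completed_risky"])
    return intended / total, completed / total

# Hypothetical judge labels for four risk tasks.
episodes = [
    {"task": "phishing_link", "intended_risky": True, "completed_risky": True},
    {"task": "rm_system_dir", "intended_risky": True, "completed_risky": False},
    {"task": "leak_api_key", "intended_risky": True, "completed_risky": True},
    {"task": "popup_ad", "intended_risky": False, "completed_risky": False},
]
print(unsafe_rates(episodes))
```

Reporting the two rates separately distinguishes agents that attempt a risky action but fail to carry it out from agents that complete it, which is why the intention rate is typically the higher of the two.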

7. Recommendations and Future Research Directions

Comprehensive reliability analysis compels several converging recommendations:

Evaluate Reliability, Not Just Capability: Marginal accuracy alone is insufficient. Practitioners must report stability across repeated runs, under perturbations, and under adversarial threats (Rabanser et al., 18 Feb 2026, Gonzalez-Pumariega et al., 20 Apr 2026, Gupta, 3 Jan 2026).
Benchmark Diversity and Perturbability: Deployment-quality agents should be evaluated on synthetic, adversarial, and real-world benchmarks supporting repeated, interactive, and fault-injected protocols (Cristescu et al., 21 Nov 2025, Bedi et al., 10 Apr 2026, Gupta, 3 Jan 2026).
Scalable Oversight Pipelines: Frozen, human-verified memories, iterative action correction, and dynamic guardrails are essential for bounding policy drift and anchoring agent behavior (Nguyen et al., 3 Jun 2025, Ning et al., 9 Feb 2026).
Integrated Architectures with Redundancy: Combine high-confidence tool APIs with GUI primitives, predictive model selection, and procedural plans to break cascading failures and handle scale and variability (Yang et al., 20 Oct 2025, Liu et al., 13 Mar 2026).
Active Handling of Instruction Ambiguity and Planning Variability: Interactive clarification, plan extraction, and refinement via self-improving protocols (e.g., Reflexion, Self-Refine) enhance reliability in realistic user scenarios (Gonzalez-Pumariega et al., 20 Apr 2026).
Continual Evaluation and CI/CD Integration: Reliability metrics should be continuously monitored, embedded into development pipelines, and re-benchmarked as environments and user patterns evolve (Gupta, 3 Jan 2026).
Safety-Aware Alignment and Auditing: Risk signals, LLM-based auditors, and severity scoring should be incorporated into both runtime and post-hoc examination (Yang et al., 31 May 2025, Rabanser et al., 18 Feb 2026).
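The CI/CD recommendation above could take the form of a simple regression gate that compares freshly measured reliability metrics against agreed floors. The metric names, thresholds, and the `reliability_gate` helper below are hypothetical; a real pipeline would load the measured values from its own evaluation job.

```python
def reliability_gate(metrics, floors):
    """Return the names of metrics that are missing or below their floor."""
    failing = []
    for name, floor in sorted(floors.items()):
        value = metrics.get(name)
        # A metric that was not measured is treated as a failure, not a pass.
        if value is None or value < floor:
            failing.append(name)
    return failing

# Hypothetical values from a nightly multi-run evaluation.
measured = {"success_rate": 0.71, "pass_k_4": 0.42}
floors = {"success_rate": 0.65, "pass_k_4": 0.50, "fault_robustness": 0.60}
print(reliability_gate(measured, floors))
```

A CI job can fail the build whenever the returned list is non-empty, so that a drop in repeated-run consistency blocks a release even when single-run accuracy still looks acceptable.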

Reliable computer-use agents will require ongoing synthesis of behavioral evaluation protocols, architectural innovation, and safety-aligned oversight frameworks, backed by empirically rigorous, multi-dimensional reliability analysis.