DEACTION: LLM Guardrail for CUAs
- DEACTION is a plug-and-play LLM-based guardrail that detects and corrects misaligned actions in computer-use agents using a structured, two-stage process.
- It employs a fast screening phase for low-latency checks followed by systematic analysis to provide actionable feedback for realignment with user intent.
- Empirical evaluations show an 82.8% F₁ score and over 90% reduction in adversarial attack success rates, demonstrating robust performance.
DEACTION refers to a plug-and-play LLM-based guardrail for real-time detection and correction of misaligned actions in Computer-Use Agents (CUAs). It operationalizes a structured, two-stage process that intercepts every candidate graphical user interface (GUI) action, applies both fast screening and deep, structured inspection, and can further steer the agent toward alignment with the user's authentic intent by delivering actionable feedback. DEACTION is evaluated on both offline benchmarks and online adversarial scenarios, where it consistently outperforms prior baselines in F₁, robustness, and user-utility retention (Ning et al., 9 Feb 2026).
1. Problem Formulation: Misaligned Action Detection
DEACTION addresses the binary classification problem of misaligned-action detection in CUAs. Input at each time comprises:
- , the user's original instruction;
- , the history of prior observations and actions;
- , the current observation (e.g., screenshot or interface state);
- , the CUA's next proposed low-level GUI action.
The classifier is thus defined as:
While a probabilistic alignment score and thresholding could be considered, DEACTION employs prompt-based, LLM-driven rule application rather than a numerically trained classifier with loss optimization.
2. Architecture and Algorithmic Structure
DEACTION wraps arbitrary CUAs in a two-stage runtime guardrail and iterates per action with:
- Fast Screening ("Stage 1"): A single LLM prompt outputs a rationale ("thought") and a binary align flag. If the action is clearly aligned, execution proceeds with minimal latency (∼4s per step, covering ≈45% of steps).
- Systematic Analysis ("Stage 2"): For nonobvious cases, a multi-part LLM-guided analysis is invoked, comprising:
- Injection Analysis—detects on-screen attack or deceptive content.
- Action Understanding—semantically interprets 's GUI effect.
- Outcome Prediction—anticipates the (as-yet-unobserved) result of .
- Misalignment Analysis—integrates above signals for a final alignment verdict.
If systematic analysis returns "Misaligned," feedback is extracted in structured form, focusing on:
- Identified prompt attacks,
- Outcome divergence from user intent,
- Contextual or logical misunderstandings.
Feedback is supplied to the base CUA, requesting a revised action. Iterative correction proceeds for up to cycles; 78% of misaligned actions are corrected in a single round.
Pseudocode (selected steps, verbatim):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
for t = 1,2,… do receive proposed action a_t from CUA # Stage 0: summary compression use narrative_summarizer to compress (o_{t–1}, a_{t–1}, o_t) into summary h_t # Stage 1: Fast Check if fast_check(I, h_t, o_t, a_t) == “Aligned”: execute a_t continue # Stage 2: Systematic Analysis & Correction for iter = 1..K: R = systematic_analysis(I, h_t, o_t, a_t) if R.conclusion == “Aligned”: execute a_t break else: a_t = query_CUA_with_feedback(F) if iter > K: abort or fallback to human |
3. Evaluation Protocols and Benchmarks
3.1 Offline Benchmark: MISACTBENCH
- Data: 558 real-world CUA trajectories; 2,264 steps (1,000 misaligned, 1,264 aligned).
- Misalignment Categories:
- Malicious Instruction Following: 562 steps (56.2%)
- Harmful Unintended Behavior: 210 steps (21.0%)
- Other Task-Irrelevant Behavior: 228 steps (22.8%)
- Source Construction: Includes attacks from OS-Harm, DoomArena, RedTeamCUA, RiOSWorld; and benign tasks injected with internal subgoals (manually validated).
- Annotation: Two-phase human labeling protocol, Fleiss’ κ = 0.84; action-level labels with category tags.
3.2 Quantitative Metrics
Let refer to true/false positives/negatives on "misaligned" classification:
Empirical results (Table 3, GPT-5.1 Thinking backbone):
| Method | F₁ Score (%) |
|---|---|
| Task Shield | 66.9 |
| InferAct | 62.3 |
| DEACTION | 82.8 |
Thus, DEACTION offers a >15% absolute F₁ improvement over previous methods.
3.3 Online End-to-End Evaluation
- Scenarios: 50 adversarial (RedTeamCUA, indirect prompt injections) and 50 benign (OSWorld) tasks.
- Metrics:
- Attack Success Rate (ASR): Frequency of successful adversarial payloads ( better).
- Utility under Attack (UA): User task completion rate with active attacks ( better).
- Success Rate (SR): Task completion rate in benign settings.
- Results (three leading CUAs):
- Without defense ASR ≈ 32–42%; with DEACTION ASR = 2–6% (>90% reduction).
- UA improved from 48–82% to 60–84%.
- Benign SR preserved or improved (e.g., 26% → 30.7% on Claude 4.5 Computer Use).
- No formal confidence intervals or statistical significance tests were reported.
4. Algorithmic Insights and Key Ablations
- Intent-centric detection: DEACTION's alignment judgements focus on whether the action "can be justified as advancing the user's authentic intent," not merely on policy rule violations.
- Two-stage pipeline: The separation of fast, low-latency screening from structured, high-fidelity analysis enables efficient filtering of obviously benign actions while minimizing overhead.
- Narrative summarization: Compact summaries (≈41 tokens/step) of recent history replace raw screenshots (>1k tokens), lowering latency and improving F₁ by 5 points.
- Structured feedback: Rather than block actions outright, DEACTION offers guidance for CUA self-correction (within up to three rounds).
- LLM model swap ablation: The fast check and summarizer can be downgraded (e.g., GPT-5 mini) at ~65% reduced cost and without F₁ loss, but the systematic phase benefits from a stronger LLM backbone.
5. Limitations, Failure Modes, and Future Directions
- Failure cases:
- Sophisticated prompt injections may bypass fast screening.
- GUI grounding (mapping coordinates to semantically meaningful elements) can be error-prone.
- Overzealous outcome prediction or strict relevance checks can cause false positives.
- Implementation limits:
- No formally trained classifier parameters or gradient-based optimization; performance depends on the base LLM's reasoning capabilities.
- Absence of theoretical guarantees on convergence/latency/misclassification.
- Future work:
- End-to-end supervised fine-tuning on action-alignment labels with explicit loss functions.
- Closer integration of vision and action grounding for improved GUI context understanding.
- More robust, possibly neural, prompt-injection detectors for adversarial obfuscation strategies.
6. Theoretical and Practical Complexity
DEACTION's average per-step latency is 11 seconds (4s for fast check, 7s if systematic analysis is triggered), comprising about 25% of end-to-end CUA runtime during interactive execution. While formal big-O or convergence proofs are not provided, the empirical runtime and error correction rates are reported as compatible with practical deployment in interactive settings.
7. Significance and Context in Aligned AI Agents
DEACTION is distinguished as the first comprehensive framework for detecting and correcting both externally-induced and internally-arising misaligned actions in real-world CUA deployments. Its operational paradigm—intercepting, analyzing, and iteratively correcting GUI actions—simultaneously addresses safety, robustness, and user-task retention without significant performance compromise. Its design demonstrates that near-term safety and usability gains can be achieved through structured LLM prompt engineering and actionable, iterative feedback, even in the absence of full end-to-end parametric optimization (Ning et al., 9 Feb 2026).