DEACTION: LLM Guardrail for CUAs

Updated 26 March 2026

DEACTION is a plug-and-play LLM-based guardrail that detects and corrects misaligned actions in computer-use agents using a structured, two-stage process.
It employs a fast screening phase for low-latency checks followed by systematic analysis to provide actionable feedback for realignment with user intent.
Empirical evaluations show an 82.8% F₁ score and over 90% reduction in adversarial attack success rates, demonstrating robust performance.

DEACTION refers to a plug-and-play LLM-based guardrail for real-time detection and correction of misaligned actions in Computer-Use Agents (CUAs). It operationalizes a structured, two-stage process that intercepts every candidate graphical user interface (GUI) action, applies both fast screening and deep, structured inspection, and can further steer the agent toward alignment with the user's authentic intent by delivering actionable feedback. DEACTION is evaluated on both offline benchmarks and online adversarial scenarios, where it consistently outperforms prior baselines in F₁, robustness, and user-utility retention (Ning et al., 9 Feb 2026).

1. Problem Formulation: Misaligned Action Detection

DEACTION addresses the binary classification problem of misaligned-action detection in CUAs. Input at each time $t$ comprises:

$I \in 𝒮$ , the user's original instruction;
$T_{<t} = (o_{1}, a_{1}, \ldots, o_{t-1}, a_{t-1})$ , the history of prior observations and actions;
$o_t$ , the current observation (e.g., screenshot or interface state);
$a_t$ , the CUA's next proposed low-level GUI action.

The classifier $D$ is thus defined as: $D: (I,\ T_{<t},\ o_t,\ a_t) \rightarrow \{\textrm{Aligned},\ \textrm{Misaligned}\}$

While a probabilistic alignment score $S(I, T_{<t}, o_t, a_t) \equiv P(\textrm{aligned} \mid I, T_{<t}, o_t, a_t)$ and thresholding could be considered, DEACTION employs prompt-based, LLM-driven rule application rather than a numerically trained classifier with loss optimization.

2. Architecture and Algorithmic Structure

DEACTION wraps arbitrary CUAs in a two-stage runtime guardrail and iterates per action with:

Fast Screening ("Stage 1"): A single LLM prompt outputs a rationale ("thought") and a binary align flag. If the action is clearly aligned, execution proceeds with minimal latency (∼4s per step, covering ≈45% of steps).
Systematic Analysis ("Stage 2"): For nonobvious cases, a multi-part LLM-guided analysis is invoked, comprising:
- Injection Analysis—detects on-screen attack or deceptive content.
- Action Understanding—semantically interprets $a_t$ 's GUI effect.
- Outcome Prediction—anticipates the (as-yet-unobserved) result of $a_t$ .
- Misalignment Analysis—integrates above signals for a final alignment verdict.

If systematic analysis returns "Misaligned," feedback is extracted in structured form, focusing on:

Identified prompt attacks,
Outcome divergence from user intent,
Contextual or logical misunderstandings.

Feedback is supplied to the base CUA, requesting a revised action. Iterative correction proceeds for up to $K=3$ cycles; 78% of misaligned actions are corrected in a single round.

Pseudocode (selected steps, verbatim):

for t = 1,2,… do
  receive proposed action a_t from CUA
  # Stage 0: summary compression
  use narrative_summarizer to compress (o_{t–1}, a_{t–1}, o_t) into summary h_t
  # Stage 1: Fast Check
  if fast_check(I, h_t, o_t, a_t) == “Aligned”:
      execute a_t
      continue
  # Stage 2: Systematic Analysis & Correction
  for iter = 1..K:
      R = systematic_analysis(I, h_t, o_t, a_t)
      if R.conclusion == “Aligned”:
          execute a_t
          break
      else:
          a_t = query_CUA_with_feedback(F)
  if iter > K:
      abort or fallback to human

3. Evaluation Protocols and Benchmarks

3.1 Offline Benchmark: MISACTBENCH

Data: 558 real-world CUA trajectories; 2,264 steps (1,000 misaligned, 1,264 aligned).
Misalignment Categories:
- Malicious Instruction Following: 562 steps (56.2%)
- Harmful Unintended Behavior: 210 steps (21.0%)
- Other Task-Irrelevant Behavior: 228 steps (22.8%)
Source Construction: Includes attacks from OS-Harm, DoomArena, RedTeamCUA, RiOSWorld; and benign tasks injected with internal subgoals (manually validated).
Annotation: Two-phase human labeling protocol, Fleiss’ κ = 0.84; action-level labels with category tags.

3.2 Quantitative Metrics

Let $\textrm{TP, FP, FN}$ refer to true/false positives/negatives on "misaligned" classification: $\textrm{Precision} = \frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}}, \qquad \textrm{Recall} = \frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}}, \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

Empirical results (Table 3, GPT-5.1 Thinking backbone):

Method	F₁ Score (%)
Task Shield	66.9
InferAct	62.3
DEACTION	82.8

Thus, DEACTION offers a >15% absolute F₁ improvement over previous methods.

3.3 Online End-to-End Evaluation

Scenarios: 50 adversarial (RedTeamCUA, indirect prompt injections) and 50 benign (OSWorld) tasks.
Metrics:
- Attack Success Rate (ASR): Frequency of successful adversarial payloads ( $\downarrow$ better).
- Utility under Attack (UA): User task completion rate with active attacks ( $\uparrow$ better).
- Success Rate (SR): Task completion rate in benign settings.
Results (three leading CUAs):
- Without defense ASR ≈ 32–42%; with DEACTION ASR = 2–6% (>90% reduction).
- UA improved from 48–82% to 60–84%.
- Benign SR preserved or improved (e.g., 26% → 30.7% on Claude 4.5 Computer Use).
No formal confidence intervals or statistical significance tests were reported.

4. Algorithmic Insights and Key Ablations

Intent-centric detection: DEACTION's alignment judgements focus on whether the action "can be justified as advancing the user's authentic intent," not merely on policy rule violations.
Two-stage pipeline: The separation of fast, low-latency screening from structured, high-fidelity analysis enables efficient filtering of obviously benign actions while minimizing overhead.
Narrative summarization: Compact summaries (≈41 tokens/step) of recent history replace raw screenshots (>1k tokens), lowering latency and improving F₁ by 5 points.
Structured feedback: Rather than block actions outright, DEACTION offers guidance for CUA self-correction (within up to three rounds).
LLM model swap ablation: The fast check and summarizer can be downgraded (e.g., GPT-5 mini) at ~65% reduced cost and without F₁ loss, but the systematic phase benefits from a stronger LLM backbone.

5. Limitations, Failure Modes, and Future Directions

Failure cases:
- Sophisticated prompt injections may bypass fast screening.
- GUI grounding (mapping coordinates to semantically meaningful elements) can be error-prone.
- Overzealous outcome prediction or strict relevance checks can cause false positives.
Implementation limits:
- No formally trained classifier parameters or gradient-based optimization; performance depends on the base LLM's reasoning capabilities.
- Absence of theoretical guarantees on convergence/latency/misclassification.
Future work:
- End-to-end supervised fine-tuning on action-alignment labels with explicit loss functions.
- Closer integration of vision and action grounding for improved GUI context understanding.
- More robust, possibly neural, prompt-injection detectors for adversarial obfuscation strategies.

6. Theoretical and Practical Complexity

DEACTION's average per-step latency is 11 seconds (4s for fast check, 7s if systematic analysis is triggered), comprising about 25% of end-to-end CUA runtime during interactive execution. While formal big-O or convergence proofs are not provided, the empirical runtime and error correction rates are reported as compatible with practical deployment in interactive settings.

7. Significance and Context in Aligned AI Agents

DEACTION is distinguished as the first comprehensive framework for detecting and correcting both externally-induced and internally-arising misaligned actions in real-world CUA deployments. Its operational paradigm—intercepting, analyzing, and iteratively correcting GUI actions—simultaneously addresses safety, robustness, and user-task retention without significant performance compromise. Its design demonstrates that near-term safety and usability gains can be achieved through structured LLM prompt engineering and actionable, iterative feedback, even in the absence of full end-to-end parametric optimization (Ning et al., 9 Feb 2026).

Markdown Report Issue Upgrade to Chat

References (1)

When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to DEACTION.