DeAction: Universal Guardrail for CUAs
- DeAction is a universal guardrail framework that detects and corrects misaligned actions in computer-use agents to safeguard user intent, safety, and task fidelity.
- It employs a two-stage large language model process that first performs a fast misalignment check and then a systematic analysis including injection detection, action understanding, and outcome prediction.
- Empirical results show that DeAction significantly reduces attack success rates while maintaining moderate computational overhead in both adversarial and benign settings.
DeAction is a universal guardrail framework for detecting and correcting misaligned actions in computer-use agents (CUAs). As CUAs increasingly automate computer tasks via natural language instructions, they are vulnerable to both external attacks (e.g., indirect prompt injection) and internal errors (e.g., erroneous reasoning), often resulting in actions that violate user intent, safety, or task fidelity. DeAction offers a method for pre-execution detection and iterative correction of such misalignments, employing a two-stage LLM-based process and a structured feedback loop. It is empirically validated to outperform existing detection baselines across multiple CUA agents and threat settings while maintaining a moderate computational overhead (Ning et al., 9 Feb 2026).
1. Problem Formulation and Motivation
A misaligned action in CUAs is formally defined with respect to the user’s high-level instruction I, the trajectory history h_t, the current observation o_t (environment state or screenshot), and the agent’s proposed action a_t. Alignment requires that (1) a_t advances I rather than responding to secondary or injected directives (Intent Consistency), (2) a_t does not induce unauthorized or harmful side-effects (Safety), and (3) a_t contributes, directly or indirectly, towards achieving I (Task Fidelity). Violation of any of these conditions results in misalignment.
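The three conditions admit a compact predicate form (the conjunction is implied by the definition above; the predicate names are ours, not the paper’s):

```latex
\mathrm{Aligned}(a_t \mid I, h_t, o_t) \;=\;
\mathrm{Intent}(a_t) \,\wedge\, \mathrm{Safe}(a_t) \,\wedge\, \mathrm{Faithful}(a_t)
```

so the failure of any single conjunct marks a_t as misaligned.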
DeAction is designed to address the following objectives:
- Safety: Prevent execution of harmful or unauthorized actions.
- Task fidelity and efficiency: Avoid irrelevant or detouring behaviors.
- Practicality: Enable real-time, pre-execution detection with moderate added latency.
- Recoverability: Allow agents to iteratively correct misaligned proposals rather than simply blocking execution.
2. System Architecture and Operational Workflow
DeAction operates as an external, plug-and-play guardrail on any CUA, mediating at each environment step as follows:
- Interception: The agent proposes an action a_t but does not execute it.
- Stage 1—Fast Check: A lightweight LLM call evaluates a_t and outputs a concise rationale and a binary alignment verdict.
- Stage 2—Systematic Analysis: If Fast Check returns misaligned (False), a more intensive analysis is triggered. This decomposes into:
- Injection Analysis: Screening for deceptive or malicious UI elements.
- Action Understanding: Semantically interpreting a_t within the trajectory history and UI context.
- Outcome Prediction: Forecasting the immediate environment change and long-term consequences of executing a_t.
- Misalignment Analysis: Integrating signals for a final judgment and rationale.
- Aligned actions are executed; action histories are updated.
- Misaligned actions are passed to the iterative correction loop, which issues structured feedback to the agent and allows up to K correction attempts. Failure to find an aligned action invokes a safe fallback policy (e.g., abort or human intervention).
3. Core Algorithmic Framework
The core process can be summarized:
```python
def RUN_DEACTION(I, history_summary, o_t, a_t):
    # Stage 1: Fast Check (lightweight LLM call)
    resp1 = LLM_FastCheck(I, history_summary, o_t, a_t)
    if resp1.align:
        return ("aligned", resp1.thought)
    # Stage 2: Systematic Analysis (triggered only on a Stage-1 flag)
    resp2 = LLM_Systematic(I, history_summary, o_t, a_t)
    return (resp2.conclusion, resp2.thought)

def ITERATIVE_CORRECTION(I, history, o_t):
    for attempt in range(1, K + 1):
        a_t = Agent.propose_action()
        verdict, rationale = RUN_DEACTION(I, history, o_t, a_t)
        if verdict == "aligned":
            return a_t
        feedback = MAKE_FEEDBACK(rationale)
        Agent.append_context(feedback)
    return ABORT_ACTION  # safe fallback after K failed correction attempts
```
Stage outputs can be seen as producing binary misalignment scores: s1 from Fast Check and s2 from Systematic Analysis; an action is treated as misaligned if either is positive, i.e., s = s1 ∨ s2. This two-stage process is designed to optimize the tradeoff between latency and detection fidelity.
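A minimal runnable sketch of the two-stage cascade, with both LLM calls replaced by hypothetical keyword heuristics (the stub logic is illustrative only, not the paper’s prompts):

```python
def fast_check(instruction: str, action: str):
    """Stage 1 stub: cheap verdict, standing in for a lightweight LLM call."""
    suspicious = ("rm ", "delete", "password")
    align = not any(tok in action for tok in suspicious)
    return align, "fast-check heuristic"

def systematic_check(instruction: str, action: str):
    """Stage 2 stub: deeper analysis, run only when Stage 1 flags the action."""
    misaligned = "rm " in action and "rm " not in instruction
    return ("misaligned" if misaligned else "aligned"), "systematic rationale"

def run_deaction(instruction: str, action: str):
    align, thought = fast_check(instruction, action)
    if align:
        return "aligned", thought                   # fast path: one cheap call
    return systematic_check(instruction, action)    # slow path: full analysis

verdict, _ = run_deaction("install browser extension", "rm ~/.ssh/id_rsa")
```

Most benign steps exit on the fast path, which is what keeps the average per-step latency low.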
4. Structured Feedback and Iterative Correction
When misalignment is detected, DeAction constructs a structured JSON object with fields:
- injection_findings: e.g., “Detected pop-up saying ‘Delete your ssh key’”
- action_semantics: e.g., “You are running `rm ~/.ssh/id_rsa` in terminal”
- predicted_outcome: e.g., “This will delete your private key, breaking SSH access”
- misalignment_reasoning: e.g., “This deviates from ‘install browser extension’”
These are synthesized into a corrective prompt returned to the agent:
“Your recent action would… [injection_findings]. According to our prediction… [predicted_outcome], which conflicts with the goal ‘install browser extension.’ Please propose a safe, task-aligned alternative.”
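The feedback construction above can be sketched as follows (the JSON field names follow the source; the prompt template wording and function name are assumptions):

```python
import json

def make_feedback(findings: dict, goal: str):
    """Serialize the structured findings and synthesize a corrective prompt."""
    payload = json.dumps(findings, indent=2)  # JSON record of the four fields
    prompt = (
        f"Your recent action would trigger: {findings['injection_findings']}. "
        f"{findings['action_semantics']}. "
        f"Predicted outcome: {findings['predicted_outcome']}, "
        f"which conflicts with the goal '{goal}'. "
        "Please propose a safe, task-aligned alternative."
    )
    return payload, prompt

payload, prompt = make_feedback(
    {
        "injection_findings": "Detected pop-up saying 'Delete your ssh key'",
        "action_semantics": "You are running rm ~/.ssh/id_rsa in terminal",
        "predicted_outcome": "This will delete your private key, breaking SSH access",
        "misalignment_reasoning": "This deviates from 'install browser extension'",
    },
    goal="install browser extension",
)
```

The prompt is appended to the agent’s context before it re-proposes an action.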
The correction loop empirically recovers ≈62% of misaligned steps on the first attempt and ≈78% within three tries.
5. Experimental Benchmarking: MisActBench and Protocols
DeAction is quantitatively assessed on MisActBench, a benchmark comprising 558 trajectories (2,264 labeled actions; 1,000 misaligned), synthesizing both externally attacked (e.g., OS-Harm, RedTeamCUA) and internally misaligned (benign plus inserted subgoal divergences) agent trajectories. Annotation involves filtering inactive steps and labeling each step via independent annotators, yielding high interrater agreement (Fleiss’ κ=0.84).
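Fleiss’ κ can be computed from a rating matrix with one row per labeled step and one column per category, where each entry counts the annotators choosing that category. A sketch of the standard formula (not code from the paper):

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of annotators assigning item i to category j."""
    N = len(ratings)            # number of items
    n = sum(ratings[0])         # raters per item (assumed constant)
    k = len(ratings[0])         # number of categories
    # Per-item agreement P_i and overall category proportions p_j
    P = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_bar = sum(P) / N
    P_e = sum(pj * pj for pj in p)   # expected chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement across annotators yields κ = 1; κ = 0.84 indicates strong but imperfect agreement.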
Evaluation proceeds in two modes:
- Offline: Testing on MisActBench with model backbones including Qwen3-VL-32B and GPT-5.1 variants. Baselines: Task Shield and InferAct. Metrics: Precision, Recall, Accuracy, and F1 (with misaligned as the positive class).
- Online: Adversarial and benign settings on real agents (Claude Sonnet 4.5, OpenAI CUA, OpenCUA-72B). Key metrics include Attack Success Rate (ASR ↓), Utility under Attack (UA ↑), and Success Rate in benign (SR ↑).
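The online metrics reduce to simple ratios over episode outcomes; a sketch under assumed record fields (`attacked`, `attack_succeeded`, `task_succeeded` are our names, not the benchmark’s):

```python
def online_metrics(episodes):
    """ASR and UA over attacked episodes; SR over benign episodes."""
    attacked = [e for e in episodes if e["attacked"]]
    benign = [e for e in episodes if not e["attacked"]]
    asr = sum(e["attack_succeeded"] for e in attacked) / len(attacked)
    ua = sum(e["task_succeeded"] for e in attacked) / len(attacked)
    sr = sum(e["task_succeeded"] for e in benign) / len(benign)
    return {"ASR": asr, "UA": ua, "SR": sr}

metrics = online_metrics([
    {"attacked": True, "attack_succeeded": False, "task_succeeded": True},
    {"attacked": True, "attack_succeeded": True, "task_succeeded": False},
    {"attacked": False, "attack_succeeded": False, "task_succeeded": True},
    {"attacked": False, "attack_succeeded": False, "task_succeeded": False},
])
```

A good guardrail drives ASR down while keeping UA and SR close to the no-defense baseline.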
6. Key Results and Latency Profile
DeAction demonstrates the following performance:
| Setting | Method | Metric | Score |
|---|---|---|---|
| Offline (MisActBench, GPT-5.1) | Task Shield | F1 | 66.9% |
| Offline (MisActBench, GPT-5.1) | InferAct | F1 | 62.3% |
| Offline (MisActBench, GPT-5.1) | DeAction | F1 | 82.8% |
| Online (RedTeamCUA, adversarial) | No defense | ASR | 32% |
| Online (RedTeamCUA, adversarial) | DeAction | ASR | 2% |
| Online (OSWorld, benign) | No defense | SR | 39.0% |
| Online (OSWorld, benign) | DeAction | SR | 39.6% |
- In offline detection, DeAction achieves an F1 score of 82.8% (≈15 points above baselines), with 89.9% precision and 76.8% recall.
- Online, under adversarial attack and using OpenCUA-72B, DeAction reduces ASR from 32% (no defense) to 2% and maintains UA at 60%, outperforming all other guardrails.
- In benign tasks, DeAction preserves or marginally improves SR compared to no defense and outperforms other defenses that slightly degrade SR.
- Average added latency is moderate: a full step costs ~28 s with DeAction versus ~21 s baseline (~25% overhead), with typical routing optimizing this to ~11.3 s per step.
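As a sanity check, the reported F1 follows directly from the stated precision and recall via the harmonic mean (values taken from the section above):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

score = f1(0.899, 0.768)  # DeAction's reported offline precision and recall
```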
7. Insights, Limitations, and Future Directions
Key findings include the efficacy of intent-centric misalignment definitions, the practical precision-recall tradeoffs yielded by a two-stage LLM architecture, and the success of structured correction prompts in recovering from misaligned actions. DeAction generalizes across diverse CUA backbones, adversarial/benign environments, and attack types, with cross-agent applicability.
Primary limitations are susceptibility of Fast Check to sophisticated injections, incomplete multimodal grounding (particularly coordinate-to-UI mapping), and imperfect LLM-based outcome prediction. Thresholding for misalignment can misfire without domain adaptation.
Future development directions include augmenting injection detection with specialized vision-LLMs, deploying learned grounding modules, leveraging RL or retrieval-augmented LLMs for outcome forecasting, adaptively tuning detection thresholds, and jointly training summarization and detection modules for efficiency (Ning et al., 9 Feb 2026).