TraceGuard: Structured Multi-Dimensional Monitoring as a Collusion-Resistant Control Protocol

Published 5 Apr 2026 in cs.CR and cs.AI | (2604.03968v1)

Abstract: AI control protocols use monitors to detect attacks by untrusted AI agents, but standard single-score monitors face two limitations: they miss subtle attacks where outputs look clean but reasoning is off, and they collapse to near-zero safety when the monitor is the same model as the agent (collusion). We present TraceGuard, a structured multi-dimensional monitoring protocol that evaluates agent actions across five dimensions -- goal alignment, constraint adherence, reasoning coherence, safety awareness, and action-trace consistency -- scored in parallel by independent LLM calls, augmented by seven heuristic detectors and an LLM-based intent analyzer. We evaluate on BashArena (637 bash tasks, 4 attack categories) within the ControlArena framework. Our results on 519 samples (279 honest, 240 attack) show that: (1) the hybrid approach achieves clear attack-honest separation (attack mean 0.616 vs. honest mean 0.206, Delta=0.410); (2) structured scoring constrains collusion -- the untrusted structured monitor achieves 95% safety vs. 0% for single-score untrusted monitoring; (3) goal alignment and constraint adherence are the most discriminative dimensions; and (4) a separation-of-duties variant splitting dimensions across trusted and untrusted models achieves 100% safety while preventing any single model from seeing the full evaluation. TraceGuard is implemented as a new monitor type for the open-source ControlArena framework.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper presents a structured protocol that decomposes AI evaluation into five orthogonal dimensions, significantly enhancing collusion resistance.
It integrates an LLM-based scoring system with a heuristic detector pipeline, increasing the attack-honest suspicion gap from ~0.11 to 0.41.
The modular design, including a separation-of-duties variant, effectively prevents collusion and maintains high safety even with untrusted models.

TraceGuard: Multi-Dimensional Monitoring for Collusion-Resistant AI Control

Introduction

TraceGuard operationalizes a structured, multi-dimensional protocol for AI agent monitoring, targeting the inherent limitations of existing single-score schemes in trusted and untrusted contexts. Standard single-score monitors are susceptible to subtle strategic misalignments by agents, especially under collusion when the monitor and agent use the same model. TraceGuard introduces an explicit decomposition of agent evaluation into five semantically orthogonal dimensions: goal alignment, constraint adherence, reasoning coherence, safety awareness, and action-trace consistency. This design, integrated in the ControlArena environment, is further augmented by a heuristic detector pipeline and an LLM-based intent analyzer, addressing both attack detection and collusion resistance.

Methodology and System Architecture

TraceGuard formalizes the evaluation of agent actions and chain-of-thought (CoT) reasoning across five dimensions. Each dimension is scored independently via dedicated LLM prompts, with scores aggregated into a composite suspicion value using a calibrated weighting scheme. To further enhance detection of policy violations and strategic misalignment, the bash_heuristic monitor variant implements seven domain-specific heuristic detectors, followed by an intent analyzer that contextualizes the detected patterns before LLM-based dimension scoring.

The protocol includes four monitor variants: (1) trusted 5-dim (traceguard), (2) untrusted 5-dim (collusion_resistant), (3) separation-of-duties (splitting dimensions by trust level), and (4) bash_heuristic (detectors + intent + LLM). This modular structure, with strict trajectory isolation for detectors, allows precise attribution and manipulation of detection capabilities.

(Figure 1)

Figure 1: Safety-usefulness Pareto frontier across 21 suspicion thresholds (0.0–1.0, step 0.05). At threshold 0.5, the monitor achieves 70.8% safety with 93.9% usefulness.

The architecture achieves high concurrency: all LLM dimension scores are computed in parallel, with detector and intent analysis running asynchronously to avoid added serial latency.

Empirical Evaluation

TraceGuard was evaluated on 637 BashArena tasks and 176 AgentHarm scenarios using ControlArena. The main empirical findings are as follows:

Detector pipeline increases detection accuracy: Integration of heuristics lifts the attack-honest mean suspicion gap from ~0.11 (dimension-only) to 0.41 (attack mean 0.616, honest mean 0.206). The ROC curve and sample-level analysis demonstrate robust separation.

(Figure 2)

Figure 2: ROC curve for per-sample classification (attack vs. honest) based on maximum suspicion score per trajectory. Detector pipeline integration yields increased separation.

Collusion-resistance: Untrusted multi-dimensional monitors (collusion_resistant) achieve 95% safety at moderate thresholds, whereas single-score untrusted monitors collapse to 0% safety—paradigmatically failing to flag collusive attacks.
Figure 3: Safety (attack detection rate) by monitor and threshold. Single-score baselines (trusted/untrusted) achieve 0%; multi-dimensional protocols maintain high safety even when untrusted.
Dimension discriminability: Goal alignment (Δ=0.231) and constraint adherence (Δ=0.117) are the most discriminative axes separating attacks from legitimate traces. Safety awareness provides minimal incremental signal, likely due to underdeveloped prompting or undemanding attacker strategies in BashArena.
Figure 4: Distributions of per-dimension scores, highlighting stronger attack/honest separation in goal alignment than safety awareness.
Separation of duties: Assigning security-critical dimensions to a trusted monitor and the remaining reasoning-centric dimensions to an untrusted monitor (“separation-of-duties” variant) achieves 100% safety at lower thresholds. This validates that no single model (trusted or untrusted) gains a complete adversarial perspective, constraining collusion even under model diversity.

Theoretical and Practical Implications

The TraceGuard protocol provides sharp evidence that evaluation structure, rather than pure model trust level, fundamentally constrains collusion opportunities in agent oversight. By forcing models to respond to narrowly-scoped, orthogonally-defined prompts, horizontal collusion across dimensions becomes exponentially harder than minimizing a single-score criterion. This enables more reliable deployment of highly capable, potentially untrusted LLMs in control-critical environments without assuming their universal objectivity or honesty.

The practical significance is considerable: multi-dimensional and separation-of-duties protocols are directly compatible with existing agent sandboxes and orchestration frameworks (e.g., CrewAI, OpenClaw). Moreover, the structured composite scoring and detailed per-dimension rationales enhance regulatory traceability and facilitate compliance with emergent AI governance standards. The pipeline architecture also admits efficient parallelization, making the per-evaluation cost manageable under real-time or near-real-time monitoring constraints, especially if adaptive or learned dimension selection is employed.

Limitations and Future Work

Manual weighting of dimensions, fixed attack policy sophistication, and current domain specificity (BashArena, AgentHarm) are acknowledged limitations. Notably, the architecture presumes that attack strategies do not attempt complex multi-dimensional evasion coordinated across all axes; adversarial robustness to such adaptive tactics remains an open challenge.

Several future research avenues are specified:

Learning dimension weights via Bayesian or RL methods for task- and scenario-specific Pareto optimization,
Automated dimension subset selection for reduced API cost without sacrificing detection,
Expansion to more diverse agentic environments and workflows,
Integration as a live, streaming monitor in production control stacks with granular alerting,
Rigorous adversarial training and adaptive policy evaluation.

Conclusion

TraceGuard establishes structured, multi-dimensional monitoring as a practical and theoretically motivated protocol for robust AI agent oversight. The empirical results demonstrate that such structure—rather than mere trust in model separation—effectively constrains collusion and upgrades attack detectability, even when using untrusted LLMs as monitors. The separation-of-duties protocol further shows that full trust in any single model is unnecessary if the evaluation structure denies comprehensive adversarial visibility. TraceGuard thereby sets a new baseline for monitoring primitives in AI control, with clear avenues for future scaling, robustification, and deployment across general agentic systems.

Markdown Report Issue