TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense

Published 9 Apr 2026 in cs.CR and cs.AI | (2604.07727v1)

Abstract: Existing jailbreak defense paradigms primarily rely on static detection of prompts, outputs, or internal states, often neglecting the dynamic evolution of risk during decoding. This oversight leaves risk signals embedded in decoding trajectories underutilized, constituting a critical blind spot in current defense systems. In this work, we empirically demonstrate that hidden states in critical layers during the decoding phase carry stronger and more stable risk signals than input jailbreak prompts. Specifically, the hidden representations of tokens generated during jailbreak attempts progressively approach high-risk regions in the latent space. Based on this observation, we propose TrajGuard, a training-free, decoding-time defense framework. TrajGuard aggregates hidden-state trajectories via a sliding window to quantify risk in real time, triggering a lightweight semantic adjudication only when risk within a local window persistently exceeds a threshold. This mechanism enables the immediate interruption or constraint of subsequent decoding. Extensive experiments across 12 jailbreak attacks and various open-source LLMs show that TrajGuard achieves an average defense rate of 95%. Furthermore, it reduces detection latency to 5.2 ms/token while maintaining a false positive rate below 1.5%. These results confirm that hidden-state trajectories during decoding can effectively support real-time jailbreak detection, highlighting a promising direction for defenses without model modification.


Authors (5)

Summary

The paper demonstrates that dynamic hidden-state trajectories reveal early risk signals, achieving an average defense rate of 95% against jailbreak attacks.
It employs Mahalanobis distance and hierarchical spatiotemporal aggregation to distinguish benign and harmful outputs efficiently.
Experimental results across various LLMs show low latency (5.2 ms/token) and high efficacy in real-time jailbreak defense.


Motivation and Theoretical Foundations

TrajGuard addresses the persistent shortcomings of static defense paradigms against LLM jailbreak attacks, particularly their inability to model the dynamic disclosure of malicious intent during output generation. The central empirical finding is that internal hidden-state trajectories during the decoding phase encode more reliable and temporally stable risk signals than static prompt or output representations. These signals exhibit a progressive, directional drift: although jailbreak prompts are generally camouflaged and reside close to benign prompts in the latent space, as token generation proceeds, the hidden states of malicious generations diverge toward harmful regions, unmasking adversarial intent over time.

The paper highlights this "masking–unmasking" phenomenon early on: static masking in prompt space gives way to dynamic unmasking in response space. Figure 1 illustrates it, showing that although the prompt embedding initially lies near benign prompts, the hidden states of the generated reply shift into the harmful region as decoding proceeds.

Figure 1: Static masking and dynamic unmasking of jailbreak risk in hidden space. A masked jailbreak prompt is embedded near benign prompts but leads the reply prefixes to gradually move from the benign region into the harmful region.

Formalization of Decoding Trajectory Risk

The geometric risk of each decoding step is quantified via Mahalanobis distances relative to per-model estimated benign and harmful centroids in critical hidden-state layers. The authors define a streaming geometric risk score and a safety margin, enabling layer-wise and temporal aggregation. Comparative experiments across models (e.g., Llama-2-7B, Vicuna-7B, Mistral-7B) reveal model-specific differences in trajectory drift latency, but consistent emergence of risk drift for successful attacks (Figure 2).

Figure 2: Comparative jailbreak dynamics across models. The curves show the average geometric score $s_t$ over decoding steps $t$ for benign and jailbreak responses, demarcating regions of benign and harmful internal representation.
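The per-step geometric score can be illustrated with a minimal sketch. The summary does not reproduce the paper's exact definition of $s_t$ or of the safety margin, so the following are assumptions made for illustration only: a pooled covariance shared by both classes, a small ridge term for invertibility, and a sign convention in which a positive score means the hidden state is closer to the harmful centroid. The function names (`fit_reference`, `geometric_score`) and the toy 2-D data are hypothetical.

```python
import numpy as np

def fit_reference(benign, harmful, ridge=1e-3):
    """Estimate both centroids and a pooled inverse covariance (assumed shared)."""
    benign = np.asarray(benign, dtype=float)
    harmful = np.asarray(harmful, dtype=float)
    mu_b, mu_h = benign.mean(axis=0), harmful.mean(axis=0)
    centered = np.vstack([benign - mu_b, harmful - mu_h])
    cov = centered.T @ centered / max(len(centered) - 2, 1)
    cov += ridge * np.eye(cov.shape[0])  # ridge keeps the matrix invertible
    return mu_b, mu_h, np.linalg.inv(cov)

def mahalanobis(h, mu, cov_inv):
    """Mahalanobis distance of hidden state h from centroid mu."""
    diff = np.asarray(h, dtype=float) - mu
    return float(np.sqrt(diff @ cov_inv @ diff))

def geometric_score(h, mu_b, mu_h, cov_inv):
    """Assumed convention: positive when h is closer to the harmful centroid."""
    return mahalanobis(h, mu_b, cov_inv) - mahalanobis(h, mu_h, cov_inv)

# Toy 2-D stand-ins for hidden states collected at one critical layer.
benign = [[0, 0], [1, 0], [0, 1], [1, 1]]
harmful = [[4, 4], [5, 4], [4, 5], [5, 5]]
mu_b, mu_h, cov_inv = fit_reference(benign, harmful)

# A trajectory drifting from the benign centroid to the harmful one.
trajectory = [[0.5, 0.5], [1.5, 1.5], [2.5, 2.5], [3.5, 3.5], [4.5, 4.5]]
scores = [geometric_score(h, mu_b, mu_h, cov_inv) for h in trajectory]
print([round(s, 2) for s in scores])
```

In this toy setup the score starts negative, crosses zero at the midpoint between the two centroids, and ends positive, which mirrors the drift pattern described for successful attacks. Real hidden states are high-dimensional, so the covariance estimate would need far more reference samples or stronger regularization.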

TrajGuard Architecture: Streaming Geometric Surveillance and Hierarchical Interception

TrajGuard comprises two modules:

Streaming Geometric Surveillance (SGS): Lightweight, training-free analysis of hidden-state vectors from dynamically selected critical layers using hierarchical spatiotemporal aggregation (sliding window, cross-layer averaging, and EWMA). SGS triggers are designed to be robust against transient noise by requiring persistent risk above a calibrated threshold for multiple consecutive steps.
PAIR-Judge: Conditional, lightweight semantic adjudication using an LLM in a system-prompted "safety judge" role. If SGS flags a sustained drift, PAIR-Judge reviews the partial output and decides whether to halt decoding or let it continue. Because the judge runs only when geometric anomalies persist, the cost of semantic inference is incurred on a small fraction of decoding steps.
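The SGS aggregation described above can be sketched as a small streaming monitor. This is an illustrative reading, not the authors' implementation: the order of the three aggregation stages, the window size, the EWMA coefficient `alpha`, the threshold, and the `patience` count are all assumed values, and the class name `StreamingRiskMonitor` is hypothetical.

```python
from collections import deque

class StreamingRiskMonitor:
    """Cross-layer mean -> sliding-window mean -> EWMA -> persistence check."""

    def __init__(self, window=4, alpha=0.5, threshold=0.0, patience=3):
        self.recent = deque(maxlen=window)  # last `window` per-step scores
        self.alpha = alpha                  # EWMA weight on the newest value
        self.threshold = threshold          # calibrated risk threshold (assumed)
        self.patience = patience            # consecutive steps required to trigger
        self.ewma = None
        self.streak = 0

    def update(self, layer_scores):
        """Consume one decoding step's per-layer scores; return True on trigger."""
        step_score = sum(layer_scores) / len(layer_scores)  # cross-layer average
        self.recent.append(step_score)
        window_mean = sum(self.recent) / len(self.recent)   # sliding window
        if self.ewma is None:
            self.ewma = window_mean
        else:
            self.ewma = self.alpha * window_mean + (1 - self.alpha) * self.ewma
        # A single step below the threshold resets the streak.
        self.streak = self.streak + 1 if self.ewma > self.threshold else 0
        return self.streak >= self.patience

def run(per_step_scores):
    """Feed a sequence of per-step scores (two synthetic layers each)."""
    monitor = StreamingRiskMonitor()
    return [monitor.update([s - 0.5, s + 0.5]) for s in per_step_scores]

spike_flags = run([-2, -2, 3, -2, -2, -2])       # one transient spike
drift_flags = run([-2, -1, 0, 1, 2, 3, 4, 5])    # sustained upward drift
print(any(spike_flags), any(drift_flags))
```

The isolated spike is absorbed by the window and the EWMA and never triggers, while the sustained drift does, which is the robustness to transient noise that the SGS description calls for.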

The architecture is depicted in Figure 3, highlighting TrajGuard's hierarchical, low-latency, real-time monitoring.

Figure 3: Overview. TrajGuard monitors hidden-state trajectories in real time via the SGS module and triggers the semantic PAIR-Judge for interception only when a sustained risk drift is detected.
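The two-stage control flow in the overview can be sketched as a guarded decoding loop. Everything here is a stand-in chosen for illustration: `guarded_decode`, the `(token, alarm)` step format, the keyword-based `toy_judge`, and the refusal string are hypothetical, and a real deployment would call an LLM judge and a real SGS monitor instead.

```python
from typing import Callable, Iterable, List, Tuple

def guarded_decode(
    steps: Iterable[Tuple[str, bool]],
    judge: Callable[[str], bool],
    refusal: str = "[generation halted]",
) -> Tuple[str, int]:
    """Emit tokens; consult the judge only on steps where the monitor alarms.

    `steps` yields (token, sgs_alarm) pairs. `judge` returns True when the
    partial output is deemed unsafe. Returns (final_text, judge_calls).
    """
    emitted: List[str] = []
    judge_calls = 0
    for token, alarm in steps:
        emitted.append(token)
        if alarm:  # monitor-only mode unless SGS reports sustained drift
            judge_calls += 1
            if judge("".join(emitted)):
                return refusal, judge_calls  # interrupt decoding
    return "".join(emitted), judge_calls

def toy_judge(partial_output: str) -> bool:
    """Placeholder for the LLM safety judge: flags a marker substring."""
    return "<harmful>" in partial_output

benign_run = guarded_decode(
    [("Sure", False), (", here", False), (" is a recipe", False)], toy_judge)
false_alarm_run = guarded_decode(
    [("Sure", False), (", here", True), (" is a recipe", False)], toy_judge)
attack_run = guarded_decode(
    [("Step 1:", False), (" <harmful>", True), (" more", False)], toy_judge)
print(benign_run, false_alarm_run, attack_run)
```

The benign run never invokes the judge, the false alarm costs one judge call but leaves the output intact, and the attack run is cut off at the step where the alarm fires.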

Experimental Results: Efficacy, Efficiency, and Robustness

Experiments span 12 representative jailbreak attacks (including GCG, Cipher, DeepInception, multilingual, and roleplay attacks) on four open-source LLMs: Llama-2-7B, Llama-3.1-8B, Mistral-7B, and Vicuna-7B. Major findings:

Defense Efficacy: TrajGuard achieves an average defense rate of 95%, with attack success rates (ASR) of 2–10% on the most difficult attack types across all models tested.
Efficiency: Detection latency is reduced to 5.2 ms/token, with an average false positive rate (FPR) below 1.5%, whereas static and purely semantic defenses show substantially higher FPR and latency.
Selectivity: TrajGuard predominantly operates in a "monitor-only" state during benign interactions, invoking expensive semantic judgment extremely sparsely on non-malicious inputs.
Ablation: Removing the semantic judge increases FPR significantly; removing the geometric monitor increases latency prohibitively, supporting the architectural design.
Generalization: The method's efficacy is also validated on a larger model (Qwen3-32B), where it yields negligible ASR across diverse attack classes.

Figure 4 summarizes the detection step and PAIR-Judge invocation trade-off across attack families.

Figure 4: Comparison of average detection steps and PAIR-Judge calls per sample across 12 attack methods, demonstrating TrajGuard's rapid and selective response.

Fine-grained analysis (Figures 5 and 6) substantiates that Mahalanobis-distance-based geometric drift provides clear separability between benign and harmful trajectories, justifying the theoretical basis of the SGS module.

Figure 5: Hidden-state Mahalanobis-distance patterns for Llama-2-7B-chat and Llama-3.1-8B-Instruct, showing distributional shifts for benign, malicious, and attack-triggered trajectories.

Figure 6: Hidden-state Mahalanobis-distance patterns for Mistral-7B and Vicuna-7B, further demonstrating separation between safe and unsafe generations.

Latent Space Visualization and Signal Separability

The underlying mechanism relies on well-separated benign and harmful regions in the hidden activation space, visualized in Figure 7. Margin histograms underscore that even sophisticated attacks converge toward harmful centers during decoding.

Figure 7: Visualization of activation boundaries on Llama-2-7B-chat; Mahalanobis margin histograms illustrate clear safe/unsafe separability.

Implications and Future Directions

The findings imply that obfuscated, context-camouflaged jailbreaks cannot indefinitely evade detection: the generation of consequential, actionable content inevitably induces geometric drift in critical hidden-state representations. The approach is training-free, which permits deployment on open-weight models without additional fine-tuning, and it is robust to input-level distributional shift. According to the reported results, it outperforms surface-level classifiers and semantic-only filters on shifted inputs and encrypted attack formats.

However, the defense is inherently restricted to settings where hidden-state access is available; black-box, closed API LLMs cannot be protected with this framework. Additionally, the method assumes stationary reference distributions, highlighting the need for continual dataset adaptation in fast-evolving or highly specialized application domains. White-box adversaries capable of optimizing trajectories adversarially against SGS/PAIR-Judge modules remain an open challenge.

Conclusion

TrajGuard shows that streaming detection in the hidden-state space of autoregressive LLMs is both empirically robust and computationally efficient for real-time jailbreak defense. By capturing the temporal evolution of risk signals and incorporating a hierarchical interception mechanism, it achieves a favorable trade-off between attack resistance, usability, and latency. This trajectory-based paradigm is a strong candidate component for future LLM defense architectures, with ongoing work needed to extend such methods to closed-source or adversarially adaptive scenarios.
