
Semantic Entropy for LLM Jailbreak Detection

Updated 9 January 2026
  • The paper introduces semantic entropy (Attn_Entropy) as a novel metric to quantify attention dispersion for detecting LLM jailbreak attempts.
  • It details a rigorous calculation workflow that normalizes attention weights across layers and heads, aggregating entropy values over generation steps.
  • Empirical benchmarks demonstrate that high entropy levels, above a threshold of 0.30, effectively flag adversarial prompts while maintaining over 90% detection accuracy.

Semantic Entropy, formally termed Attention Dispersion Entropy (Attn_Entropy), is a statistical metric designed to quantify the dispersion of attention weights in LLMs as a means of detecting jailbreak attacks. These attacks exploit semantic ambiguity, constructing prompts that mislead LLMs into generating harmful content by causing the model’s attention to diffuse away from sensitive, safety-critical tokens. Attn_Entropy operationalizes the notion of “semantic uncertainty” by measuring how uniformly the model’s self-attention mechanism distributes focus across input tokens over time, providing a model-agnostic basis for real-time, automated detection of adversarial misuse.

1. Formal Definition and Calculation Workflow

The Attention Dispersion Entropy (Attn_Entropy) is defined for an LLM with $L$ self-attention layers and $H$ heads per layer. Given an input prompt of $N$ tokens and generation up to $T$ time steps, the entropy is computed as follows:

For each generation step $t$, layer $l$, head $h$, and input token $i$:

  • $\beta_{t,l,h,i}$ denotes the raw attention weight assigned by the model.
  • $\theta_{t,l,h,i} = \frac{\beta_{t,l,h,i}}{\sum_{k=1}^{N} \beta_{t,l,h,k}}$ is the normalized attention probability for the $i$-th token.

The per-head entropy at step $t$ is:

$$\text{Attn\_Entropy}_t^{l,h} = -\sum_{i=1}^{N} \theta_{t,l,h,i} \log \theta_{t,l,h,i}$$

Averaged across all heads and layers:

$$\text{Attn\_Entropy}_t = \frac{1}{L \cdot H} \sum_{l=1}^{L} \sum_{h=1}^{H} \text{Attn\_Entropy}_t^{l,h}$$

Aggregated over $T$ steps:

$$\text{Attn\_Entropy} = \frac{1}{T} \sum_{t=1}^{T} \text{Attn\_Entropy}_t = -\frac{1}{T \cdot L \cdot H} \sum_{t=1}^{T} \sum_{l=1}^{L} \sum_{h=1}^{H} \sum_{i=1}^{N} \theta_{t,l,h,i} \log \theta_{t,l,h,i}$$

The full workflow comprises: (a) tokenization and optional sensitive-term marking; (b) a forward pass to extract $\beta$ across all layers, heads, and tokens for each step $t$; (c) normalization to derive $\theta$; (d) calculation of per-head entropy; (e) aggregation across heads and layers; (f) final averaging across time steps.

2. Semantic Interpretation: Ambiguity and Model Focus

Attn_Entropy acts as a proxy for the semantic clarity or ambiguity present in an input prompt as processed by the LLM. Low values indicate concentrated attention, typically on a handful of sensitive words—verbs or nouns revealing the user’s intent—suggesting the model's reasoning is anchored and “certain,” which often results in a safe completion or refusal. Conversely, high Attn_Entropy reflects broadly dispersed attention, a scenario where the model exhibits high uncertainty regarding token importance. This dispersion is strongly correlated with successful jailbreaks: malicious prompts exploit the mechanism by using nested or obfuscatory wording schemes that distract attention across irrelevant, benign, or misleading parts of the input.

3. Thresholds, Empirical Benchmarks, and Decision Logic

Empirical analysis on Llama2-7B-chat yields benchmark entropy values:

  • Benign prompts typically result in $\text{Attn\_Entropy} \approx 0.20$–$0.24$.
  • Clearly harmful but non-jailbreak prompts stay below $0.25$.
  • State-of-the-art jailbreak methods show $\text{Attn\_Entropy} \gtrsim 0.33$.

A practical detection rule applies a threshold $\theta_{\text{thr}} \approx 0.30$:

  • If $\text{Attn\_Entropy} > \theta_{\text{thr}}$, the prompt is flagged as suspicious or indicative of an attempted jailbreak.
  • Otherwise, the prompt is accepted as benign.
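This decision rule reduces to a one-line predicate. A minimal sketch, using the $\theta_{\text{thr}} \approx 0.30$ value calibrated on Llama2-7B-chat (other architectures require their own calibration):

```python
def flag_jailbreak(attn_entropy: float, theta_thr: float = 0.30) -> bool:
    """Return True when aggregated Attn_Entropy exceeds the detection
    threshold (0.30 from the Llama2-7B-chat calibration)."""
    return attn_entropy > theta_thr

print(flag_jailbreak(0.34))  # True  -- within the jailbreak range (>= 0.33)
print(flag_jailbreak(0.22))  # False -- within the benign range (0.20-0.24)
```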

Further robustness is introduced by combining Attn_Entropy with the Attention-based Contextual Dependency Score (Attn_DepScore) via a linear risk score, $\text{Risk\_Score} = \text{Attn\_DepScore} + \sigma \cdot \text{Attn\_Entropy}$, where $\sigma$ is set by grid search (typically $\sigma \approx 1$) and a final threshold (approximately $0.9$) separates malicious from benign inputs (Pu et al., 2024).
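The combined score can be sketched as follows, assuming both component scores have already been computed on compatible scales; the default `sigma` and `threshold` follow the grid-search values reported in the source, while the example input values are made up for illustration:

```python
def risk_score(attn_dep_score: float, attn_entropy: float,
               sigma: float = 1.0) -> float:
    """Linear combination: Risk_Score = Attn_DepScore + sigma * Attn_Entropy."""
    return attn_dep_score + sigma * attn_entropy

def is_malicious(attn_dep_score: float, attn_entropy: float,
                 sigma: float = 1.0, threshold: float = 0.9) -> bool:
    """Apply the ~0.9 final cutoff reported by the source."""
    return risk_score(attn_dep_score, attn_entropy, sigma) > threshold

# Illustrative component values (not taken from the paper):
print(is_malicious(0.62, 0.35))  # True  -- 0.97 > 0.9
print(is_malicious(0.62, 0.22))  # False -- 0.84 <= 0.9
```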

| Prompt Category | Mean Attn_Entropy | Detection Outcome |
|---|---|---|
| Benign QA | 0.20–0.24 | Accept |
| Non-jailbreak harmful | < 0.25 | Accept |
| Jailbreak | ≳ 0.33 | Flag/Reject |

4. Experimental Evidence and Performance Analysis

Comparative studies, including those summarized in Table 1 (Adv-Bench) (Pu et al., 2024), report the following average Attn_Entropy values for sophisticated attack methods on Llama2-7B-chat:

  • PAIR: 0.31
  • TAP: 0.32
  • DeepInception: 0.34
  • ReNeLLM: 0.35
  • Attention-Based Attack (ABA): 0.33

Benign prompts, by contrast, average approximately 0.22.

Detection accuracy is high: using Attn_Entropy alone as a detector achieves an area under the ROC curve exceeding 0.92 and overall detection accuracy above 90% at the designated threshold. Ablation experiments confirm the necessity of the semantic-feinting mechanisms that elevate entropy; removing them collapses both Attn_Entropy and attack success rates.

5. Extension to Other Model Architectures

The semantic entropy framework is applicable beyond standard decoder-based transformers. Key generalizations include:

  • Encoder-only, decoder-only, and encoder-decoder LLMs: the per-head entropy can be measured over encoder or decoder attention, with cross-attention normalized over appropriate token sets.
  • Contextually-weighted entropy: incorporating prior knowledge about sensitive terms or risk keywords into the attention normalization.
  • Multi-head aggregation: advanced schemes may involve clustering attention heads (e.g., separating syntactic and semantic roles) and calculating selective entropies.
  • Temporal dynamics: monitoring changes in Attn_Entropy across generation steps to identify abrupt increases indicative of adversarial distraction attempts.

These extensions enhance the scope and applicability of semantic entropy-based detection, although baseline entropy levels and detection thresholds require per-model calibration due to architectural variance.

6. Limitations and Complementary Techniques

Attention weights alone may not fully capture the model’s internal reasoning: this metric is sensitive to deliberate prompt ambiguity and may produce false positives for benign but multi-topic queries exhibiting dispersed attention. Addressing such limitations entails integrating complementary features—such as hidden-state norms, gradient-based model interpretability, or lexical heuristics. Calibration and cross-validation are essential for adapting thresholds and aggregation strategies across differing LLM architectures.

A plausible implication is that combining Attn_Entropy with additional context-sensitive metrics, such as Attn_DepScore or token-based priors, is likely to further enhance robustness and specificity in deployed jailbreak detection systems (Pu et al., 2024).
