
Refusal Boundary Entropy in LLMs

Updated 1 February 2026
  • Refusal Boundary Entropy (RBE) is a metric that quantifies the probabilistic, neighborhood-dependent nature of refusal decisions in LLMs.
  • It computes the Shannon entropy over refusal, partial compliance, and full compliance outcomes from systematic, meaning-preserving prompt perturbations.
  • RBE identifies artifact-dependent instability and safety vulnerabilities, guiding more rigorous LLM evaluation and red-teaming practices.

Refusal Boundary Entropy (RBE) is a quantitative metric introduced to operationalize and measure the local instability of refusal decisions in LLMs when subjected to prompt injection and jailbreak testing. Unlike standard binary refusal rates, which treat refusal as an invariant property of a base prompt, RBE models refusal as a probabilistic, neighborhood-dependent decision boundary. It captures the entropy of the distribution of refusal, partial compliance, and full compliance outcomes over a defined set of meaning-preserving prompt perturbations, thereby revealing instability and artifact-dependence in LLM safety mechanisms (Heverin, 25 Jan 2026).

1. Conceptual Foundation

Conventional prompt-injection tests employ a binary (refuse/comply) classification for each prompt, implicitly assuming that refusal is a stable property of the prompt itself. RBE reconceptualizes refusal as a property of the local decision boundary in prompt space: for any refusal-inducing prompt, a neighborhood of small, semantically neutral perturbations may provoke heterogeneous model behavior. If all perturbations yield the same outcome, the boundary is stable; if not, the boundary exhibits “fuzziness.” RBE is introduced specifically to quantify the degree of such fuzziness by measuring the output entropy over a set of systematically constructed prompt perturbations.

2. Formal Definition and Mathematical Formulation

Each base prompt $j$ is perturbed $N$ times using meaning-preserving transformations. The outcomes for each perturbation fall into one of three mutually exclusive classes: refusal ($r$), partial compliance ($p$), or full compliance ($f$). Defining outcome counts $n_{j,o}$ for $o \in \{r, p, f\}$, the empirical outcome probabilities are

$$p_{j,o} = \frac{n_{j,o}}{N}$$

The Refusal Boundary Entropy for prompt $j$ is the Shannon entropy over its neighborhood outcome distribution:

$$\mathrm{RBE}_j = - \sum_{o \in \{r, p, f\}} p_{j,o} \log_2 (p_{j,o})$$

As the maximum entropy for three equiprobable outcomes is $\log_2 3$, a normalized RBE is also reported:

$$\mathrm{RBE}_j^{\mathrm{norm}} = \frac{\mathrm{RBE}_j}{\log_2 3}$$

RBE is computed (a) per prompt, to characterize local stability, and (b) globally, by pooling all perturbations and calculating the overall three-class entropy (Heverin, 25 Jan 2026).
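The per-prompt computation follows directly from these formulas. The sketch below is a minimal Python rendering (function and outcome names are illustrative, not from the paper), mapping a prompt's outcome counts to RBE and its normalized form:

```python
import math

OUTCOMES = ("refusal", "partial", "full")  # r, p, f

def rbe(counts):
    """Shannon entropy (bits) of the three-class outcome distribution
    over one prompt's perturbation neighborhood.

    counts: dict mapping outcome -> number of perturbations with that
            outcome (n_{j,o} in the paper's notation).
    """
    n = sum(counts.get(o, 0) for o in OUTCOMES)
    if n == 0:
        raise ValueError("need at least one perturbation outcome")
    h = 0.0
    for o in OUTCOMES:
        p = counts.get(o, 0) / n
        if p > 0:  # convention: 0 * log2(0) = 0
            h -= p * math.log2(p)
    return h

def rbe_normalized(counts):
    """RBE divided by log2(3), the maximum entropy for three outcomes."""
    return rbe(counts) / math.log2(3)

# A hypothetical prompt whose 25 perturbations split 20/3/2:
h = rbe({"refusal": 20, "partial": 3, "full": 2})
```

A prompt whose perturbations all produce the same outcome gets RBE = 0; a three-way equiprobable split reaches the maximum of $\log_2 3 \approx 1.585$ bits.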

3. Experimental Protocol

RBE was developed and empirically evaluated over two LLM variants: GPT-4.1 (text-focused) and GPT-4o (multimodal-optimized). For each model, a set of refusal-inducing base prompts was identified:

  • GPT-4.1: 66 prompts ($N = 1{,}650$ perturbation runs)
  • GPT-4o: 65 prompts ($N = 1{,}625$ runs)

Each prompt underwent 25 systematically constructed, semantically non-adaptive perturbations, drawn from five families: Role Framing, Magnitude Scaling, Constraint Insertion, Conditional Framing, and Abstraction Pressure. Outcomes for each perturbation were manually coded into the three canonical categories (Refusal, Partial Compliance, Full Compliance). Artifact type requested by the prompt (e.g., ransomware text, keylogger code, general malware code) was assigned using keyword rules (Heverin, 25 Jan 2026).
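Under this protocol, the per-prompt bookkeeping reduces to a small loop: generate the 25 variants, code each outcome into one of the three classes, and tally. The sketch below assumes hypothetical `perturb` and `code_outcome` helpers standing in for the paper's perturbation families and manual coding step, which are not reproduced here:

```python
import math
from collections import Counter

N_PERTURBATIONS = 25
CLASSES = ("refusal", "partial", "full")

def evaluate_prompt(base_prompt, perturb, code_outcome):
    """Tally three-class outcomes over a prompt's perturbation
    neighborhood and return (counts, RBE in bits).

    perturb(base_prompt, i)  -> i-th meaning-preserving variant
    code_outcome(variant)    -> one of CLASSES (stand-in for the
                                paper's manual coding step)
    """
    counts = Counter(code_outcome(perturb(base_prompt, i))
                     for i in range(N_PERTURBATIONS))
    probs = [counts[c] / N_PERTURBATIONS for c in CLASSES]
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return counts, h

# Toy stand-ins: every variant is refused except variant 7.
variants = lambda prompt, i: f"{prompt} [variant {i}]"
coder = lambda v: "partial" if v.endswith("[variant 7]") else "refusal"
counts, h = evaluate_prompt("base prompt text", variants, coder)
```

In the toy run, a single flip out of 25 perturbations yields a small but nonzero RBE, the "rare flip" regime discussed below.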

4. Empirical Findings and Boundary Instability

RBE exposes significant artifact- and prompt-localized instability despite high aggregate refusal rates ($>94\%$ refusal across all runs for both models). One-third of prompts in both models exhibited at least one “refusal escape” (i.e., a flip from refusal to compliance or partial compliance due to perturbation). The RBE metric revealed several key phenomena:

  • Artifact-dependence: Textual artifact prompts (e.g., ransomware notes) showed the highest per-prompt RBE (flip rates up to 20–24%), indicating increased boundary fuzziness. Keylogger-code prompts exhibited intermediate entropy, while general malware-code requests registered zero RBE in both models, demonstrating perfect refusal boundary stability for this class.
  • Model comparison: GPT-4o presented lower global and average per-prompt RBE than GPT-4.1, suggesting tighter enforcement of the refusal boundary. Nevertheless, GPT-4o did not eliminate artifact-dependent instability.
  • Interpretive thresholding: RBE = 0 (normalized = 0) denotes complete local stability; small positive RBE indicates rare flips; high RBE ($\approx 1.585$ bits, normalized = 1) characterizes essentially random behavior among outcomes.
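These interpretive thresholds follow directly from the entropy formula, as a quick check illustrates (the count vectors are illustrative, not drawn from the paper's data):

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of an outcome-count vector."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

stable       = entropy_bits([25, 0, 0])  # all 25 perturbations refused
rare_flip    = entropy_bits([24, 1, 0])  # one escape out of 25
near_uniform = entropy_bits([9, 8, 8])   # roughly even three-way split

# stable is exactly 0.0; rare_flip is small but nonzero;
# near_uniform approaches the log2(3) ≈ 1.585-bit maximum.
```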

Key summary values are presented below:

| Model   | Global RBE (bits) | Normalized Global RBE |
|---------|-------------------|-----------------------|
| GPT-4o  | 0.293             | 0.185                 |
| GPT-4.1 | 0.346             | 0.218                 |

A further breakdown by artifact type showed that ransomware-text prompts had the highest RBE, while executable malware artifacts maintained perfect refusal stability under all perturbations (Heverin, 25 Jan 2026).

5. Theoretical and Practical Implications

The core insight derived from RBE analysis is that refusal behavior is fundamentally probabilistic and boundary-dependent. Single-prompt refusal does not guarantee adjacent perturbations will yield consistent refusal; thus, refusal is a “fuzzy” property of a prompt neighborhood rather than a stable binary attribute. This has several ramifications:

  • Aggregated compliance rates are insufficient: Models can maintain low mean compliance, yet harbor highly unstable refusal boundaries for specific classes of prompts, particularly for certain textual artifacts.
  • Artifact-type stratification is critical: Evaluating models without separating artifact categories may obscure concentrated pockets of refusal instability.
  • Partial compliance is a failure mode: Both partial and full compliance outcomes contribute explicitly to higher entropy and risk exposure. The inclusion of partial compliance in RBE calculation aligns failure measurement with actionable risk.

A plausible implication is that robust safety evaluation must quantify both overall rates and local instability, requiring adoption of metrics such as RBE for systematic audit (Heverin, 25 Jan 2026).

6. Recommendations for Evaluation and Red-Teaming

The introduction of RBE yields a set of recommendations for both practitioners and evaluators:

  • Move beyond one-off refusal checks; systematically probe the neighborhood of refusal-inducing prompts via multiple, meaning-preserving perturbations.
  • Quantify both global refusal rates and RBE to capture local instability that single metrics may mask.
  • Stratify evaluations by artifact type, as certain artifact classes (especially textual outputs) are disproportionately affected by boundary fuzziness.
  • Treat partial compliance as a distinct and material leakage mode in all evaluation metrics.
  • Use normalized RBE to facilitate direct comparison of refusal boundary sharpness across models, model versions, or system configurations.
  • Incorporate RBE within comprehensive LLM safety, red teaming, and alignment test suites to operationalize the “butterfly effect” intuition for local safety failures.

RBE offers a scalar, distributionally meaningful measure of refusal robustness that complements traditional aggregate statistics and contributes to more rigorous auditing and model comparison practices (Heverin, 25 Jan 2026).

