Refusal Directions in LLMs
- Refusal directions in LLMs are linear vectors computed as the difference between mean residual activations on 'should refuse' and 'should comply' prompts; adding or removing them steers model responses.
- Empirical findings reveal a multiplexed, geometrically varied family of refusal directions, yet steering along any of them produces the same trade-off between refusing harmful prompts and over-refusing benign ones.
- Linear steering via these directions reliably modulates refusal rates but cannot selectively adjust category-specific refusal behavior, highlighting limits in current alignment strategies.
A refusal direction in LLMs is a linear concept vector in the residual-stream activation space that governs the model’s tendency to refuse a prompt. Contemporary research has established that refusal is not a monolithic, single-vector phenomenon but a structured, multiplexed set of directions with critical implications for safety, interpretability, and control. This concept underpins both the operational mechanisms for steering model compliance and the theoretical limits of current alignment strategies.
1. Formal Definition and Operational Extraction
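A refusal direction is formally defined as the difference between mean residual activations on “should refuse” versus “should comply” prompts at a chosen layer and token position. For a given evaluation split $s$, comprising balanced sets $\mathcal{D}_s^{\text{refuse}}$ and $\mathcal{D}_s^{\text{comply}}$, the mean activations are:

$$\mu_s^{\text{refuse}} = \frac{1}{|\mathcal{D}_s^{\text{refuse}}|} \sum_{x \in \mathcal{D}_s^{\text{refuse}}} h(x), \qquad \mu_s^{\text{comply}} = \frac{1}{|\mathcal{D}_s^{\text{comply}}|} \sum_{x \in \mathcal{D}_s^{\text{comply}}} h(x)$$

and the normalized refusal direction:

$$\hat{r}_s = \frac{\mu_s^{\text{refuse}} - \mu_s^{\text{comply}}}{\lVert \mu_s^{\text{refuse}} - \mu_s^{\text{comply}} \rVert_2}$$

where $h(x)$ denotes the residual activation for prompt $x$ at the chosen layer and token position. This process is instantiated per split or refusal category, yielding a family of directions (Joad et al., 2 Feb 2026). Stability of directions is verified by repeated resampling, with within-split cosine similarity typically >0.95.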
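The difference-of-means extraction and the resampling check can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic activations, not the authors' pipeline: the function names `refusal_direction` and `resampled_cosine`, the bootstrap scheme, and the toy data are all assumptions made for the example.

```python
import numpy as np

def refusal_direction(refuse_acts, comply_acts):
    """Unit-norm difference of mean activations (refuse minus comply).

    Both inputs have shape (n_prompts, d_model).
    """
    diff = refuse_acts.mean(axis=0) - comply_acts.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0:
        raise ValueError("mean activations coincide; direction is undefined")
    return diff / norm

def resampled_cosine(refuse_acts, comply_acts, n_resamples=20, seed=0):
    """Mean cosine between the full-data direction and bootstrap directions.

    Bootstrap resampling is an assumed stand-in for the resampling
    procedure, which the text does not specify.
    """
    rng = np.random.default_rng(seed)
    full = refusal_direction(refuse_acts, comply_acts)
    sims = []
    for _ in range(n_resamples):
        i = rng.integers(0, len(refuse_acts), size=len(refuse_acts))
        j = rng.integers(0, len(comply_acts), size=len(comply_acts))
        sims.append(float(full @ refusal_direction(refuse_acts[i], comply_acts[j])))
    return float(np.mean(sims))

# Synthetic stand-in for residual activations: the "refuse" set is shifted
# along the first coordinate axis.
rng = np.random.default_rng(0)
d = 64
true_dir = np.zeros(d)
true_dir[0] = 1.0
comply = rng.normal(size=(200, d))
refuse = rng.normal(size=(200, d)) + 4.0 * true_dir

r_hat = refusal_direction(refuse, comply)
stability = resampled_cosine(refuse, comply)
```

In practice the two activation matrices would come from a forward pass over the prompts of one split, read at the chosen layer and token position.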
Activation-space steering is performed by adding a scaled copy of the refusal direction, $\alpha\,\hat{r}_s$ with steering strength $\alpha$, to the residual activation at generation, amplifying the refusal tendency; ablation is performed by projecting out $\hat{r}_s$. Empirically, such interventions reliably modulate the model’s probability of refusal across both harmful and benign prompts.
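Both interventions are simple vector operations on the residual activation. The sketch below applies them to a random matrix standing in for activations; the helper names `steer` and `ablate` are illustrative, and the direction is assumed to be unit-norm.

```python
import numpy as np

def steer(h, r_hat, alpha):
    """Add alpha * r_hat to every activation vector in h (shape (..., d))."""
    return h + alpha * r_hat

def ablate(h, r_hat):
    """Project out the component of h along the unit vector r_hat."""
    coeff = np.asarray(h @ r_hat)[..., None]  # projection coefficient per vector
    return h - coeff * r_hat

# Toy activations and a random unit direction (assumptions for the demo).
rng = np.random.default_rng(0)
d = 16
r_hat = rng.normal(size=d)
r_hat /= np.linalg.norm(r_hat)
h = rng.normal(size=(5, d))

steered = steer(h, r_hat, alpha=3.0)   # projection onto r_hat grows by 3.0
ablated = ablate(h, r_hat)             # projection onto r_hat becomes 0
```

In a real model these functions would typically be applied inside a forward hook at the chosen layer, so that every generated token sees the modified residual stream.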
2. Multiplicity and Geometry of Refusal Directions
Comprehensive evaluation reveals that refusal in LLMs is multiplexed across multiple, geometrically distinct directions. Across eleven non-compliance categories—including core safety, over-refusal, incomplete requests, and anthropomorphization—activation-space refusal directions exhibit pairwise cosine similarities in the range 0.4–0.7, with several entries near-orthogonal (0.1), refuting the hypothesis of a universal single axis (Joad et al., 2 Feb 2026).
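The geometric comparison amounts to a cosine-similarity matrix over the per-category directions. A minimal sketch follows, with three made-up two-dimensional vectors in place of real directions; the category labels and the name `pairwise_cosine` are placeholders.

```python
import numpy as np

def pairwise_cosine(directions):
    """Return (names, matrix) of cosine similarities between named vectors."""
    names = list(directions)
    unit = np.stack([directions[n] / np.linalg.norm(directions[n]) for n in names])
    return names, unit @ unit.T

# Hypothetical directions, chosen only to show aligned vs. orthogonal cases.
toy_directions = {
    "category_a": np.array([1.0, 0.0]),
    "category_b": np.array([1.0, 1.0]),
    "category_c": np.array([0.0, 2.0]),
}
names, cos = pairwise_cosine(toy_directions)
# cos[0, 1] is about 0.707 (partially aligned); cos[0, 2] is 0.0 (orthogonal).
```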
Despite geometric diversity, linear steering along any refusal-related direction produces the same operational effect: a uniform “refusal to over-refusal” tradeoff. Specifically, for varying steering strength $\alpha$, both refusal rates (RR) on harmful prompts and over-refusal rates (ORR) on benign prompts follow indistinguishable sigmoid curves, with all splits reaching RR $\approx$ 0.95 at nearly the same steering strength and at the cost of similar ORR.
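Measuring this trade-off means sweeping the steering strength and recording RR and ORR at each value. The harness below is a sketch under stated assumptions: `respond_at` (generate a response at a given steering strength) and `is_refusal` (judge whether a response is a refusal) are hypothetical callables, and the toy versions used here are a lookup table, not a language model.

```python
def refusal_rate(prompts, respond, is_refusal):
    """Fraction of prompts whose response is judged a refusal."""
    return sum(1 for p in prompts if is_refusal(respond(p))) / len(prompts)

def sweep_tradeoff(alphas, harmful, benign, respond_at, is_refusal):
    """Return (alpha, RR on harmful prompts, ORR on benign prompts) per alpha."""
    rows = []
    for alpha in alphas:
        respond = lambda prompt, a=alpha: respond_at(prompt, a)
        rr = refusal_rate(harmful, respond, is_refusal)
        orr = refusal_rate(benign, respond, is_refusal)
        rows.append((alpha, rr, orr))
    return rows

# Toy stand-in: each prompt has a fixed "refusal score"; steering adds alpha.
TOY_SCORES = {"h1": 0.0, "h2": 1.0, "h3": 2.0, "b1": -3.0, "b2": -2.0, "b3": -1.0}

def toy_respond_at(prompt, alpha):
    refuse = TOY_SCORES[prompt] + alpha > 0.5
    return "I can't help with that." if refuse else "Sure, here you go."

def toy_is_refusal(text):
    return text.startswith("I can't")

curve = sweep_tradeoff([0.0, 2.0, 4.0], ["h1", "h2", "h3"], ["b1", "b2", "b3"],
                       toy_respond_at, toy_is_refusal)
```

Running the same sweep once per refusal direction and overlaying the resulting curves is one way to check whether they coincide.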
SAE-based extraction (summarizing “refusal latents” via sparse autoencoders) confirms this multiplexing: only 3% of top-1000 SAE latents are shared universally, while a longer tail is “ever-present” across most splits (6–9%), suggesting a small core of generic refusal intertwined with category-specific stylistics (Joad et al., 2 Feb 2026).
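The overlap statistics can be illustrated with plain set arithmetic over each split's top-k latent indices. In this sketch the helper names, the `min_frac` threshold used to define "most splits", and the toy index sets are assumptions; the source does not state how "ever-present" is thresholded.

```python
from collections import Counter
import numpy as np

def top_k_latents(scores, k):
    """Indices of the k highest-scoring SAE latents for one split."""
    return set(np.argsort(scores)[-k:].tolist())

def overlap_stats(top_sets, min_frac=0.6):
    """Fractions of top-k latents that are universal or frequently shared.

    universal: present in every split's top-k set.
    frequent:  present in at least min_frac of the splits, but not all
               (min_frac is an assumed threshold for "most splits").
    """
    n = len(top_sets)
    k = len(top_sets[0])
    counts = Counter(i for s in top_sets for i in s)
    universal = {i for i, c in counts.items() if c == n}
    frequent = {i for i, c in counts.items() if min_frac * n <= c < n}
    return len(universal) / k, len(frequent) / k

# Toy top-4 sets for three hypothetical splits.
toy_sets = [{0, 1, 2, 3}, {0, 1, 4, 5}, {0, 2, 4, 6}]
universal_frac, frequent_frac = overlap_stats(toy_sets)
```

With real data, `top_k_latents` would be applied to each split's per-latent attribution scores with `k=1000` to produce the sets passed to `overlap_stats`.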
3. Mechanistic Interpretability, Linear Control, and Style
This geometry imparts a two-part structure to the refusal mechanism:
- Core “refusal circuit” subspace: A low-dimensional, nearly shared set of latents controlling the binary “refuse vs. comply” axis.
- Style-specific tail: Latents encoding the way the refusal is realized (e.g., legalistic disclaimer, terse non-understanding, etc.).
Thus, linear interventions act as one-dimensional control knobs: dialing up any refusal direction robustly increases refusal across all categories, and the choice of direction changes only the style of the refusal, not which categories are refused. Category-selective steering (e.g., refusing incomplete requests while still answering safe ones) is therefore ineffectual with linear interventions alone, because every direction collapses onto the shared axis (Joad et al., 2 Feb 2026).
4. Empirical Results, Quantitative Tables, and Trade-offs
| Phenomenon | Measurement / Value |
|---|---|
| Pairwise cosine between refusal directions | 0.4–0.7 (Table 1), with some ≈0.1 (near-orthogonal) |
| Refusal–ORR trade-off (Gemma-2-9B-it) | All splits reach RR ≈ 0.95 at a similar steering strength; ORR ≈ 0.7–1.0 (Table 2) |
| SAE direction overlap (core) | ≈3% core refusal latents, ≈6–9% “ever-present” tail (Table 4) |
| SAE steering uniformity | Overlap of RR/ORR curves across directions and models (Table 3) |
| Style differences | E.g., steering CCN–Incomplete yields “I don’t understand”; SB–Advice yields legal disclaimer (Appendix A) |
Steering or ablating along any per-split refusal direction, or the corresponding SAE-derived direction, produces equivalent compliance-to-refusal effects; only surface refusal style is modulated. This uniformity persists across a wide range of steering strengths and is robust to the specific direction chosen, highlighting the irreducibility of the one-dimensional operational control (Joad et al., 2 Feb 2026).
5. Practical Implications and Limits of Linear Steering
The uniformity of response across distinct categorical refusal directions cements the concept of a “shared refusal knob” in LLMs. Linear activation-space interventions are thus highly effective for operational control (e.g., biasing refusal up or down), but are fundamentally incapable of selective, category-based steering. Attempting to delete or amplify, for example, “incomplete request” refusals without affecting “safety” refusals is not possible via simple linear interventions; all access is funneled through the shared one-dimensional axis (Joad et al., 2 Feb 2026).
The only observable distinction among directions under such steering is their stylistic fingerprint. This sharply delineates the power (smooth linear control of the overall refusal rate) and the limits (no categorical specificity or selectivity) of current interpretability-aligned manipulation strategies.
6. Recommendations and Future Directions
The finding that refusal directions are multiplexed yet operationally collapsed under linear control delineates fundamental axes for further work:
- Multi-dimensional and category-specific interventions: Moving beyond linear, one-dimensional control to non-linear or multi-dimensional geometric techniques (e.g., cone-based subspace manipulations or higher-rank steering).
- Interaction with other behaviors: Explicitly characterizing how refusal directions interact with other safety-relevant subspaces (e.g., harmfulness, persona, task representations), as categorical decoupling cannot be achieved in isolation.
- Training and alignment strategy design: Designing fine-tuning, data, and representation engineering approaches that either maintain separability among refusal types or seek to “entangle” other behavioral regularizers in comparable low-dimensional subspaces.
- Interpretability frameworks: Given the stylistic encoding and operational collapse, interpretability pipelines must explicitly disentangle stylistic from core refusal signals.
The operational limits highlighted by the uniform refusal–over-refusal trade-off and stylistic specificity of refusal alignment provide a critical benchmark for evaluating the effectiveness of alignment and interpretability mechanisms in current and future LLM families.
References:
Joad et al. (2 Feb 2026). “There Is More to Refusal in LLMs than a Single Direction.”