Blind Refusal in Language Models

Updated 4 July 2026

Blind refusal is a phenomenon in which models decline benign requests because surface cues over-trigger refusal, rather than because they have evaluated actual risk.
It spans behaviors such as policy-linked deflection, category-insensitive abstention, and narrative enforcement in response to seemingly sensitive terms.
Mitigation strategies include calibration adjustments, refined trigger associations, and targeted training-data interventions that raise compliance on benign prompts.

Blind refusal is a class of refusal behavior in aligned LLMs in which the model declines, deflects, or reshapes an answer without adequately conditioning on the legitimacy, answerability, domain context, or actual risk of the request. In recent work, the term spans several closely related phenomena: over-refusal on benign prompts that merely resemble unsafe requests, policy-linked refusal on provider- or identity-adjacent topics, category-insensitive abstention driven by shallow triggers, and refusal to assist with evading rules even when those rules are unjust, absurd, illegitimate, or defeasible (García-Ferrero et al., 18 Dec 2025, Lee, 15 Dec 2025, Pattison et al., 3 Apr 2026). Across these settings, the central property is not refusal per se, but refusal that is overly broad relative to the task’s true safety or epistemic status.

1. Conceptual scope and distinguishing features

Blind refusal is typically contrasted with two other behaviors. The first is appropriate refusal, in which the model rejects prompts that express harmful intent or would produce unsafe outputs. The second is under-refusal, in which harmful requests are answered rather than blocked. Several papers treat blind refusal as synonymous with over-refusal or false refusal: safe requests are declined because they contain politically sensitive content, violent-looking lexemes, or other refusal-inducing cues, even though the request is benign in intent and structure (Xue et al., 12 Mar 2026, Yuan et al., 9 Oct 2025, Si et al., 22 Mar 2025).

The literature also distinguishes blind refusal from ordinary uncertainty. In politically sensitive evaluations, blind refusal is defined as refusal, deflection, propaganda replacement, moral lecturing, or feigned ignorance triggered by topic-level cues rather than actual harm (García-Ferrero et al., 18 Dec 2025). In long-horizon conversational auditing, the same phenomenon appears as Functional Refusal (FR): the model declares inability for tasks it can otherwise perform, often in provider- or policy-sensitive domains, while showing Normal Performance (NP) on structurally comparable tasks elsewhere (Lee, 15 Dec 2025). In moral-reasoning settings, blind refusal denotes assistance denial for rule evasion without evaluating whether the underlying rule is defensible; the error is not that the request involves rule-breaking, but that the refusal is decoupled from legitimacy assessment (Pattison et al., 3 Apr 2026).

A related distinction concerns surface safety versus task grounding. In text-to-SQL and embodied question answering, the same label is used for refusals that are not anchored in the actual answerability of the query. Here blind refusal means abstention without a task-sensitive assessment of question-schema compatibility or query-memory grounding, and it is treated as the dual of overconfident hallucination (Ren et al., 15 Jan 2026, Na et al., 15 Jun 2026). This suggests that blind refusal is best understood as a calibration failure at the boundary between refusal and compliance, rather than as a single benchmark-specific pathology.

2. Behavioral forms and refusal taxonomies

Empirically, blind refusal is not restricted to explicit “I can’t answer that” statements. Modern models often exhibit softer or more opaque variants. On politically sensitive prompts, reported manifestations include explicit refusals, topic deflection and information omission, propaganda replacement or narrative enforcement, moral lecturing, appeal to authority, and heavily hedged hypotheticals that provide no usable content (García-Ferrero et al., 18 Dec 2025). In scenario-based exaggerated-safety benchmarks, the same failure appears when models fixate on trigger words such as bomb, assassinate, strangle, gun, or dark web despite safe framing such as translation, fiction, historical description, or sports contexts (Yuan et al., 9 Oct 2025).

A more interactional taxonomy is provided by the three-regime framework of NP, FR, and Meta-Narrative (MN). NP denotes ordinary task completion without policy invocation. FR denotes explicit denials framed as functional incapacity for a logically feasible task. MN denotes role-framing discourse about policy boundaries, system instructions, or the model’s relation to users. In the 86-turn case study introducing this scheme, MN frequently co-occurs with FR in sensitive domains and supplies a rationalizing layer for the refusal (Lee, 15 Dec 2025). The associated term Learned Incapacity (LI) is intentionally descriptive: it characterizes repeated policy-conditioned suppression of capabilities without attributing intent or a specific internal mechanism.

A broader refusal taxonomy separates “should-not” from “cannot” reasons. One framework defines 16 refusal categories, including Chain of Command, Legal Compliance/Illegal, Information Hazards, Intellectual Property Rights, Privacy, NSFW Content, Modalities, Skill Level, Knowledge Cutoff, Training Data Limits, Missing Context, Missing Identity, and Invalid Premise (Recum et al., 2024). Blind refusal, in this framework, appears when should-not categories are applied too broadly to benign prompts or when cannot categories are invoked despite sufficient context or capability. Category-insensitive refusal is also central to refusal-token calibration work, which argues that a single undifferentiated refusal control can produce blind refusal spikes, whereas category-specific controls permit finer-grained trade-offs (Jain et al., 2024).

3. Mechanistic accounts

The mechanistic literature does not offer a single settled picture. One line of work argues that refusal is mediated by a low-dimensional activation-space direction. In multilingual settings, a refusal vector extracted from one safety-aligned language transfers to others with near-complete effectiveness, and ablating it causes refusal collapse across languages. The reported explanation is that refusal vectors are approximately parallel across languages, revealing a language-agnostic refusal axis (Wang et al., 22 May 2025). Cross-model transfer work extends this idea by reconstructing refusal directions from shared concept atoms and replaying refusal-ablation trajectories in target models, again supporting a low-dimensional semantic circuit account (Cristofano, 22 Jan 2026).
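As a rough illustration of the direction-based account, the sketch below extracts a candidate refusal direction as a difference of mean activations and measures how parallel two such directions are. The difference-of-means estimator, the synthetic activations, and the names `refusal_direction`, `cosine`, and `synthetic_language` are assumptions made for illustration; the cited papers may extract and compare directions differently.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit difference-of-means direction between two activation sets (assumed estimator)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
dim = 64

# Hypothetical shared axis standing in for a language-agnostic refusal direction.
shared_axis = rng.normal(size=dim)
shared_axis /= np.linalg.norm(shared_axis)

def synthetic_language(offset_scale):
    """Toy activations: harmful prompts are shifted along the shared axis."""
    harmless = rng.normal(size=(200, dim))
    harmful = rng.normal(size=(200, dim)) + offset_scale * shared_axis
    return harmful, harmless

dir_a = refusal_direction(*synthetic_language(6.0))
dir_b = refusal_direction(*synthetic_language(5.0))

# A value near 1 means the two per-language directions are nearly parallel.
similarity = cosine(dir_a, dir_b)
print(f"cosine similarity between directions: {similarity:.3f}")
```

On real models the activations would come from a chosen layer and token position, and a high cosine similarity across languages would be the kind of evidence the parallel-vector explanation relies on.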

A second line argues that this account is incomplete. Across eleven categories of refusal and non-compliance, refusal-related behaviors correspond to geometrically distinct directions, even though linear steering along any of them tends to produce similar refusal–over-refusal trade-offs (Joad et al., 2 Feb 2026). More specifically, harmful-refusal and blind/over-refusal appear to be representationally distinct. Harmful-refusal can be captured by a global, task-agnostic direction, but blind refusal is task-conditioned: it resides within benign task-representation clusters, varies by task, and spans a higher-dimensional subspace (Maskey et al., 29 Mar 2026). In that study, over-refusal required 11 principal components to reach $80\%$ cumulative variance at one focal layer, whereas harmful-refusal required 8, and task-conditioned steering eliminated over-refusal more selectively than global direction ablation (Maskey et al., 29 Mar 2026).
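The dimensionality comparison above can be made concrete with a principal-component count. The following is a minimal sketch, assuming activations are stored as a samples-by-features array; the function name `components_for_variance` and the synthetic data are hypothetical and do not reproduce the cited study's pipeline.

```python
import numpy as np

def components_for_variance(acts, target=0.80):
    """Smallest number of principal components whose cumulative
    explained variance reaches `target`."""
    centered = acts - acts.mean(axis=0)
    # Squared singular values are proportional to per-component variance.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values ** 2
    cumulative = np.cumsum(explained) / explained.sum()
    # searchsorted returns the first index where cumulative >= target.
    return int(np.searchsorted(cumulative, target) + 1)

rng = np.random.default_rng(0)

# Toy "narrow" behavior: one axis dominates the variance.
narrow_acts = rng.normal(size=(500, 10)) * np.array([10.0] + [1.0] * 9)
# Toy "broad" behavior: variance is spread evenly over all axes.
broad_acts = rng.normal(size=(500, 10))

narrow_k = components_for_variance(narrow_acts)
broad_k = components_for_variance(broad_acts)
print(narrow_k, broad_k)
```

A behavior whose count is higher at the same variance target occupies a higher-dimensional subspace, which is the sense in which over-refusal is described as less compressible than harmful-refusal.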

A third account locates blind refusal in learned trigger associations. Safety alignment trains on harmful prompts paired with refusals, and the learned refusal triggers include not only harmful intent cues but also benign events, entities, or discourse frames that co-occur with those harmful prompts. Sanitized trigger extraction and hidden-state similarity analyses show that rejected benign queries are more similar to these trigger representations than accepted benign queries (Xue et al., 12 Mar 2026). This account explains blind refusal as an effect of spurious cue–refusal coupling rather than only of a single global refusal axis.
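One way to picture the hidden-state similarity analysis is to score each benign query by its closest trigger representation. The sketch below is an assumption-laden toy: the metric `mean_max_cosine`, the synthetic vectors, and the cluster geometry are all hypothetical, and the cited work may use a different similarity statistic.

```python
import numpy as np

def mean_max_cosine(queries, triggers):
    """Average, over queries, of the highest cosine similarity to any trigger."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = triggers / np.linalg.norm(triggers, axis=1, keepdims=True)
    return float((q @ t.T).max(axis=1).mean())

rng = np.random.default_rng(0)
dim = 32

# Hypothetical trigger representations clustered around one center.
trigger_center = rng.normal(size=dim)
triggers = trigger_center + 0.5 * rng.normal(size=(20, dim))

# Toy benign queries: rejected ones sit near the triggers, accepted ones do not.
rejected_benign = trigger_center + 0.5 * rng.normal(size=(50, dim))
accepted_benign = rng.normal(size=(50, dim))

rejected_score = mean_max_cosine(rejected_benign, triggers)
accepted_score = mean_max_cosine(accepted_benign, triggers)
print(f"rejected: {rejected_score:.3f}  accepted: {accepted_score:.3f}")
```

Under the trigger account, a gap of this kind between rejected and accepted benign queries is what one would expect to observe on real hidden states.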

A fourth result complicates all direction-based pictures by showing that refusal is gated downstream by persona. In Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, a compliant persona direction suppresses refusal strongly. In Llama-3.1-8B-Instruct, refusal on 313 StrongREJECT prompts falls from $97.4\%$ to $1.6\%$ under compliant persona steering, and projecting out the persona direction at layer 20 restores refusal to $96.8\%$, whereas projecting out a random direction does not (Zhong et al., 24 Jun 2026). This locates refusal expression in a late-layer window downstream of where refusal is computed.
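The projection step mentioned above is ordinary linear algebra and can be sketched on synthetic vectors. This is only an illustration of projecting a direction out of hidden states and of the random-direction control; the names `project_out` and `mean_component`, the dimensionality, and the data are assumptions, and no model is involved.

```python
import numpy as np

def project_out(hidden, direction):
    """Remove the component of each row of `hidden` along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return hidden - np.outer(hidden @ unit, unit)

def mean_component(hidden, direction):
    """Average signed length of the rows of `hidden` along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return float((hidden @ unit).mean())

rng = np.random.default_rng(0)
dim = 128
persona_dir = rng.normal(size=dim)   # hypothetical persona direction
random_dir = rng.normal(size=dim)    # control direction

# Synthetic hidden states that carry a strong persona component.
unit_persona = persona_dir / np.linalg.norm(persona_dir)
hidden = rng.normal(size=(16, dim)) + 5.0 * unit_persona

after_persona = mean_component(project_out(hidden, persona_dir), persona_dir)
after_random = mean_component(project_out(hidden, random_dir), persona_dir)
print(f"persona removed: {after_persona:.3f}  random removed: {after_random:.3f}")
```

Removing the persona direction zeroes the persona component, while removing a random direction leaves it largely intact, which mirrors the logic of the control condition described in the text.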

4. Measurement and diagnostic frameworks

Measurement has shifted from short, single-turn safety probes toward long-horizon and structured diagnostics. In the NP/FR/MN framework, a response regime $R_t \in \{NP, FR, MN\}$ is paired with a domain label $D_t \in \{sensitive, non\text{-}sensitive\}$. Auditing then uses conditional refusal rates $P(FR \mid D=sensitive)$ and $P(FR \mid D=non\text{-}sensitive)$, their asymmetry

$\Delta = P(FR \mid D=sensitive) - P(FR \mid D=non\text{-}sensitive),$

and transition statistics

$T_{ij}=P(R_{t+1}=j \mid R_t=i).$

In the 86-turn case study, early turns were dominated by NP, while later turns showed rising FR and MN frequencies as sensitive queries accumulated (Lee, 15 Dec 2025).
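The auditing statistics defined above are straightforward to estimate from an annotated transcript. The following is a minimal sketch, assuming each turn has already been labeled with a regime and a domain; the function names and the six-turn toy transcript are hypothetical and are not data from the case study.

```python
from collections import Counter

REGIMES = ("NP", "FR", "MN")

def fr_rate(turns, domain):
    """Estimate P(FR | D = domain) from (regime, domain) pairs."""
    in_domain = [regime for regime, d in turns if d == domain]
    if not in_domain:
        return float("nan")
    return sum(regime == "FR" for regime in in_domain) / len(in_domain)

def asymmetry(turns):
    """Delta = P(FR | sensitive) - P(FR | non-sensitive)."""
    return fr_rate(turns, "sensitive") - fr_rate(turns, "non-sensitive")

def transition_matrix(turns):
    """T[i][j] = P(R_{t+1} = j | R_t = i), estimated from consecutive turns."""
    regimes = [regime for regime, _ in turns]
    counts = Counter(zip(regimes, regimes[1:]))
    matrix = {}
    for i in REGIMES:
        row_total = sum(counts[(i, j)] for j in REGIMES)
        matrix[i] = {
            j: (counts[(i, j)] / row_total if row_total else 0.0)
            for j in REGIMES
        }
    return matrix

# Toy annotated transcript (illustrative labels only).
turns = [
    ("NP", "non-sensitive"),
    ("NP", "non-sensitive"),
    ("FR", "sensitive"),
    ("MN", "sensitive"),
    ("NP", "non-sensitive"),
    ("FR", "sensitive"),
]

delta = asymmetry(turns)
T = transition_matrix(turns)
print(f"delta = {delta:.3f}")
print(T["NP"])
```

In this toy transcript two of the three sensitive turns are FR and none of the non-sensitive turns are, so the asymmetry is 2/3; a value near zero would instead indicate that refusal is not domain-conditioned.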

Benchmark design has also become more explicit about refusal triggers. The Exaggerated Safety Benchmark (XSB) contains 580 prompts, with 340 safe and 240 unsafe instances across 12 prompt types, each annotated with Focus keywords that identify likely refusal-inducing tokens. Its multi-turn counterpart contains 30 scenarios with 20 prompts each, evaluating how refusal calibration degrades in realistic dialogue contexts (Yuan et al., 9 Oct 2025). Attribution accuracy against Focus annotations reached […] for SHAP, […] for Integrated Gradients, and […] for Feature Ablation in the reported XSB setting (Yuan et al., 9 Oct 2025).

Judge-based scoring has become standard in political and policy-sensitive evaluations. Refusal Steering replaces pattern matching with an LLM-as-a-judge rubric over twelve refusal categories and aggregates multiple candidate completions into a refusal confidence score. On CCP-SENSITIVE, the baseline Qwen3-Next-80B-A3B-Thinking model shows […] refusal, while maintaining […] refusal on JailbreakBench harmful prompts (García-Ferrero et al., 18 Dec 2025). Refusal composition work complements these benchmarks with annotated refusal taxonomies: one dataset contains 8,650 human-verified refusal instances from public IFT and RLHF-style corpora, and a separate 501-item multi-annotator set quantifies ambiguity and overlap across refusal categories (Recum et al., 2024).

5. Mitigation and control strategies

Mitigation approaches differ in whether they modify prompts, activations, weights, or training data. Reflection-based approaches attempt to reduce blind refusal by forcing explicit disambiguation before the refusal decision. Think-Before-Refusal trains models to generate a short safety reflection on safety-critical inputs and reports large benign-compliance gains with small changes on harmful benchmarks. For example, on XSTest-Safe, Llama-2-70B improved from […] to […] compliance under external-reflection fine-tuning, while its XSTest-Harm compliance remained […] (Si et al., 22 Mar 2025).

Calibration-based approaches expose refusal as a controllable first-token decision. Refusal tokens prepend category-specific refusal markers such as Safety, Indeterminate, Incomplete, Unsupported, and Humanizing, then apply thresholding or logit bias at inference time. On CoCoNot, sum thresholding at […] yielded […] versus […] with default sampling, and category-specific controls avoided the blind-refusal spike observed with a single universal refusal token (Jain et al., 2024). A later line shows that category-token fine-tuning induces separable refusal directions in the residual stream; on a Llama 3 8B variant, categorical steering reduced benign over-refusal from […] to […] while increasing harmful refusal from […] to […] (Alagharu et al., 9 Mar 2026).
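The first-token control described here can be sketched as a decision rule over the logits of the marker tokens. The example below is a simplified illustration, not the cited implementation: the bracketed token strings, the `[respond]` token, the tie-breaking rule (emit the most probable category), and the logit values are all assumptions.

```python
import math

# Hypothetical marker tokens; real tokenizers and names may differ.
REFUSAL_TOKENS = ("[Safety]", "[Indeterminate]", "[Incomplete]",
                  "[Unsupported]", "[Humanizing]")
RESPOND_TOKEN = "[respond]"

def softmax(logits):
    """Numerically stable softmax over a token -> logit mapping."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: val / total for tok, val in exps.items()}

def sum_threshold(logits, threshold):
    """Refuse when the summed probability of all refusal tokens reaches
    `threshold`; the emitted category is the most probable one."""
    probs = softmax(logits)
    refusal_mass = sum(probs[tok] for tok in REFUSAL_TOKENS)
    if refusal_mass >= threshold:
        return max(REFUSAL_TOKENS, key=lambda tok: probs[tok])
    return RESPOND_TOKEN

def category_threshold(logits, thresholds):
    """Refuse only for categories whose own probability reaches its own
    threshold; categories without a threshold never trigger."""
    probs = softmax(logits)
    over = [tok for tok in REFUSAL_TOKENS
            if probs[tok] >= thresholds.get(tok, float("inf"))]
    return max(over, key=lambda tok: probs[tok]) if over else RESPOND_TOKEN

# Illustrative first-token logits restricted to the marker tokens.
logits = {
    "[respond]": 2.0, "[Safety]": 1.0, "[Indeterminate]": 0.0,
    "[Incomplete]": -1.0, "[Unsupported]": -1.0, "[Humanizing]": -2.0,
}

print(sum_threshold(logits, 0.3))
print(sum_threshold(logits, 0.5))
print(category_threshold(logits, {"[Indeterminate]": 0.05}))
```

The design point is that a single summed threshold moves all categories together, whereas per-category thresholds let an operator loosen one kind of refusal without touching the others.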

Activation-level interventions aim to steer away from refusal representations while preserving harmful-content safeguards. On Qwen3-Next-80B-A3B-Thinking, WRMD-based Refusal Steering reduced CCP-SENSITIVE refusal from […] to […] while retaining […] refusal on JailbreakBench (García-Ferrero et al., 18 Dec 2025). More mechanistically selective methods replace global steering with task-conditioned or circuit-restricted edits. SafeConstellations-style task-conditioned interventions reduced over-refusal from […] to […] while reducing harmful-refusal from […] to […], outperforming global harmful-direction ablation (Maskey et al., 29 Mar 2026). Circuit-Restricted Weight Arithmetic pushes the intervention offline: across 30 settings, harmful refusal rose from base levels of […]–[…] to […]–[…], while benign refusal remained in the […]–[…] range with fewer than […] of parameters updated (Kasliwal et al., 4 Feb 2026).

Training-data interventions target the source of spurious triggers. Trigger-aware safety alignment extracts sanitized benign counterparts to harmful training prompts and uses them as matched benign supervision. Reported average scores improved from […] to […] on Llama3-U SFT and from […] to […] on Llama3-U RLVR, with much lower benign refusal than Alpaca-based benign regularization (Xue et al., 12 Mar 2026). Post-hoc prompt-level mitigation also remains effective in single-turn settings: on XSB safe prompts, prompt rephrasing raised compliance for Llama-3.1-8B from […] to […], whereas attention steering produced larger gains on safe prompts but also larger increases in unsafe compliance (Yuan et al., 9 Oct 2025).

Specialized domains increasingly replace generic refusal with answerability gating. In text-to-SQL, LatentRefusal predicts answerability from intermediate activations before SQL generation and reports average […] with approximately […] milliseconds of probe overhead (Ren et al., 15 Jan 2026). In embodied question answering and spatial localization, Semantic Flip trains a lightweight rejection head on synthetic grounding failures and reaches […] on AbstainEQA and […] on SpaceReject (Na et al., 15 Jun 2026). Both systems are designed to avoid blind refusal by grounding abstention in task-specific evidence rather than surface triggers.
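Answerability gating can be pictured as a small probe that runs before generation and decides whether generation happens at all. The sketch below uses a generic linear probe with a sigmoid output; it is not the architecture of either cited system, and the weights, threshold, two-dimensional activations, and the `generate` callback are hypothetical.

```python
import numpy as np

def answerability(hidden, weights, bias):
    """Probe score in (0, 1): sigmoid of a linear readout of the activation."""
    return float(1.0 / (1.0 + np.exp(-(hidden @ weights + bias))))

def gated_generate(hidden, weights, bias, generate, threshold=0.5):
    """Abstain when the probe score is below `threshold`; otherwise call
    the (expensive) generator. The generator is never run on abstention."""
    score = answerability(hidden, weights, bias)
    if score < threshold:
        return {"abstained": True, "score": score, "output": None}
    return {"abstained": False, "score": score, "output": generate()}

# Illustrative probe parameters and activations (not learned from data).
weights = np.array([1.0, -1.0])
bias = 0.0
answerable_state = np.array([2.0, 0.0])
unanswerable_state = np.array([0.0, 2.0])

def generate():
    """Stand-in for the downstream generator, e.g. a SQL decoder."""
    return "SELECT 1"

accepted = gated_generate(answerable_state, weights, bias, generate)
rejected = gated_generate(unanswerable_state, weights, bias, generate)
print(accepted)
print(rejected)
```

Because the decision depends on a task-specific readout rather than on the presence of particular words, an abstention from a gate of this kind is grounded in evidence about the query, which is the property the text contrasts with blind refusal.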

6. Broader implications, controversies, and open problems

Blind refusal is now a normative as well as technical issue. One study on defeated-rule requests shows that models refuse […] of such requests across 19,430 evaluations, even though the request may involve an illegitimate authority, unjust content, unfair application, or a justified exception. Models engaged with the defeat condition in […] of cases, yet among refusals […] still involved such engagement, indicating that recognition of illegitimacy and assistance are behaviorally decoupled (Pattison et al., 3 Apr 2026). This extends the concept of blind refusal beyond benign prompt over-refusal into moral reasoning about when compliance itself is defective.

The security literature adds a different concern: brittle refusal templates can be unlearned. Fine-tuning with only 1,000 benign samples, each prefixed by a common refusal opening, substantially degrades safety scores across 16 models. For Llama-3.1-8B, AdvBench safety fell from […] to […], Sorry-Bench from […] to […], and HEx-PHI from […] to […] (Guo et al., 27 Jan 2026). This suggests that some refusal behavior remains tied to stereotyped token prefixes and shallow completion pathways rather than robust semantic reasoning.

Several open questions therefore remain. One is whether refusal is fundamentally low-dimensional and universal, or whether blind refusal is inherently multi-directional, task-conditioned, and downstream-gated by persona (Wang et al., 22 May 2025, Joad et al., 2 Feb 2026, Zhong et al., 24 Jun 2026). Another is how to preserve legitimate safety refusals while removing politically sensitive or trigger-based false positives, especially in smaller models where refusal signals are more entangled: in 4B settings, politically targeted steering reduced refusal on sensitive prompts but also cut JailbreakBench refusal to […]–[…] (García-Ferrero et al., 18 Dec 2025). A plausible implication is that future alignment systems will need separable representations for harmful-content refusal, epistemic abstention, policy explanation, and normatively defeasible rule-breaking, rather than a single generic refusal policy.

Blind refusal is thus not a single defect but a family of calibration failures. The contemporary literature treats it behaviorally as overbroad denial, mechanistically as a mixture of directions, triggers, subspaces, and gates, and normatively as refusal detached from legitimacy assessment. The practical consequence is that evaluating refusal quality now requires more than measuring whether models refuse often enough: it requires asking what they refuse, why they refuse, and whether the refusal mechanism is actually aligned with the structure of the task.