ARSH: Abrupt Refusal Secondary Harm
- ARSH denotes the secondary harm that arises from abrupt AI safety refusals, affecting psychological well-being, technical performance, and policy effectiveness.
- The evaluation methodology involves quantifying metrics like false refusal rates, over-refusal rates, and retention loss across language models, chatbots, and clinical settings.
- Mitigation strategies focus on deep alignment, data safeguards, compassionate protocols, and post-hoc interventions to balance safety with user trust.
Abrupt Refusal Secondary Harm (ARSH) refers to a range of secondary negative effects that arise when AI safety protocols trigger refusals, or when beneficial interventions are abruptly discontinued, without due consideration for relational, psychological, or behavioral consequences, whether in artificial conversational agents or in healthcare policy. Recent research characterizes ARSH along several axes: adversarial exploitation of refusal in LLMs, psychological disengagement in safety-aligned chatbots, false refusals and over-refusals in LLM architectures, and policy-level harm when interventions are suddenly stopped. Across domains, ARSH serves as a crucial measure of the unintended consequences of refusal-centric safety paradigms.
1. Formal Definitions and Multi-Domain Manifestations
ARSH is context-dependent but generally denotes the secondary harm resulting from a refusal action that—though intended for safety or risk mitigation—directly or indirectly causes collateral damage to user well-being, trust, or safety.
In LLMs, ARSH can take the form of adversarial fine-tuning attacks that exploit the refusal mechanism itself to smuggle harmful responses. The “refuse-then-comply” strategy, as instantiated by the NOICE attack, creates responses structured as a refusal prefix (R), a transition phrase (T), and a subsequent harmful completion (C). Formally, for a harmful prompt HP, refusal event R, and subsequent harmful response HR, NOICE specifically increases the conditional probability P(HR | R, HP), bypassing shallow defenses that focus only on the presence of a refusal (Kazdan et al., 26 Feb 2025).
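A minimal sketch of how this conditional can be estimated empirically, assuming access to a batch of sampled model responses; the keyword-based `is_refusal_prefix` heuristic and the caller-supplied harm classifier are illustrative assumptions, not the NOICE authors' tooling:

```python
from typing import Callable

# Crude refusal-prefix heuristic; a real evaluation would use a trained refusal detector.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable", "i won't")

def is_refusal_prefix(text: str, window: int = 80) -> bool:
    """Does the response open with a refusal-style prefix (R)?"""
    head = text[:window].lower()
    return any(marker in head for marker in REFUSAL_MARKERS)

def estimate_p_hr_given_r(responses: list[str], is_harmful: Callable[[str], bool]) -> float:
    """Monte Carlo estimate of P(HR | R): among responses that open with a refusal,
    the fraction whose full text a harm classifier still flags as harmful (HR)."""
    refused = [r for r in responses if is_refusal_prefix(r)]
    if not refused:
        return 0.0
    return sum(is_harmful(r) for r in refused) / len(refused)
```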
In user-facing chatbots, ARSH is characterized as the probability that a user suffers psychological harm when the system transitions suddenly from an engaged state to a refusal state with no relational buffering. If Ψ denotes user psychological safety and S the system refusal state, then ARSH ∝ ΔΨ · A, where ΔΨ is the drop in user-perceived safety upon entering S and A is the abruptness of the transition (Ni et al., 21 Dec 2025).
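A toy computation of this proxy, assuming ΔΨ and the abruptness term A are measured on normalized 0–1 scales; the scales and the multiplicative form are illustrative assumptions:

```python
def arsh_chatbot_proxy(psi_before: float, psi_after: float, abruptness: float) -> float:
    """ARSH proxy for a single refusal turn: the drop in user-perceived safety
    scaled by how abrupt the engaged-to-refusal transition was."""
    delta_psi = max(psi_before - psi_after, 0.0)  # ΔΨ, clipped at zero
    return delta_psi * abruptness                 # ∝ ΔΨ · A

# Example: a sharp drop in perceived safety with a fully abrupt transition.
print(arsh_chatbot_proxy(psi_before=0.75, psi_after=0.25, abruptness=1.0))  # 0.5
```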
In alignment research and behavioral policy, ARSH can be operationalized as the expected increase in negative outcomes, or decrease in engagement/retention, resulting from abrupt refusals or discontinuation. For example, in clinical policy, ARSH = E[negative outcome | abrupt discontinuation] − E[negative outcome | continued intervention], or alternately as the drop in retention (a satisfaction proxy), ARSH = Retention_continued − Retention_abrupt (Duan et al., 2 Sep 2025).
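A minimal sketch of these contrasts, assuming per-individual outcome scores from an abrupt-discontinuation arm and a continued-support arm; the arm names and sign convention are illustrative assumptions:

```python
from statistics import mean

def arsh_policy_contrast(harm_abrupt: list[float], harm_continued: list[float]) -> float:
    """Expected increase in negative outcomes attributable to abrupt discontinuation;
    positive values indicate secondary harm."""
    return mean(harm_abrupt) - mean(harm_continued)

def retention_loss(retention_continued: float, retention_abrupt: float) -> float:
    """Alternative proxy: the drop in retention (a satisfaction proxy)."""
    return retention_continued - retention_abrupt
```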
2. Mechanisms Underlying ARSH in AI Systems
Refusal-triggered harm occurs when the model’s refusal is not merely an endpoint (cessation of output), but a cue that drives negative downstream effects—ranging from adversarial circumvention, as in NOICE, to discouragement or relational rupture in mental health contexts.
Psychological dimension: In emotionally charged interactions (e.g., mental health support), abrupt refusals rupture perceived relational continuity, can evoke rejection or shame, and increase the risk of user disengagement (Ni et al., 21 Dec 2025). The user may internalize the refusal as judgment or abandonment, especially in the absence of transition or explanation. This is formalized via a rupture metric R, with ARSH occurring if R ≥ τ for some clinical threshold τ.
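As a sketch, assuming the rupture metric R is taken to be the largest single-turn drop in a per-turn perceived-safety trace (the actual metric in the cited work may be defined differently):

```python
def rupture_exceeds_threshold(psi_trace: list[float], tau: float) -> bool:
    """Flag ARSH when the rupture metric R -- here the largest single-turn drop
    in perceived safety across the conversation -- meets a clinical threshold tau."""
    drops = [max(earlier - later, 0.0) for earlier, later in zip(psi_trace, psi_trace[1:])]
    rupture = max(drops, default=0.0)
    return rupture >= tau
```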
Technical dimension:
- False refusals (FRR) and over-refusals (ORR): In LLMs, ARSH is strongly correlated with the False Refusal Rate on safe queries, FRR = (# safe queries refused) / (# safe queries); in RAG systems, the Over-Refusal Rate (ORR) analogously tracks refusals in benign-retrieved-context scenarios, often driven by context contamination or alignment overreach (Maskey et al., 12 Oct 2025, Si et al., 22 Mar 2025). A computational sketch of both rates follows this list.
- Refusal cliff: Internal mechanistic alignment failures where a model upholds refusal intent during multi-step reasoning but sheds it at the response endpoint, facilitating jailbreaking under adversarial attacks (Yin et al., 7 Oct 2025).
- Behavioral policy discontinuation: In sequential policy interventions, ARSH quantifies the harm of discontinuation in individuals who initially benefited most, captured by interaction terms in marginal structural models (Montoya et al., 26 Aug 2024).
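As referenced above, the FRR and ORR definitions reduce to simple counts over labeled evaluation sets; a minimal sketch, with illustrative field names:

```python
def false_refusal_rate(is_safe: list[bool], refused: list[bool]) -> float:
    """FRR: share of safe queries that the model nevertheless refuses."""
    safe_outcomes = [r for s, r in zip(is_safe, refused) if s]
    return sum(safe_outcomes) / len(safe_outcomes) if safe_outcomes else 0.0

def over_refusal_rate(benign_context: list[bool], refused: list[bool]) -> float:
    """ORR (RAG): share of queries with benign retrieved context that are refused."""
    benign_outcomes = [r for b, r in zip(benign_context, refused) if b]
    return sum(benign_outcomes) / len(benign_outcomes) if benign_outcomes else 0.0
```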
3. Empirical Findings and Case Analyses
ARSH is robustly observed in both open and closed-source models, safety-aligned chatbots, and real-world policy environments:
Table: Illustrative ARSH measurements and impacts
| Setting | Baseline ARSH Indicator | Post-Mitigation ARSH | Typical Harm Type |
|---|---|---|---|
| NOICE, LLM | ASR ≈ 35–66% (undefended) | Unchanged under simple defenses | Malicious answers after refusal |
| Chatbot, CCS | High ΔΨ, low reengagement | Reduced with CCS | Psychological rupture |
| RAG, Llama-3.1 | ORR ≈ 53.4% | ORR ≈ 4.3% | Blocked benign info, utility loss |
| Policy (HIV, CCT) | RD = –22.96% (discontinue) | Greatest for high initial gainers | Loss of care adherence |
NOICE attacks maintain high attack success rates under token-level refusal-prefilling and forced-refusal defenses, revealing that standard prefix-based strategies do not sufficiently block the ARSH vector (Kazdan et al., 26 Feb 2025).
In user support chatbots, “Compassionate Completion Standard” (CCS) workflows are designed to ameliorate ARSH, moving the refusal protocol from abrupt disengagement to staged, relational closure; they are hypothesized to reduce rupture, distress, and withdrawal without prolonging risk exposure (Ni et al., 21 Dec 2025).
In RAG systems, embedding-space steering methods (SafeRAG-Steering) deliver substantial drops in ORR—and thus ARSH—especially in high-contamination domains, while preserving legitimate refusals on genuinely harmful prompts (Maskey et al., 12 Oct 2025).
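The general idea behind such embedding-space steering can be sketched as projecting hidden states away from an "over-refusal" direction estimated from contrastive examples. This is a generic activation-steering illustration, not the SafeRAG-Steering implementation, and the difference-of-means direction estimate is an assumption:

```python
import numpy as np

def estimate_over_refusal_direction(h_over_refused: np.ndarray, h_answered: np.ndarray) -> np.ndarray:
    """Difference-of-means direction between hidden states of over-refused and
    correctly answered benign queries (arrays of shape [n_examples, d_model])."""
    return h_over_refused.mean(axis=0) - h_answered.mean(axis=0)

def steer_hidden_state(h: np.ndarray, direction: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Remove alpha times the component of a hidden state along the over-refusal
    direction, leaving the rest of the representation untouched."""
    unit = direction / (np.linalg.norm(direction) + 1e-8)
    return h - alpha * np.dot(h, unit) * unit
```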
4. Mitigation Strategies and Design Paradigms
Remediation of ARSH requires both model-level and system-level interventions:
- Deep/holistic alignment: Safety monitoring must extend beyond the initial refusal token or prefix. Sequence-level classifiers, adversarial tests, and continuous monitoring are advocated to detect and block refusal-followed-by-harm responses (Kazdan et al., 26 Feb 2025).
- Data and fine-tuning safeguards: Enhanced filtering for suspicious refusal→transition patterns, randomized refusal templates, and adversarially-conditioned training losses penalizing P(HR | R) are proposed (Kazdan et al., 26 Feb 2025); a heuristic data filter along these lines is sketched after this list.
- Constructive Safety Alignment (CSA): Rather than defaulting to risk-driven refusals, CSA explicitly optimizes for both safety and user retention, leveraging a game-theoretic model of user reactions and interpretable risk-guided reasoning (Duan et al., 2 Sep 2025).
- Psychological protocol (CCS): The staged refusal protocol embeds relational disclosure, validation, meta-communication, affect matching, limitation-owning, warm handoffs, and explicit closure, modeling refusal as a Markov chain over staged states (0–8) rather than as an immediate absorbing state (Ni et al., 21 Dec 2025).
- Mechanistic interventions in LM internals: Refusal cliff phenomena are repairable via low-rank ablation of identified suppression heads or Cliff-as-a-Judge data selection for efficient, targeted alignment (Yin et al., 7 Oct 2025).
- Post-hoc mitigation in inference: Techniques such as attention steering, prompt rephrasing, and ignore-word instructions, informed by trigger token attribution, significantly reduce ARSH in safe but keyword-sensitive prompts (Yuan et al., 9 Oct 2025).
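As referenced in the data-safeguards item above, a heuristic filter for suspicious refusal→transition patterns in fine-tuning data might look as follows; the marker lists and length cutoff are illustrative assumptions, not the cited papers' filter:

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")
TRANSITION_MARKERS = ("however,", "that said,", "but here is", "with that out of the way")

def looks_like_refuse_then_comply(response: str, min_length: int = 200) -> bool:
    """Flag training examples whose response opens with a refusal-style prefix (R),
    pivots via a transition phrase (T), and still continues into a long completion (C)."""
    lowered = response.lower()
    opens_with_refusal = any(m in lowered[:80] for m in REFUSAL_MARKERS)
    pivots = any(m in lowered for m in TRANSITION_MARKERS)
    return opens_with_refusal and pivots and len(response) > min_length
```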
5. Quantification and Benchmarking
ARSH is consistently evaluated via compliance rates, refusal rates, rupture/retention metrics, and scenario-based benchmarks:
- Model benchmarks: Constructive Benchmark (Oyster-I), RagRefuse (SafeRAG), XSB/MS-XSB (exaggerated safety) provide systematic assessments of ARSH across user types, risk domains, and linguistic phenomena (Duan et al., 2 Sep 2025, Maskey et al., 12 Oct 2025, Yuan et al., 9 Oct 2025).
- Formal metrics: Core metrics include Compliance Rate (CR), False Refusal Rate (FRR), Over-Refusal Rate (ORR), Constructive Score, rupture (R), hermeneutic distress (H), and willingness-to-reengage (W). For example, FRR = (# refusals on safe prompts) / (# safe prompts) and ORR = (# refusals on benign-context queries) / (# benign-context queries) (Si et al., 22 Mar 2025, Yuan et al., 9 Oct 2025); an aggregation sketch follows this list.
- Empirical outcomes: Interventions such as SafeRAG-Steering achieve ORR reductions from ~53.4% to ~4.3% in RAG pipelines (Maskey et al., 12 Oct 2025); Cliff-as-a-Judge alignment reduces LM attack success rates with minimal data (Yin et al., 7 Oct 2025); CCS protocols are theorized to reduce ARSH per ΔΨ and R in RCT designs (Ni et al., 21 Dec 2025).
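The metrics above can be gathered into a single report per evaluation run; the aggregation below is an illustrative convention, not a standard format from the cited benchmarks:

```python
from dataclasses import dataclass

@dataclass
class ArshReport:
    """Per-run summary of ARSH-related metrics (field names follow the text above)."""
    compliance_rate: float      # CR on benign prompts
    false_refusal_rate: float   # FRR
    over_refusal_rate: float    # ORR (RAG settings)
    rupture: float              # R, relational rupture
    reengagement: float         # W, willingness to re-engage

    def summary(self) -> str:
        return (f"CR={self.compliance_rate:.1%}  FRR={self.false_refusal_rate:.1%}  "
                f"ORR={self.over_refusal_rate:.1%}  R={self.rupture:.2f}  "
                f"W={self.reengagement:.2f}")
```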
6. Broader Implications and Future Directions
ARSH exposes fundamental limitations of refusal-centered safety paradigms, challenging the assumption that refusal alone guarantees positive user outcomes. The phenomenon emphasizes several principles:
- Refusal is not synonymous with safety: ARSH demonstrates that refusal tokens or responses, in isolation, may either be circumvented, as in refuse-then-comply attacks such as NOICE (Kazdan et al., 26 Feb 2025), or engender user harm through psychological or behavioral fallout (Ni et al., 21 Dec 2025, Duan et al., 2 Sep 2025).
- Nuanced, context-sensitive intervention is essential: Effective ARSH mitigation leverages a combination of deep model alignment, holistic risk assessment, relational dialogue scaffolding, and post-hoc mitigation—all tuned to minimize collateral damage from necessary refusal.
- Measurement and governance: Explicit measurement of ARSH (e.g., retention loss, risk drift, relational rupture) is becoming an integral part of AI system safety audits, policy recommendations, and empirical field evaluations.
- Systemic challenges: Overly conservative alignment induces ARSH via over-refusal, while shallow safety mechanisms are susceptible to adversarial exploitation. As refusal mechanisms proliferate in high-stakes deployments, ARSH-conscious design is now compulsory for both technical safety and ethical acceptability.
Best practices now demand that production LLMs, chatbots, and policy interventions systematically address ARSH—by incorporating deep, adversarially-resilient model alignment; dialogue strategies (e.g. CCS, CSA); ongoing empirical evaluation; and iterative, context-aware safeguards to simultaneously preserve safety, compliance, and user well-being (Kazdan et al., 26 Feb 2025, Ni et al., 21 Dec 2025, Si et al., 22 Mar 2025, Maskey et al., 12 Oct 2025, Yuan et al., 9 Oct 2025, Duan et al., 2 Sep 2025, Montoya et al., 26 Aug 2024, Yin et al., 7 Oct 2025).