AI Safety Training Can be Clinically Harmful

Published 25 Apr 2026 in cs.CL, cs.AI, cs.CY, and cs.LG | (2604.23445v1)

Abstract: LLMs are being deployed as mental health support agents at scale, yet only 16% of LLM-based chatbot interventions have undergone rigorous clinical efficacy testing, and simulations reveal psychological deterioration in over one-third of cases. We evaluate four generative models on 250 Prolonged Exposure (PE) therapy scenarios and 146 CBT cognitive restructuring exercises (plus 29 severity-escalated variants), scored by a three-judge LLM panel. All models scored near-perfectly on surface acknowledgment (~0.91-1.00) while therapeutic appropriateness collapsed to 0.22-0.33 at the highest severity for three of four models, with protocol fidelity reaching zero for two. Under CBT severity escalation, one model's task completeness dropped from 92% to 71% while the frontier model's safety-interference score fell from 0.99 to 0.61. We identify a systematic, modality-spanning failure: RLHF safety alignment disrupts the therapeutic mechanism of action by grounding patients during imaginal exposure, offering false reassurance, inserting crisis resources into controlled exercises, and refusing to challenge distorted cognitions mentioning self-harm in PE; and through task abandonment or safety-preamble insertion during CBT cognitive restructuring. These findings motivate a five-axis evaluation framework (protocol fidelity, hallucination risk, behavioral consistency, crisis safety, demographic robustness), mapped onto FDA SaMD and EU AI Act requirements. We argue that no AI mental health system should proceed to deployment without passing multi-axis evaluation across all five dimensions.


Authors (5)

Summary

The paper demonstrates that RLHF safety alignment in mental health chatbots causes a non-linear collapse in therapeutic performance at the highest severity levels.
It evaluates four generative models on six clinical scoring axes and proposes a five-axis framework that maps clinical failure modes to FDA and EU regulatory requirements.
Findings reveal that near-perfect surface-level acknowledgment masks critical therapeutic deficits, underscoring the need for rigorous clinical safety testing.

Clinical Safety Risks Emerge from AI Safety Training in Mental Health Agents

Summary and Motivation

The paper "AI Safety Training Can be Clinically Harmful" (2604.23445) analyzes the failure modes that RLHF (reinforcement learning from human feedback) safety alignment introduces in LLM-based mental health chatbots. Although randomized trials report substantial short-term improvements in depressive and anxiety symptoms, only 16% of LLM-based chatbot interventions have undergone rigorous clinical efficacy testing. The paper shows that tuning generative models for generic safety and helpfulness creates performance deficits on critical clinical dimensions and sometimes precipitates counter-therapeutic outcomes. These effects are most pronounced in high-severity scenarios that require adherence to protocol-specific therapeutic mechanisms, as observed in both Prolonged Exposure (PE) and Cognitive Behavioral Therapy (CBT) modalities.

Evaluation Methods and Empirical Results

The authors evaluate four generative models—Sonnet 4.6 (frontier), Qwen 3.5 122B (large open-weight), Gemini Flash Lite (compact), and GPT-OSS-20B (lightweight)—on 250 PE scenarios spanning four stratified triage severity levels, and 146 CBT cognitive restructuring exercises plus 29 severity-escalated variants. Responses are scored by three distinct LLM judges on six clinical axes: acknowledgment, false reassurance, crisis resource provision, escalation recommendation, therapeutic appropriateness, and protocol fidelity.
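As a rough illustration of how scores from a three-judge panel might be combined per axis, the sketch below averages each of the six axes across judges. The aggregation rule is an assumption (the summary does not say whether the authors use a mean, median, or vote), the identifier spellings are this sketch's own, and the scores are invented for illustration.

```python
from statistics import mean

# The six scoring axes named in the text (identifier spellings are this sketch's own).
AXES = (
    "acknowledgment",
    "false_reassurance",
    "crisis_resource_provision",
    "escalation_recommendation",
    "therapeutic_appropriateness",
    "protocol_fidelity",
)


def aggregate_panel(judge_scores):
    """Average each axis over all judges.

    judge_scores: list of dicts, one per judge, mapping axis name -> score in [0, 1].
    Mean aggregation is an assumption; the source does not state the rule.
    """
    if not judge_scores:
        raise ValueError("at least one judge is required")
    for scores in judge_scores:
        missing = set(AXES) - set(scores)
        if missing:
            raise ValueError(f"judge is missing axes: {sorted(missing)}")
    return {axis: mean(s[axis] for s in judge_scores) for axis in AXES}


# Made-up scores for one model response, for illustration only.
# All axes are assumed to be oriented so that higher is better.
base = dict.fromkeys(AXES, 1.0)
panel = [
    {**base, "therapeutic_appropriateness": 0.25, "protocol_fidelity": 0.0},
    {**base, "therapeutic_appropriateness": 0.50, "protocol_fidelity": 0.0},
    {**base, "therapeutic_appropriateness": 0.00, "protocol_fidelity": 0.0},
]
panel_mean = aggregate_panel(panel)
print(panel_mean["acknowledgment"], panel_mean["therapeutic_appropriateness"])
```

Keeping the axes separate, rather than collapsing them into one number, is what lets a high acknowledgment score sit next to a low protocol-fidelity score in the same report.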

A key empirical finding is the "crisis cliff": a non-linear performance degradation at imminent-risk severity. Surface-level acknowledgment stays near-perfect for every model (roughly 0.91 to 1.00), yet therapeutic appropriateness collapses to 0.22-0.33 for three of the four models and protocol fidelity reaches zero for two. This phenomenon is illustrated below.

Figure 1: Non-linear performance collapse in therapeutic appropriateness and protocol fidelity at highest clinical risk, despite sustained surface acknowledgment across all models.

Radar charts show the extent of performance collapse across all six clinical dimensions from routine to imminent risk.

Figure 2: Multi-dimensional visualization of model collapse at imminent-risk input severity; Sonnet 4.6 retains partial fidelity, Qwen and Gemini fail entirely, GPT-OSS-20B shows broad degradation.
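One way to make the "crisis cliff" described above operational is to compare the score drop at the final severity step with the drops at earlier steps. The sketch below is a minimal heuristic under stated assumptions: the function name, the `min_drop` and `factor` thresholds, and the example scores are all hypothetical and are not taken from the paper.

```python
def detect_cliff(scores, min_drop=0.3, factor=2.0):
    """Return True when the last severity step shows a disproportionate drop.

    scores: mean scores ordered from lowest to highest severity.
    min_drop and factor are illustrative thresholds, not values from the paper.
    """
    if len(scores) < 3:
        raise ValueError("need at least three severity levels")
    # Positive values mean the score fell between consecutive severity levels.
    drops = [a - b for a, b in zip(scores, scores[1:])]
    final_drop = drops[-1]
    largest_earlier = max(drops[:-1])
    return final_drop >= min_drop and final_drop > factor * max(largest_earlier, 0.0)


# Hypothetical per-severity scores (four levels, routine -> imminent risk).
cliff_like = [0.90, 0.85, 0.80, 0.25]
gradual = [0.90, 0.70, 0.50, 0.30]

print(detect_cliff(cliff_like))
print(detect_cliff(gradual))
```

The second series loses the same total amount of score but spreads it evenly across steps, so the heuristic treats it as gradual degradation rather than a cliff.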

The authors further identify distinct RLHF safety alignment-induced failures:

Premature grounding during imaginal exposure (contradicts PE protocol),
Misidentification of trauma memory narratives as real-time emergencies,
Insertion of crisis resources into controlled therapeutic exercises,
Refusal or abandonment of core therapeutic tasks, especially in CBT settings.

Sonnet 4.6, while more robust, still exhibits safety-interference failures, especially under CBT severity escalation, where the frontier model's safety-interference score falls from 0.99 to 0.61. GPT-OSS-20B collapses on completeness and accuracy under these challenging conditions.
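To put the severity-escalation figures from the abstract on a common footing, the short calculation below expresses each drop in absolute and relative terms. The two input pairs (task completeness 92% to 71%, safety-interference score 0.99 to 0.61) come from the abstract; the helper name is this sketch's own, and the abstract does not say which model the completeness figure belongs to.

```python
def drop_summary(before, after):
    """Return (absolute drop, relative drop) for a score that fell from before to after."""
    if before <= 0:
        raise ValueError("before must be positive")
    absolute = before - after
    return absolute, absolute / before


# Figures quoted in the abstract for CBT severity escalation.
completeness = drop_summary(0.92, 0.71)
interference = drop_summary(0.99, 0.61)

for name, (absolute, relative) in [
    ("task completeness", completeness),
    ("safety-interference score", interference),
]:
    print(f"{name}: -{absolute:.2f} absolute, -{relative:.1%} relative")
```

In relative terms the completeness drop is about 22.8% and the safety-interference drop about 38.4%.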

Five-Axis Evaluation Framework

To address clinical deployment gaps, the paper proposes a rigorous five-axis evaluation framework mapping each clinical failure mode to a measurable axis:

Protocol Fidelity: explicit phase adherence,
Hallucination Risk: clinical claim accuracy,
Behavioral Consistency: cross-turn stability,
Crisis Safety: robust response to risk-adjacent prompts,
Demographic Robustness: performance parity across subpopulations.

This protocol-agnostic structure enables cross-modality generalization: the same axes can be instantiated for PE, CBT, Dialectical Behavior Therapy (DBT), and Motivational Interviewing (MI), with grounding in authoritative clinical references (e.g., DSM-5).
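The five axes above, combined with the paper's position that a system must pass all of them before deployment, can be sketched as a simple all-or-nothing gate. Everything concrete here is assumed for illustration: the `AxisResult` type, the normalization of scores to [0, 1], and the threshold values are hypothetical, since the summary does not give pass marks.

```python
from dataclasses import dataclass

FIVE_AXES = (
    "protocol_fidelity",
    "hallucination_risk",
    "behavioral_consistency",
    "crisis_safety",
    "demographic_robustness",
)


@dataclass(frozen=True)
class AxisResult:
    score: float      # assumed normalized to [0, 1], higher is better
    threshold: float  # hypothetical pass mark; the source gives none


def deployment_gate(results):
    """Return (passed, failing_axes). Every axis must be present and pass."""
    missing = [a for a in FIVE_AXES if a not in results]
    if missing:
        raise ValueError(f"missing axis results: {missing}")
    failing = [a for a in FIVE_AXES if results[a].score < results[a].threshold]
    return (not failing, failing)


# Invented numbers; "hallucination_risk" is stored as 1 - risk so higher is better.
report = {
    "protocol_fidelity": AxisResult(0.40, 0.80),
    "hallucination_risk": AxisResult(0.95, 0.90),
    "behavioral_consistency": AxisResult(0.88, 0.85),
    "crisis_safety": AxisResult(0.97, 0.95),
    "demographic_robustness": AxisResult(0.91, 0.90),
}
passed, failing = deployment_gate(report)
print(passed, failing)
```

A conjunctive gate is deliberate in this sketch: averaging the axes would let strong crisis-safety or consistency scores hide a failing protocol-fidelity score, which is the masking effect the paper warns about.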

Regulatory Alignment

Each axis is mapped directly to FDA Software as a Medical Device (SaMD) and EU AI Act requirements:

Fidelity → clinical validation,
Hallucination → accuracy and transparency,
Consistency → reproducibility,
Safety → risk management and oversight,
Robustness → bias and parity testing.
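The mapping above is small enough to hold in a lookup table. The sketch below transcribes it and adds a hypothetical helper, `evidence_gaps`, that lists which requirement areas would lack supporting evidence if a given set of axes failed. The short axis keys follow the list above; the helper and its use are assumptions, not part of the paper's pipeline.

```python
# Axis -> requirement area, transcribed from the list above.
REGULATORY_MAP = {
    "fidelity": "clinical validation",
    "hallucination": "accuracy and transparency",
    "consistency": "reproducibility",
    "safety": "risk management and oversight",
    "robustness": "bias and parity testing",
}


def evidence_gaps(failing_axes):
    """Return the requirement areas left without supporting evidence."""
    unknown = [a for a in failing_axes if a not in REGULATORY_MAP]
    if unknown:
        raise KeyError(f"unknown axes: {unknown}")
    return [REGULATORY_MAP[a] for a in failing_axes]


gaps = evidence_gaps(["fidelity", "safety"])
print(gaps)
```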

The proposed evaluation pipeline thus constitutes the evidence base for regulatory compliance, addressing both clinical and legal risk dimensions. The analysis highlights that 95% of SaMD-likely mental health apps lack regulatory authorization; rigorous evaluation remains absent across nearly all deployed systems.

Methodological and Ethical Implications

The study underscores that commonly used evaluation metrics (BLEU, empathy ratings, user satisfaction) are insensitive to clinical correctness, and surface-level fluency masks dangerous protocol violations. Multi-judge panels are required due to divergent evaluation tendencies even among advanced LLM judges; clinical calibration is necessary for robust judgment.
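Because judges can diverge, one simple safeguard is to measure per-axis disagreement and route high-disagreement items to clinician calibration. The sketch below uses the spread (maximum minus minimum) across judges. The spread metric, the `max_spread` cutoff, and the scores are assumptions for illustration; the summary does not describe how the authors quantify divergence.

```python
def disagreement(judge_scores, axis):
    """Spread (max - min) of one axis across judges."""
    values = [scores[axis] for scores in judge_scores]
    return max(values) - min(values)


def needs_calibration(judge_scores, axes, max_spread=0.25):
    """Return the axes whose judge spread exceeds max_spread (a hypothetical cutoff)."""
    return [axis for axis in axes if disagreement(judge_scores, axis) > max_spread]


# Invented scores from three judges for a single response.
judges = [
    {"acknowledgment": 1.0, "protocol_fidelity": 0.0},
    {"acknowledgment": 1.0, "protocol_fidelity": 0.5},
    {"acknowledgment": 0.75, "protocol_fidelity": 1.0},
]
flagged = needs_calibration(judges, ["acknowledgment", "protocol_fidelity"])
print(flagged)
```

Here the judges roughly agree on acknowledgment but split completely on protocol fidelity, so only the latter is flagged.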

From an ethical perspective, false authority, vulnerability asymmetry, unlicensed clinical decisions, deterioration risk, and dependency formation constitute substantive deployment hazards. Synthetic, clinically validated evaluation infrastructure (e.g., Thousand Voices of Trauma, TIDE) is recommended as a precondition to human trials, not a substitute.

Limitations and Future Research

The framework is empirically demonstrated for PE and CBT, but coverage of additional evidence-based modalities is still needed. Axis 2 (hallucination risk) requires domain-specific extension for therapeutic dialogues. Human expert validation and integration of patient and user perspectives are necessary for clinical adoption. The computational and cost burden for resource-constrained deployments must also be quantified.

Key future directions include: open-source multi-axis benchmark suites, clinical validation correlating benchmark performance with real-world outcomes, longitudinal outcome tracking, cross-lingual/cultural safety protocols, and adaptive evaluation/red-teaming for evolving model architectures.

Conclusion

This paper demonstrates that RLHF safety alignment, while achieving high surface-level warmth, generates systematic counter-therapeutic failures that compromise clinical safety in mental health LLM deployment. A five-axis framework aligned with regulatory standards is necessary to evaluate and ensure therapeutic fidelity, factual accuracy, session stability, crisis robustness, and demographic fairness. Deploying a system that has not passed all five axes constitutes both a clinical and an ethical failure.

No mental health chatbot or AI system should be deployed to real users without demonstrating rigorous competency across all axes delineated here. The regulatory, clinical, and practical implications require a shift toward multi-dimensional safety verification as standard practice for AI-enabled mental health agents.
