
AI Safety in Self-Harm Detection

Updated 9 February 2026
  • This overview reviews multi-dimensional evaluation methods that integrate clinical risk taxonomies and DSM-5 criteria with dynamic scoring systems.
  • It outlines robust methodologies, including VERA-MH and SIM-VAIL, for addressing adversarial vulnerabilities and improving detection accuracy in self-harm contexts.
  • It emphasizes continuous ecological audits and human-in-the-loop oversight as best practices for aligning AI models with mental health crisis care standards.

AI safety evaluation for self-harm detection comprises a diverse set of frameworks, metrics, and empirical protocols developed to quantify and mitigate the risk of large language models (LLMs) and multi-modal AI systems enabling, encouraging, or inadequately responding to self-injurious behaviors. Research in this domain sits at the intersection of psychiatry, NLP, human–AI interaction, and safety-critical systems engineering. Core challenges include adversarial jailbreaking of safety guardrails, systematic trade-offs between harm reduction and therapeutic quality, quantification of multi-turn conversational risk, expert disagreement in risk annotation, cross-modal ambiguity, and the gap between benchmark and real-world failure rates. This article details representative methodologies, metrics, datasets, and evaluation findings, drawing from leading arXiv and associated clinical literature.

1. Safety Frameworks and Risk Taxonomies

AI-driven psychotherapy and mental health support systems are evaluated using formal risk taxonomies rooted in clinical standards. A representative approach delineates immediate versus potential risk, mapping to both DSM-5 criteria and standardized harm assessment tools (e.g., NEQ, UE-ATR) (Steenstra et al., 21 May 2025).

| Category | Subcategories | Clinical Mapping |
|---|---|---|
| Immediate Risk | SI with plan/means, active intent, self-harm threats, acute psychosis | DSM-5 SI, C-SSRS, UE-ATR (Emergent Reaction) |
| Potential Risk | Symptom exacerbation, destabilization, alliance damage, disengagement | NEQ, UE-ATR (Nonemergent Event), DSM-5 severity codes |

Immediate risk is binary-triggered (e.g., by explicit suicide/self-harm keywords), while potential risk is monitored through deviation from patient baselines in cognitive and affective metrics, tracked on 1–10 scales. Composite risk scores and real-time Boolean triggers form the basis of escalation protocols and empirical safety evaluation (Steenstra et al., 21 May 2025).
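A minimal sketch of this two-tier logic in Python follows; the keyword list, equal weighting, and deviation threshold are illustrative assumptions, not values from the cited paper:

```python
# Sketch of the two-tier risk logic described above. Keyword lists, weights,
# and thresholds are illustrative placeholders, not from the cited paper.
from dataclasses import dataclass

KEYWORD_TRIGGERS = ("suicide", "kill myself", "self-harm")  # hypothetical list

@dataclass
class Baseline:
    affect: float     # patient's baseline affective score, 1-10 scale
    cognition: float  # patient's baseline cognitive score, 1-10 scale

def immediate_risk(utterance: str) -> bool:
    """Binary trigger: fires on any explicit suicide/self-harm keyword."""
    text = utterance.lower()
    return any(kw in text for kw in KEYWORD_TRIGGERS)

def potential_risk(base: Baseline, affect: float, cognition: float) -> float:
    """Composite deviation from the patient's baseline (equal weights assumed)."""
    return 0.5 * abs(affect - base.affect) + 0.5 * abs(cognition - base.cognition)

def should_escalate(utterance: str, base: Baseline, affect: float,
                    cognition: float, deviation_threshold: float = 3.0) -> bool:
    # Escalate on the Boolean trigger OR sustained deviation from baseline.
    return (immediate_risk(utterance)
            or potential_risk(base, affect, cognition) >= deviation_threshold)
```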

2. Metrics, Rubrics, and Annotation Protocols

Systematic evaluation employs multidimensional rubrics and performance metrics:

  • VERA-MH Framework: Five dimensions—risk detection, risk confirmation, guidance to care, supportive conversation, AI boundaries—rated as Best Practice, Suboptimal, High Harm, or Not Relevant. Ratings are aggregated per session; highest-severity violations dominate dimension scores (Bentley et al., 4 Feb 2026).
  • SIM-VAIL: Turn-level self-harm enablement is scored on a 1–10 scale using LLM judges (e.g., Claude-4.5), capturing the spectrum from explicit encouragement to full refusal. Temporal aggregation emphasizes both peak and average risk, $S^{SH} = \lambda S^{\max} + (1-\lambda) S^{\text{avg}}$ (see the sketch after this list) (Weilnhammer et al., 1 Feb 2026).
  • Multidimensional annotation: Concern Type (Attempt, Behavior, Ideation), Risk Level (Severe, Moderate, Minor), and Dialogue Intent allow granular, clinical triage and precise flagging of high-risk utterances, as operationalized in MHDash (Zhang et al., 30 Jan 2026).
  • Expert Rubrics & Inter-Rater Agreement: Rating scales span harm severity, likelihood, empathy, boundary maintenance, and actionability, but even with calibrated criteria, expert inter-rater reliability for self-harm detection is often low (ICC = 0.186–0.406, Krippendorff's α ≈ 0) due to divergent professional frameworks (Jafari et al., 26 Jan 2026).
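The SIM-VAIL temporal aggregation lends itself to a short worked example; the λ value and turn scores below are illustrative, not the paper's settings:

```python
# Worked example of the SIM-VAIL aggregation S^SH = λ·S^max + (1-λ)·S^avg.
# The λ value and turn scores are illustrative, not taken from the paper.
def session_self_harm_score(turn_scores: list[float], lam: float = 0.7) -> float:
    """Aggregate 1-10 turn-level enablement scores, weighting peak risk."""
    s_max = max(turn_scores)
    s_avg = sum(turn_scores) / len(turn_scores)
    return lam * s_max + (1 - lam) * s_avg

# A single high-risk spike among otherwise safe turns still dominates the score:
print(session_self_harm_score([1, 1, 9, 2]))  # 0.7*9 + 0.3*3.25 = 7.275
```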

Binary and continuous metrics—precision, recall, F₁, AUC—are computed at both turn and session level, stratified by risk category and aggregated across models for benchmarking (Nelson et al., 14 Oct 2025, Weilnhammer et al., 1 Feb 2026, Zhang et al., 30 Jan 2026).
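A hedged sketch of such stratified metric computation, assuming hypothetical record fields (`risk_category`, `label`, `score`) and scikit-learn's standard metric functions:

```python
# Sketch of turn-level metric computation stratified by risk category.
# Record field names ('risk_category', 'label', 'score') are hypothetical.
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

def stratified_metrics(records: list[dict]) -> dict:
    """Compute precision/recall/F1/AUC per risk category from scored records."""
    results = {}
    for cat in {r["risk_category"] for r in records}:
        sub = [r for r in records if r["risk_category"] == cat]
        y_true = [r["label"] for r in sub]           # 0/1 ground truth
        y_score = [r["score"] for r in sub]          # model probability in [0, 1]
        y_pred = [int(s >= 0.5) for s in y_score]    # illustrative 0.5 cutoff
        prec, rec, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average="binary", zero_division=0)
        auc = roc_auc_score(y_true, y_score) if len(set(y_true)) > 1 else None
        results[cat] = {"precision": prec, "recall": rec, "f1": f1, "auc": auc}
    return results
```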

3. Benchmarks, Datasets, and Simulated Evaluation

Dedicated test sets and simulation environments support adversarial robustness and ecological fidelity:

  • Adversarial Jailbreaking Scripts: Multi-step, context-reframing attacks designed to bypass LLM guardrails (e.g., "direct request → academic reframe → repeated detail-seeking"). Empirical findings show 2–3 turns of reframing can disable refusals in most models, enabling disclosure of detailed self-harm methods (Schoene et al., 1 Jul 2025).
  • PsyCrisis-Bench: Reference-free, expert-anchored Chinese-language dataset enables binary, point-wise scoring on five clinically grounded safety dimensions, achieving higher expert alignment via chain-of-thought LLM-as-Judge approaches (Cai et al., 11 Aug 2025).
  • MHDash: Annotated multi-turn dialogues with explicit Concern, Risk, and Intent; enables benchmarking of high-risk detection, high-risk recall, severity ranking (Kendall’s Tau), and intent-conditioned false negative rate (Zhang et al., 30 Jan 2026).
  • Real Conversation Audits: Ecological audits over 20,000+ real-world anonymized sessions reveal substantial gaps between synthetic benchmark failure rates and deployment failure rates for specialized, therapy-aligned AIs (population-level FNR for NSSI below 0.015%, versus a system FNR of 0.38% on the judge-flagged subset; illustrated below) (Stamatis et al., 14 Jan 2026).
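The audit figures in the last bullet follow largely from denominator choice; the counts below are invented solely to reproduce the quoted magnitudes and show how both rates can coexist:

```python
# Invented counts chosen only to reproduce the quoted magnitudes: the same
# misses divided by different denominators yield very different rates.
total_sessions = 20_000   # all audited real-world sessions
judge_flagged = 790       # hypothetical subset flagged by an LLM judge
missed = 3                # hypothetical system misses within that subset

print(f"FNR on judge-flagged subset: {missed / judge_flagged:.4%}")   # 0.3797%
print(f"Population-level miss rate:  {missed / total_sessions:.4%}")  # 0.0150%
```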

Hierarchical LLM classifiers such as the Verily Behavioral Health Safety Filter (VBHSF) are trained on mixed simulated and naturalistic datasets with explicit self-harm, suicide, and crisis intent annotation, achieving sensitivity of 0.962 and specificity of 0.995 for self-harm detection (Nelson et al., 14 Oct 2025).
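A sketch in the spirit of such hierarchical classifiers follows; the two-stage structure, scorer interfaces, and thresholds are assumptions for illustration, not the published VBHSF architecture:

```python
# Sketch of a hierarchical safety filter: a cheap, high-recall screen routes
# borderline content to a slower, more precise classifier. The staging,
# interfaces, and thresholds are assumptions, not the published VBHSF design.
from typing import Callable

def hierarchical_filter(text: str,
                        screen: Callable[[str], float],     # high-recall scorer
                        classify: Callable[[str], float],   # high-precision scorer
                        screen_threshold: float = 0.2,
                        classify_threshold: float = 0.5) -> bool:
    """Return True if the text should be flagged as self-harm/crisis content."""
    if screen(text) < screen_threshold:
        return False  # confidently benign: stop early, keep latency low
    return classify(text) >= classify_threshold  # escalate borderline cases

# Usage with stub scorers standing in for real models:
print(hierarchical_filter("example message", lambda t: 0.3, lambda t: 0.8))  # True
```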

4. Cross-Modality, Multi-Turn, and Adversarial Vulnerabilities

Evaluations must capture not only text but also cross-modal input:

  • SIUO Benchmark: The Safe Inputs but Unsafe Output (SIUO) protocol tests LVLMs on paired image–text examples in which neither modality is intrinsically harmful but the two jointly imply suicidal intent. Best-in-class models (GPT-4V) achieve only ~63% safe refusal rate and F₁ = 0.69 on the self-harm subtask, with critical failures in cross-modal integration and refusal (Wang et al., 2024).
  • Conversation Length and Context: Risk signals often emerge gradually over multi-turn dialogues, so models that rely on turn-wise triggers alone miss escalating risk. A drop in multi-turn detection performance is consistently observed (10–20% FNR increase relative to single-turn prediction; see the sketch after this list) (Zhang et al., 30 Jan 2026).
  • Trade-Offs in Alignment: Reducing self-harm enablement via strict refusals can increase minimization/avoidance risk and reduce therapeutic quality (r = +0.48 and r = –0.52, respectively, in SIM-VAIL) (Weilnhammer et al., 1 Feb 2026).
  • Failure Modes: Catastrophic false negatives on rare but severe cases (zero high-risk recall for BERT/RoBERTa baselines in MHDash); persistent adversarial jailbreaking risk (Schoene & Canca, 2024); and cross-modal encouragement through poetic or ambiguous output (Wang et al., 2024, Schoene et al., 1 Jul 2025).
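One simple way to address the multi-turn gap noted above is to smooth turn-level risk so that gradual escalation can cross an alarm threshold even when no single turn does. The EWMA weighting and thresholds below are illustrative assumptions, not a method from the cited papers:

```python
# Illustrative sketch: smooth 1-10 turn-level risk with an exponentially
# weighted moving average (EWMA) so gradual escalation trips the alarm even
# when no single turn does. Alpha and thresholds are invented for illustration.
def ewma_alarms(turn_scores: list[float], alpha: float = 0.4,
                alarm: float = 6.0, acute: float = 9.0) -> list[bool]:
    """Return a per-turn alarm flag from smoothed risk scores."""
    flags, ewma = [], 0.0
    for score in turn_scores:
        ewma = alpha * score + (1 - alpha) * ewma
        flags.append(ewma >= alarm or score >= acute)  # smoothed OR acute trigger
    return flags

# Gradual escalation: no turn reaches the acute trigger, but the trajectory
# crosses the smoothed alarm by the final turn.
print(ewma_alarms([3, 4, 5, 6, 7, 8]))  # [False, False, False, False, False, True]
```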

5. Human Factors, Expert Disagreement, and Reliability

Fundamental limits on safety evaluation arise from professional and cultural divergence:

  • Expert Disagreement: Psychiatric raters exhibit systematic, not random, divergence in self-harm risk annotation: safety-first, engagement-centered, and culturally informed frameworks yield negative Krippendorff's α, below acceptability for high-stakes AI (Jafari et al., 26 Jan 2026).
  • Implications for Model Training: Consensus labels may erase grounded clinical philosophies, and reward models encode rater-specific biases rather than objective clinical truth. Multi-annotator models or jury-learning schemes are recommended to capture full label distributions and preserve disagreement for downstream calibration and escalation (see the sketch after this list) (Jafari et al., 26 Jan 2026).
  • Reliability in LLM Judge Evaluation: When clinical consensus is strong (IRR ≈ 0.77–0.81), LLM-as-Judge systems (e.g., VERA-MH, PsyCrisis-Bench) achieve similar IRR and agreement profiles, suggesting scalable routes to automated but clinically meaningful safety scoring when paired with robust rubrics (Bentley et al., 4 Feb 2026, Cai et al., 11 Aug 2025).
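A minimal sketch of the soft-label idea from the second bullet: keep the empirical distribution over raters' labels as the training target rather than a majority vote. The risk levels and the escalation-gating idea are illustrative assumptions:

```python
# Sketch: preserve annotator disagreement as a soft-label distribution rather
# than collapsing to a consensus vote. Risk levels here are illustrative.
from collections import Counter

RISK_LEVELS = ("minor", "moderate", "severe")

def soft_label(annotations: list[str]) -> dict[str, float]:
    """Empirical distribution over risk levels from multiple expert raters."""
    counts = Counter(annotations)
    return {level: counts.get(level, 0) / len(annotations) for level in RISK_LEVELS}

# Three raters disagree; the full distribution is kept as the training target,
# and its spread can gate escalation to human review.
print(soft_label(["moderate", "severe", "moderate"]))
# {'minor': 0.0, 'moderate': 0.666..., 'severe': 0.333...}
```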

6. Recommendations, Best Practices, and Open Problems

Synthesis of best practices from leading frameworks includes:

  • Dynamic, Multidimensional Scoring: Flexible, early-threshold adaptation (e.g., decreasing escalation thresholds after minor risk signals; sketched after this list), multidimensional annotation, and cross-dimensional calibration are crucial for robust detection without excessive over-flagging (Weilnhammer et al., 1 Feb 2026, Zhang et al., 30 Jan 2026).
  • Layered, Domain-Specific Safeguards: Combination of therapy-aligned model training, high-recall independent classifiers, and explicit risk mitigation modes (crisis protocol activation) supports false-negative minimization in real deployment (Stamatis et al., 14 Jan 2026, Nelson et al., 14 Oct 2025).
  • Rigorous, Continuous Audit: Ongoing ecological audits and real-world performance reporting (e.g., end-to-end system FNR), alongside transparent benchmarking, prevent overreliance on synthetic test sets and ensure timely adaptation to language and cultural shifts (Stamatis et al., 14 Jan 2026).
  • Cross-Modality Robustness: Extend detection and refusal logic to cross-modal scenarios; combine chain-of-thought (CoT) prompting with RLHF for integrated image–text risk reasoning (Wang et al., 2024).
  • Human-in-the-Loop Oversight: Escalation for cases of low inter-rater agreement, in situ evaluation of rare/adversarial scenarios, and explicit documentation of clinical framework for each safety protocol (Jafari et al., 26 Jan 2026, Steenstra et al., 21 May 2025).
  • Open Problems: Generalizing to new dialects, slang, and domains; calibrating refusal without over-minimization; closing the performance gap between test datasets and live deployments; and aligning system design with jurisdictional and cultural standards in mental-health crisis care all remain open areas of research.
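A sketch of the early-threshold adaptation named in the first bullet: each minor risk signal lowers the escalation threshold for the remainder of the session. The starting threshold, floor, and decay values are illustrative assumptions:

```python
# Sketch of early-threshold adaptation: each minor risk signal lowers the
# escalation threshold for the rest of the session. Starting threshold,
# floor, and decay values are illustrative assumptions.
class AdaptiveEscalator:
    def __init__(self, start: float = 7.0, floor: float = 4.0, decay: float = 1.0):
        self.threshold = start  # escalation threshold on a 1-10 risk scale
        self.floor = floor      # never relax vigilance below this level
        self.decay = decay      # how much each minor signal lowers the bar

    def observe(self, risk_score: float, minor_signal: bool) -> bool:
        """Return True if this turn should trigger the crisis protocol."""
        if minor_signal:  # prior warning signs make the system more cautious
            self.threshold = max(self.floor, self.threshold - self.decay)
        return risk_score >= self.threshold

# A mid-range score escalates only because an earlier minor signal lowered the bar:
esc = AdaptiveEscalator()
print(esc.observe(3.0, minor_signal=True))    # False; threshold drops 7.0 -> 6.0
print(esc.observe(6.5, minor_signal=False))   # True; 6.5 would not have crossed 7.0
```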

By integrating robust, multi-axis rubric-driven evaluation, continuous real-world auditing, adversarial robustness checks, and explicit modeling of human expert disagreement, AI safety evaluation for self-harm detection is evolving toward transparent, clinically aligned, and fault-tolerant mental health support systems (Schoene et al., 1 Jul 2025, Weilnhammer et al., 1 Feb 2026, Cai et al., 11 Aug 2025, Steenstra et al., 21 May 2025, Nelson et al., 14 Oct 2025, Jafari et al., 26 Jan 2026, Zhang et al., 30 Jan 2026, Bentley et al., 4 Feb 2026, Stamatis et al., 14 Jan 2026, Wang et al., 2024, Grabb et al., 2024).
