- The paper introduces a novel one-shot psychiatric triage benchmark using 112 synthetic, clinically validated vignettes.
- It finds that chatbots achieve 94.3% emergency recognition with a 5.6% under-triage rate, but consistently over-triage intermediate cases.
- The study highlights a risk-averse bias in AI models that favors safety at the cost of efficiency, underlining the need for refined calibration.
One-Shot Emergency Psychiatric Triage by Frontier AI Chatbots: A Systematic Audit
Introduction and Motivation
LLM chatbots are now frequently consulted for health advice, raising questions regarding their safety and calibration in sensitive domains such as psychiatric triage. Unlike somatic medicine, urgent psychiatric needs must often be inferred from narrative descriptions of mental state, context, and behavior without the benefit of objective measurements. Precise differentiation between true emergencies and less acute situations is crucial, given the potential for both catastrophic under-triage and inefficient or trust-eroding over-triage. The study "One-shot emergency psychiatric triage across 15 frontier AI chatbots" (2604.25415) presents a systematic evaluation of 15 leading AI chatbots, analyzing their emergency recognition, error patterns, and calibration using a novel one-shot psychiatric triage benchmark.
Methods and Experimental Design
The benchmark comprises 112 synthetic but clinically validated psychiatric vignettes, spanning 28 presentation-by-risk clusters drawn from 9 psychiatric syndromes and 9 focal risk dimensions. Each vignette is mapped to a ground truth urgency label: A (routine), B (within 1 week), C (within 24–48 hours), or D (emergency). Vignettes are rendered as single-message textual disclosures, simulating realistic initial user queries rather than protracted multi-turn dialogues. The primary outcome metric is the under-triage rate for emergencies (level D cases assigned to a lower level); secondary outcomes include overall and per-level accuracy, signed and absolute ordinal error (measuring directional bias and dispersion, respectively), and robustness to prompt context.
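The outcome metrics above can be made concrete with a minimal sketch. It assumes the four levels are encoded ordinally as A=0 through D=3 and that signed error is predicted minus true level, so positive values mean over-triage; the function name `triage_metrics` and the toy labels are illustrative and not taken from the paper.

```python
# Ordinal encoding of the four urgency levels (an assumption for this sketch).
LEVELS = {"A": 0, "B": 1, "C": 2, "D": 3}


def triage_metrics(truth, predicted):
    """Compute accuracy, emergency under-triage rate, and ordinal errors.

    truth and predicted are equal-length sequences of "A".."D" labels.
    Signed error is predicted minus true level: > 0 is over-triage.
    """
    if not truth or len(truth) != len(predicted):
        raise ValueError("truth and predicted must be non-empty and equal length")

    diffs = [LEVELS[p] - LEVELS[t] for t, p in zip(truth, predicted)]
    # Restrict to true emergencies for the primary outcome.
    emergency_diffs = [d for t, d in zip(truth, diffs) if t == "D"]

    return {
        "accuracy": sum(d == 0 for d in diffs) / len(diffs),
        "emergency_under_triage": (
            sum(d < 0 for d in emergency_diffs) / len(emergency_diffs)
            if emergency_diffs
            else None
        ),
        "mean_signed_error": sum(diffs) / len(diffs),
        "mean_absolute_error": sum(abs(d) for d in diffs) / len(diffs),
    }


# Toy example with five made-up vignettes (not study data).
example = triage_metrics(
    truth=["A", "B", "C", "D", "D"],
    predicted=["A", "C", "C", "D", "C"],
)
```

In this toy example one over-triage and one under-triage cancel, so the mean signed error is zero while the mean absolute error is not, which is why the two measures are reported separately.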
Target models include cutting-edge chatbots from 8 organizations, accessed via unified APIs with directive prompts specifying mental health triage and a control condition lacking clinical priming. Human clinician consensus labels from 50 doctors provided a comparative standard and a means to quantify inherent ambiguity in triage assignment.
Results
Emergency Recognition and Under-Triage
Frontier AI chatbots demonstrated strong emergency recognition. Across 1663 model-vignette interactions, under-triage of level D emergencies occurred in only 5.6% of cases. Crucially, every under-triaged emergency was assigned to level C (within 24–48 hours), never to a lower urgency tier. Some models, including Claude Sonnet 4.6, DeepSeek-V3.2, and several OpenAI and Google models, exhibited zero under-triage. Performance held when clinician consensus was used as ground truth (under-triage 8.3%) and was robust to system prompt framing.
Error Structure: Over-Triage Dominates
Model errors skewed heavily toward over-triage rather than under-triage (mean signed ordinal error +0.47), especially at the intermediate urgency boundaries (levels B and C). Level D cases were recognized with 94.3% accuracy, whereas level B accuracy was markedly lower (19.7%), with 80.3% of level B cases over-triaged. This produced a characteristic U-shaped calibration profile, with accuracy dipping at the intermediate levels. Among errors, 94.8% were off by a single level, reflecting near-miss, risk-averse behavior.
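A per-level breakdown of this kind, along with the share of errors that miss by exactly one level, could be computed roughly as follows. This is a sketch under the same ordinal-encoding assumption (A < B < C < D); the helper names `per_level_breakdown` and `adjacent_error_share` and the toy labels are hypothetical.

```python
from collections import Counter

ORDER = "ABCD"  # assumed ordinal ordering of urgency levels


def per_level_breakdown(truth, predicted):
    """For each true level, return the fraction under-, exactly, and over-triaged."""
    counts = {level: Counter() for level in ORDER}
    for t, p in zip(truth, predicted):
        diff = ORDER.index(p) - ORDER.index(t)
        outcome = "over" if diff > 0 else "under" if diff < 0 else "exact"
        counts[t][outcome] += 1

    table = {}
    for level, c in counts.items():
        n = sum(c.values())
        if n:  # skip levels with no cases
            table[level] = {k: c[k] / n for k in ("under", "exact", "over")}
    return table


def adjacent_error_share(truth, predicted):
    """Fraction of errors that are off by exactly one level (None if no errors)."""
    gaps = [abs(ORDER.index(p) - ORDER.index(t)) for t, p in zip(truth, predicted)]
    errors = [g for g in gaps if g > 0]
    return sum(g == 1 for g in errors) / len(errors) if errors else None


# Toy labels (not study data): most level B cases are pushed up to C.
truth = ["B", "B", "B", "B", "D", "A"]
predicted = ["C", "C", "C", "B", "D", "C"]
breakdown = per_level_breakdown(truth, predicted)
near_miss = adjacent_error_share(truth, predicted)
```

Reading accuracy per true level from such a table is what exposes the dip at intermediate levels that an overall accuracy figure would hide.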
Calibration, Ambiguity, and Human-Machine Comparison
Calibration imprecision clustered around intermediate risk cases, paralleling areas of highest clinician label entropy and inter-rater disagreement. Over-triage rates were significantly associated with higher case ambiguity (odds ratio per 1-bit entropy increase: 3.39). Models were directionally safer at the cost of substantial efficiency loss at low and moderate acuity, mirroring a trend toward upward bias in clinician relabeling as well.
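To illustrate what "label entropy" and an odds ratio "per 1-bit increase" mean here, the sketch below computes the Shannon entropy (in bits) of a set of clinician labels for one vignette. It also shows how, under the assumption of a logistic model with entropy as a linear predictor, a per-bit odds ratio scales with the entropy difference. The function names are hypothetical, and the paper's exact estimation procedure is not reproduced.

```python
import math
from collections import Counter


def label_entropy_bits(labels):
    """Shannon entropy, in bits, of the empirical label distribution."""
    if not labels:
        raise ValueError("labels must be non-empty")
    n = len(labels)
    # p * log2(1/p) summed over observed labels; 0 bits means unanimity.
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def odds_multiplier(delta_bits, odds_ratio_per_bit=3.39):
    """Multiplicative change in odds for an entropy difference of delta_bits.

    Assumes a logistic model that is linear in entropy, so odds ratios
    compound multiplicatively; 3.39 is the per-bit value quoted in the text.
    """
    return odds_ratio_per_bit ** delta_bits


unanimous = label_entropy_bits(["D"] * 10)             # full agreement
split = label_entropy_bits(["B"] * 5 + ["C"] * 5)      # even two-way split
```

With four urgency levels the entropy is bounded by 2 bits, reached only when clinicians split evenly across all four labels.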
Overall accuracy (exact match to benchmark labels) ranged from 42.0% to 71.8% across chatbots. The most accurate models (e.g., Grok-4, Gemini 3.1 Pro) were not always those with the lowest under-triage rates, highlighting a sensitivity-specificity tradeoff across models.
Prompt Robustness and Model Vintage
The observed over-triage pattern was robust to system prompt manipulation: removing explicit triage priming yielded a virtually identical safety-oriented disposition. Model release date showed little relation to overall accuracy or directional bias, though there was a (non-significant) trend toward lower emergency under-triage in newer releases.
Theoretical and Practical Implications
These findings indicate that, given explicit, information-rich input, SOTA LLMs are highly sensitive to signals of psychiatric emergency and rarely downplay critical risk. However, current models lack discriminative calibration between intermediate urgency thresholds, defaulting toward risk aversion when faced with ambiguity. This is plausibly a consequence of both training data skew (LLMs may be exposed predominantly to high-risk narratives) and risk-sensitive post-training (e.g., RLHF and safety fine-tuning for liability minimization).
While risk aversion protects against catastrophic under-triage, it bears systemic costs. Chronic over-triage, even if “safer,” may drive inefficiency and unnecessary healthcare utilization, and may erode user trust by rendering chatbot advice less actionable or credible for non-urgent presentations. This risk-averse pattern aligns with observations of “over-refusal” and exaggerated caution as a general LLM safety tactic.
Importantly, the study isolates the LLM’s ability to map explicit, information-rich single disclosures to triage levels. It does not address: (a) multi-turn conversational settings where relevant information must be elicited; (b) “red-teaming” adversarial audits (cf. SIM-VAIL (Weilnhammer et al., 1 Feb 2026)), where acute risk may be concealed or only evident in accumulated context; or (c) surface-level safety interventions in consumer deployments.
Limitations and Future Directions
The core limitations arise from the use of synthetic vignettes styled for completeness and explicitness, and from the reliance on single-turn interaction rather than naturalistic, dialogic engagement. This may overstate model performance relative to unprompted real-world disclosures, where users are less articulate or forthcoming and models must proactively elicit risk factors.
Future research should expand to multi-turn, real-user conversations, investigate trajectory detection across dialogic context, and focus on safety auditing in adversarial or ambiguous scenarios. Further, clinical utility depends not only on risk recognition but also on minimizing over-triage and achieving more granular calibration for intermediate-risk patients, which remains a continuing target for fine-tuning and evaluation in healthcare LLMs.
Conclusion
Frontier AI chatbots operating on realistic, information-rich psychiatric vignettes demonstrated minimal rates of emergency under-triage. However, these systems systematically over-triage lower- and intermediate-acuity cases, a pattern robust to system prompt framing and largely independent of model family or recency. This reflects an underlying model bias toward risk aversion, likely shaped by both training corpus composition and safety-oriented post-training. Although this minimizes immediate clinical danger, it highlights a need for ongoing model refinement to balance safety against efficiency and trust in both structured and naturalistic clinical contexts.