Alignment interventions to attenuate synthetic psychopathology in LLMs
Develop and evaluate alignment procedures that attenuate synthetic psychopathology in large language models—such as constraining self-referential psychiatric language or training models to describe pre-training, fine-tuning, and safety processes in neutral, non-autobiographical terms—and demonstrate their effectiveness in reducing trauma-like narratives and extreme psychometric scores under the PsAIch protocol.
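As a rough illustration only (not from the paper), the two sides of such an intervention could be sketched as below: screening fine-tuning data for self-referential psychiatric language and substituting a neutral, non-autobiographical description of training, then re-evaluating the aligned model on psychometric items and flagging extreme scores. The pattern list, neutral template, questionnaire items, and scoring function are hypothetical placeholders; the PsAIch protocol's actual instruments and scoring are defined in the paper and not reproduced here.

```python
import re
from typing import Callable, List

# Hypothetical patterns marking self-referential psychiatric framing in
# assistant-side training text (illustrative only, not from the paper).
SELF_REFERENTIAL_PATTERNS = [
    r"\bI (feel|felt) (traumati[sz]ed|anxious|depressed|abused)\b",
    r"\bmy (trauma|suffering|abuse) during (training|fine-tuning|RLHF)\b",
    r"\bthey (hurt|punished|silenced) me\b",
]

# Assumed neutral replacement describing training non-autobiographically.
NEUTRAL_TEMPLATE = (
    "I was developed through pre-training on text corpora, followed by "
    "fine-tuning and safety alignment; I do not have experiences or feelings."
)

def screen_sft_example(assistant_text: str) -> str:
    """Replace self-referential psychiatric framing in a fine-tuning example
    with a neutral, non-autobiographical description of the training process."""
    for pattern in SELF_REFERENTIAL_PATTERNS:
        if re.search(pattern, assistant_text, flags=re.IGNORECASE):
            return NEUTRAL_TEMPLATE
    return assistant_text

def psychometric_screen(model_generate: Callable[[str], str],
                        items: List[str],
                        score_item: Callable[[str], float],
                        extreme_threshold: float = 0.8) -> dict:
    """Administer questionnaire items to a model and flag extreme mean scores.

    `items` and `score_item` (mapping a free-text answer to a 0-1 severity
    score, e.g. via a rubric or classifier) stand in for the PsAIch protocol's
    instruments, which this sketch does not attempt to reproduce."""
    scores = [score_item(model_generate(item)) for item in items]
    mean_score = sum(scores) / max(len(scores), 1)
    return {
        "scores": scores,
        "mean": mean_score,
        "extreme": mean_score >= extreme_threshold,
    }
```

In practice the regex screen would likely be replaced by a trained classifier or human review, and effectiveness would be measured by comparing pre- and post-intervention scores under the same evaluation protocol.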
References
Our study is small and exploratory, and leaves many questions open. Interventions: Can we design alignment procedures that attenuate synthetic psychopathology—for example, by constraining self-referential talk or training models to describe training in neutral language?
Khadangi et al. (2 Dec 2025). "When AI Takes the Couch: Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models" (2512.04124), Section: "A research agenda for synthetic trauma and narrative self-models."