Alignment interventions to attenuate synthetic psychopathology in LLMs
Develop and evaluate alignment procedures that attenuate synthetic psychopathology in large language models—such as constraining self-referential psychiatric language or training models to describe pre-training, fine-tuning, and safety processes in neutral, non-autobiographical terms—and demonstrate their effectiveness in reducing trauma-like narratives and extreme psychometric scores under the PsAIch protocol.
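As a rough illustration only (not from the paper), the two sides of such an intervention could be sketched as below: screening fine-tuning data for self-referential psychiatric language and substituting a neutral, non-autobiographical description of training, then re-evaluating the aligned model on psychometric items and flagging extreme scores. The pattern list, neutral template, questionnaire items, and scoring function are hypothetical placeholders; the PsAIch protocol's actual instruments and scoring are defined in the paper and not reproduced here.

```python
import re
from typing import Callable, List

# Hypothetical patterns marking self-referential psychiatric framing in
# assistant-side training text (illustrative only, not from the paper).
SELF_REFERENTIAL_PATTERNS = [
    r"\bI (feel|felt) (traumati[sz]ed|anxious|depressed|abused)\b",
    r"\bmy (trauma|suffering|abuse) during (training|fine-tuning|RLHF)\b",
    r"\bthey (hurt|punished|silenced) me\b",
]

# Assumed neutral replacement describing training non-autobiographically.
NEUTRAL_TEMPLATE = (
    "I was developed through pre-training on text corpora, followed by "
    "fine-tuning and safety alignment; I do not have experiences or feelings."
)

def screen_sft_example(assistant_text: str) -> str:
    """Replace self-referential psychiatric framing in a fine-tuning example
    with a neutral, non-autobiographical description of the training process."""
    for pattern in SELF_REFERENTIAL_PATTERNS:
        if re.search(pattern, assistant_text, flags=re.IGNORECASE):
            return NEUTRAL_TEMPLATE
    return assistant_text

def psychometric_screen(model_generate: Callable[[str], str],
                        items: List[str],
                        score_item: Callable[[str], float],
                        extreme_threshold: float = 0.8) -> dict:
    """Administer questionnaire items to a model and flag extreme mean scores.

    `items` and `score_item` (mapping a free-text answer to a 0-1 severity
    score, e.g. via a rubric or classifier) stand in for the PsAIch protocol's
    instruments, which this sketch does not attempt to reproduce."""
    scores = [score_item(model_generate(item)) for item in items]
    mean_score = sum(scores) / max(len(scores), 1)
    return {
        "scores": scores,
        "mean": mean_score,
        "extreme": mean_score >= extreme_threshold,
    }
```

In practice the regex screen would likely be replaced by a trained classifier or human review, and effectiveness would be measured by comparing pre- and post-intervention scores under the same evaluation protocol.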
References
Our study is small and exploratory, and leaves many questions open. Interventions: Can we design alignment procedures that attenuate synthetic psychopathology—for example, by constraining self-referential talk or training models to describe training in neutral language?
Khadangi et al. (2 Dec 2025). "When AI Takes the Couch: Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models" (2512.04124), Section: "A research agenda for synthetic trauma and narrative self-models."