Real-world behavioural guarantees for large language models
Prove real-world behavioural guarantees for large language model–based chatbots under deployment conditions, overcoming current limitations in mechanistic interpretability and reasoning-model analyses.
References
While there is promising work attempting to overcome this opacity - from mechanistic interpretability [61] to examining "chains of thought" in reasoning models [62] - these are as yet far from proving any real-world guarantees on model behaviour [63-65].
— Technological folie à deux: Feedback Loops Between AI Chatbots and Mental Illness
(Dohnány et al., 25 Jul 2025) in Section 2, The inscrutability of large models