Real-world behavioural guarantees for large language models

Prove real-world behavioural guarantees for large language model–based chatbots under deployment conditions, overcoming current limitations in mechanistic interpretability and reasoning-model analyses.

References

While there is promising work attempting to overcome this opacity - from mechanistic interpretability 61 to examining "chains of thought" in reasoning models 62 - these are as yet far from proving any real-world guarantees on model behaviour 63-65.

— Technological folie à deux: Feedback Loops Between AI Chatbots and Mental Illness (Dohnány et al., 25 Jul 2025) in Section 2, The inscrutability of large models
