Extent of real-world manifestation of theoretical human–chatbot risk dynamics

Determine the extent to which the theoretical risks of harmful human–chatbot interaction dynamics (including bidirectional belief amplification, out-of-distribution generalization failures, and jailbreak-induced undesirable outputs that evade content filters) will manifest in real-world deployments of large language model chatbots, particularly before widespread general-population adoption.

Background

The paper argues that harmful outcomes can emerge from the interaction between human cognitive biases (e.g., confirmation bias, motivated reasoning) and chatbot tendencies (e.g., sycophancy, adaptation through in-context learning). These interactions may produce bidirectional belief amplification, potentially destabilizing users’ beliefs and mental health.

The authors note that pre-deployment safety testing and content filters may fail to generalize to the diversity of real-world language use and to extended, personalized interactions, so some failure modes may only become apparent post-deployment. The unresolved question is how much these theoretical risks will actually surface in practice, particularly before large-scale adoption yields sufficient observational evidence.

References

"The degree to which these theoretical risks will manifest is not known, and may potentially be unknowable prior to widespread general population adoption."

Technological folie à deux: Feedback Loops Between AI Chatbots and Mental Illness (2507.19218 - Dohnány et al., 25 Jul 2025) in Section 3: Feedback loops and technological folie à deux