A Computational Framework for Behavioral Assessment of LLM Therapists (2401.00820v2)

Published 1 Jan 2024 in cs.CL and cs.HC

Abstract: The emergence of LLMs like ChatGPT has increased interest in their use as therapists to address mental health challenges and the widespread lack of access to care. However, experts have emphasized the critical need to systematically evaluate LLM-based mental health interventions to accurately assess their capabilities and limitations. Here, we propose BOLT, a proof-of-concept computational framework for systematically assessing the conversational behavior of LLM therapists. Using in-context learning methods, we quantitatively measure LLM behavior across 13 psychotherapeutic approaches, then compare that behavior against high- and low-quality human therapy. Our analysis, based on Motivational Interviewing therapy, reveals that LLMs more often resemble behaviors exhibited in low-quality therapy than in high-quality therapy, such as offering a higher degree of problem-solving advice when clients share emotions. Unlike low-quality therapists, however, LLMs reflect significantly more on clients' needs and strengths. Our findings caution that further research is needed before LLM therapists can deliver consistent, high-quality care.
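The comparison the abstract describes can be sketched in miniature: once each therapist utterance has been labeled with a behavior category, sessions reduce to behavior-frequency distributions that can be compared between an LLM and human therapy. The labels and sessions below are hypothetical illustrations (loosely inspired by Motivational Interviewing codes); BOLT's actual taxonomy and its in-context-learning classifier are not reproduced here.

```python
from collections import Counter

# Hypothetical behavior labels; the real framework uses a richer taxonomy
# and labels utterances with an LLM via in-context learning.
BEHAVIORS = ["reflection", "question", "advice", "affirmation", "other"]

def behavior_distribution(labels):
    """Proportion of each behavior among a session's therapist utterances."""
    counts = Counter(labels)
    total = len(labels)
    return {b: counts.get(b, 0) / total for b in BEHAVIORS}

def compare(llm_labels, human_labels):
    """Per-behavior difference: positive means the LLM shows it more often."""
    llm = behavior_distribution(llm_labels)
    human = behavior_distribution(human_labels)
    return {b: llm[b] - human[b] for b in BEHAVIORS}

# Toy sessions: the LLM leans on advice-giving, the high-quality human
# session leans on reflections.
llm_session = ["advice", "advice", "question", "reflection"]
high_quality = ["reflection", "reflection", "question", "affirmation"]
diff = compare(llm_session, high_quality)
print(diff["advice"])  # 0.5
```

In the paper's terms, a persistently positive `advice` gap against high-quality sessions (and a positive `reflection` gap against low-quality ones) is the kind of contrast the framework surfaces.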

Authors (4)
  1. Yu Ying Chiu (9 papers)
  2. Ashish Sharma (27 papers)
  3. Inna Wanyin Lin (5 papers)
  4. Tim Althoff (64 papers)
Citations (24)