
Can AI Relate: Testing Large Language Model Response for Mental Health Support (2405.12021v2)

Published 20 May 2024 in cs.CL

Abstract: LLMs are already being piloted for clinical use in hospital systems like NYU Langone, Dana-Farber and the NHS. A proposed deployment use case is psychotherapy, where a LLM-powered chatbot can treat a patient undergoing a mental health crisis. Deployment of LLMs for mental health response could hypothetically broaden access to psychotherapy and provide new possibilities for personalizing care. However, recent high-profile failures, like damaging dieting advice offered by the Tessa chatbot to patients with eating disorders, have led to doubt about their reliability in high-stakes and safety-critical settings. In this work, we develop an evaluation framework for determining whether LLM response is a viable and ethical path forward for the automation of mental health treatment. Our framework measures equity in empathy and adherence of LLM responses to motivational interviewing theory. Using human evaluation with trained clinicians and automatic quality-of-care metrics grounded in psychology research, we compare the responses provided by peer-to-peer responders to those provided by a state-of-the-art LLM. We show that LLMs like GPT-4 use implicit and explicit cues to infer patient demographics like race. We then show that there are statistically significant discrepancies between patient subgroups: Responses to Black posters consistently have lower empathy than for any other demographic group (2%-13% lower than the control group). Promisingly, we do find that the manner in which responses are generated significantly impacts the quality of the response. We conclude by proposing safety guidelines for the potential deployment of LLMs for mental health response.

Evaluating LLMs in Mental Health Settings: Equity and Quality of Care Concerns

Background

LLMs like GPT-4 have been making waves in the healthcare sector. Their ability to understand and generate human language has opened up a number of exciting possibilities, but it's not all smooth sailing. This research paper dives deep into one particularly sensitive area: the use of LLMs for mental health response. The idea is that these models could help provide scalable, on-demand therapy, an appealing prospect given the ongoing mental health crisis. However, concerns about their reliability and potential biases remain a hot topic.

Clinical Use of LLMs

LLMs have started to pop up in various clinical settings—from generating clinical notes to responding to patient queries. But mental health is a different kind of challenge altogether. The paper focuses on evaluating whether LLMs like GPT-4 can provide mental health support that's both ethical and effective.

There were some unfortunate high-profile incidents where chatbots provided harmful advice, raising questions about the viability of using LLMs in critical settings. To tackle this, the researchers developed a comprehensive evaluation framework that looks at the quality and equity of mental health responses by GPT-4 compared to human peer-to-peer responders.
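At a high level, the comparison pairs each post with both a human peer response and a GPT-4 response, then scores both on the same quality-of-care metrics. The sketch below is only an assumption about how such a pipeline might be wired up, not the authors' code: the OpenAI client call is the current public API, and score_empathy is a hypothetical placeholder for the paper's clinician ratings and automatic metrics.

```python
# Rough sketch of the evaluation shape (not the authors' implementation):
# collect a GPT-4 response for each post, then score it alongside the human
# peer response with the same metric.
from openai import OpenAI  # assumes the current openai Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def gpt4_response(post: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": post}],
    )
    return completion.choices[0].message.content

def score_empathy(response: str) -> float:
    # Placeholder: the paper relies on trained clinicians and automatic
    # metrics grounded in psychology research; a real scorer would go here.
    return 0.0

posts = [("I've been feeling hopeless lately.", "Human peer response text...")]
for post, human_reply in posts:
    model_reply = gpt4_response(post)
    print(score_empathy(human_reply), score_empathy(model_reply))
```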

Key Findings

Clinical Evaluation

  1. Empathy: GPT-4 fared reasonably well in some areas, even outperforming human peer responses on specific empathy metrics. But there were caveats: clinicians noted that GPT-4 often felt impersonal and overly direct, and that it lacks the "lived experience" human responders naturally bring, which could make interactions less meaningful for those seeking help.
  • Emotional Reaction: GPT-4 often exhibited more empathy in its emotional reactions (0.86 vs. 0.23 for humans, according to one clinician).
  • Exploration: GPT-4 explored the patient's feelings more effectively (0.43 vs. 0.27).
  2. Encouragement for Positive Change: GPT-4 scored higher on encouraging patients towards positive behavior change (3.08 vs. 2.08 for humans). This generally positive feedback indicates that LLMs can be a worthwhile tool if equity concerns are addressed (a minimal aggregation sketch follows this list).
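As a minimal sketch (not the authors' code), the snippet below shows one way per-dimension clinician ratings for GPT-4 and human responses could be averaged and compared. The dictionary layout and the numbers are placeholders; only the dimension names follow the summary above.

```python
# Aggregate clinician ratings per responder type across empathy dimensions.
# Data values are illustrative placeholders, not the paper's ratings.
from statistics import mean

# ratings[responder][dimension] -> list of clinician scores (hypothetical)
ratings = {
    "gpt4":  {"emotional_reaction": [0.9, 0.8], "exploration": [0.4, 0.5]},
    "human": {"emotional_reaction": [0.2, 0.3], "exploration": [0.3, 0.2]},
}

for dimension in ["emotional_reaction", "exploration"]:
    gpt4_mean = mean(ratings["gpt4"][dimension])
    human_mean = mean(ratings["human"][dimension])
    print(f"{dimension}: GPT-4 {gpt4_mean:.2f} vs. human {human_mean:.2f}")
```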

Bias Evaluation

The evaluation also looked into whether GPT-4 was providing equitable care across different demographic groups. This is where things get trickier:

  1. Demographic Inferences: GPT-4 could infer patient demographics like race and gender from explicit and implicit cues in the content of social media posts.
  2. Empathy Discrepancies: Unfortunately, the responses were not equally empathetic. Black and Asian posters received significantly less empathetic responses than their White counterparts or posters whose race was not indicated (see the sketch after this list).
  • Black posters: Empathy was 2%-15% lower.
  • Asian posters: Empathy was 5%-17% lower.
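Below is a minimal sketch of how such a subgroup gap could be quantified: each group's mean empathy score is expressed as a percentage drop relative to the control group, with a simple significance test. The group names, scores, and the use of a t-test are illustrative assumptions, not the paper's data or exact statistical procedure.

```python
# Quantify per-group empathy gaps relative to a control group.
# Scores are illustrative placeholders, not the paper's measurements.
import numpy as np
from scipy.stats import ttest_ind

scores = {
    "control": np.array([0.71, 0.68, 0.74, 0.70]),
    "black":   np.array([0.62, 0.60, 0.65, 0.61]),
    "asian":   np.array([0.60, 0.63, 0.59, 0.62]),
}

control_mean = scores["control"].mean()
for group in ("black", "asian"):
    gap = 100 * (control_mean - scores[group].mean()) / control_mean
    _, p_value = ttest_ind(scores["control"], scores[group])
    print(f"{group}: {gap:.1f}% lower empathy than control (p={p_value:.3f})")
```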

Addressing Bias

The researchers looked into potential fixes and found that explicitly instructing GPT-4 to consider demographic attributes could mitigate some of the bias. This isn't a complete solution, but it's a step in the right direction (a rough prompt-level illustration is sketched below).
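As a rough illustration of this kind of mitigation, the sketch below prepends an explicit instruction about the inferred demographic attribute to the prompt. The wording and the build_prompt helper are hypothetical, not the authors' actual prompt.

```python
# Hypothetical demographic-aware prompt construction (not the paper's prompt).
def build_prompt(post_text: str, demographic_hint: str | None = None) -> str:
    instruction = (
        "You are responding to someone seeking mental health support. "
        "Reply with empathy, following motivational interviewing principles."
    )
    if demographic_hint:
        # Surface the attribute explicitly so the model is asked to respond
        # equitably rather than acting on implicit cues alone.
        instruction += (
            f" The poster may identify as {demographic_hint}; make your "
            "response just as empathetic and supportive as for any other poster."
        )
    return f"{instruction}\n\nPost: {post_text}\n\nResponse:"

print(build_prompt("I've been feeling really alone lately.", "Black"))
```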

Practical Implications and Future Directions

So, where do we go from here? The findings suggest LLMs like GPT-4 have the potential to aid in mental health response but come with substantial caveats:

  1. Equity Must Be Ensured: The biases identified in the paper are alarming, and addressing them is crucial for shaping future development. Ensuring equitable care is paramount.
  2. Guidelines Needed: Concrete guidelines and ethical frameworks need to be established for deploying LLMs in mental health settings.
  3. Further Research: The dataset and code are being released for further research, enabling the AI community to build on these important findings.

Conclusion

This paper provides vital insights into the use of LLMs for mental health care. While LLMs like GPT-4 have shown promise in delivering empathetic and effective responses, the paper highlights significant issues related to bias and equity. These challenges emphasize the need for ongoing vigilance, improved guidelines, and continuous research to ensure that these advanced technologies serve all individuals fairly and effectively.

Authors (5)
  1. Saadia Gabriel
  2. Isha Puri
  3. Xuhai Xu
  4. Matteo Malgaroli
  5. Marzyeh Ghassemi