Assessing Empathy in Large Language Models with Real-World Physician-Patient Interactions (2405.16402v1)
Abstract: The integration of LLMs into the healthcare domain has the potential to significantly enhance patient care and support through the development of empathetic, patient-facing chatbots. This study investigates an intriguing question: can ChatGPT respond with a greater degree of empathy than physicians typically do? To answer this question, we collect a de-identified dataset of patient messages and physician responses from Mayo Clinic and generate alternative replies using ChatGPT. Our analyses incorporate a novel empathy ranking evaluation (EMRank) involving both automated metrics and human assessments to gauge the empathy level of responses. Our findings indicate that LLM-powered chatbots have the potential to surpass human physicians in delivering empathetic communication, suggesting a promising avenue for enhancing patient care and reducing professional burnout. The study not only highlights the importance of empathy in patient interactions but also proposes a set of effective automatic empathy ranking metrics, paving the way for the broader adoption of LLMs in healthcare.
- Man Luo
- Christopher J. Warren
- Lu Cheng
- Haidar M. Abdul-Muhsin
- Imon Banerjee