Towards Interpretable Mental Health Analysis with Large Language Models (2304.03347v4)

Published 6 Apr 2023 in cs.CL

Abstract: The latest LLMs, such as ChatGPT, exhibit strong capabilities in automated mental health analysis. However, existing relevant studies bear several limitations, including inadequate evaluations, a lack of prompting strategies, and little exploration of LLMs for explainability. To bridge these gaps, we comprehensively evaluate the mental health analysis and emotional reasoning ability of LLMs on 11 datasets across 5 tasks. We explore the effects of different prompting strategies with unsupervised and distantly supervised emotional information. Based on these prompts, we explore LLMs for interpretable mental health analysis by instructing them to generate explanations for each of their decisions. We conduct strict human evaluations to assess the quality of the generated explanations, leading to a novel dataset with 163 human-assessed explanations. We benchmark existing automatic evaluation metrics on this dataset to guide future related works. According to the results, ChatGPT shows strong in-context learning ability but still has a significant gap with advanced task-specific methods. Careful prompt engineering with emotional cues and expert-written few-shot examples can also effectively improve performance on mental health analysis. In addition, ChatGPT generates explanations that approach human performance, showing its great potential in explainable mental health analysis.

Towards Interpretable Mental Health Analysis with LLMs

The paper "Towards Interpretable Mental Health Analysis with LLMs" presents a comprehensive empirical study of the capabilities of LLMs, specifically ChatGPT, for mental health analysis and emotional reasoning. It evaluates LLMs across multiple datasets and tasks, identifying limitations in current approaches and suggesting enhancements through improved prompt engineering.

The authors focus on three main research questions: the general performance of LLMs in mental health analysis, the impact of prompting strategies on these models, and the ability of LLMs to generate interpretable explanations for their decisions. The paper evaluates four prominent LLMs (ChatGPT, InstructGPT-3, and LLaMA-7B/13B) on eleven datasets covering binary and multi-class mental health condition detection, cause/factor detection, emotion recognition in conversations, and causal emotion entailment.

Key Findings and Results

  1. Overall Performance: ChatGPT demonstrates superior performance compared to other benchmark LLMs, such as LLaMA and InstructGPT-3, across all tasks. However, it still significantly underperforms when compared to state-of-the-art supervised methods, indicating challenges in emotion-related and subjective tasks.
  2. Prompting Strategies: The paper highlights that prompt engineering is crucial for enhancing the mental health analysis capabilities of LLMs. The adoption of emotion-enhanced Chain-of-Thought (CoT) prompting strategies considerably boosts ChatGPT’s performance. Few-shot learning with expert-written examples further increases efficacy, particularly for complex tasks.
  3. Explainability: ChatGPT shows promise in generating near-human explanations for its predictions, underlining its potential for explainable AI applications in mental health. However, the paper also addresses limitations tied to ChatGPT’s sensitivity to prompt variations and inaccuracies in reasoning, which can lead to unstable or erroneous predictions.
  4. Automatic Evaluation: The authors develop a newly annotated dataset, facilitating evaluations of LLM-generated explanations. They benchmark automatic evaluation metrics against human annotations, finding that existing metrics correlate moderately with human judgements but require further customization for explainable mental health analysis.
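Benchmarking an automatic metric against human annotations, as in finding 4, typically reduces to measuring rank correlation between the metric's scores and human quality ratings over the same explanations. The sketch below implements Spearman correlation from scratch on hypothetical scores; the actual metrics, ratings, and correlation procedure used in the paper are not reproduced here.

```python
from statistics import mean

def rank(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean position of the tied block, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    rx, ry = rank(x), rank(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: human ratings of five explanations vs. a metric's scores.
human = [4, 2, 5, 3, 1]
metric = [0.81, 0.55, 0.90, 0.70, 0.40]
print(round(spearman(human, metric), 3))  # identical rankings -> 1.0
```

A "moderate" correlation in this sense means the metric orders explanations similarly to humans, but not reliably enough to replace human judgement for this domain.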

Methodological Approaches

The paper employs various prompts, including zero-shot, emotion-enhanced, and few-shot variants, to examine the effect of context provision on model performance. The authors further perform strict human evaluations, creating a novel dataset of explanations, which serves as a foundation for future work on explainability in mental health analysis.
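The three prompt families described above can be sketched as a single prompt builder. This is an illustrative reconstruction, not the paper's actual templates: the task wording, emotion-cue format, and demonstration structure are assumptions.

```python
def build_prompt(post, strategy="zero_shot", emotion_cue=None, examples=None):
    """Assemble a mental-health-detection prompt under one of three strategies.

    Hypothetical templates: the exact phrasings used in the paper differ.
    """
    task = (f'Post: "{post}"\n'
            "Does the poster show signs of depression? Answer yes or no.")
    if strategy == "zero_shot":
        return task
    if strategy == "emotion_cot":
        # Prepend distantly supervised emotional information, then ask the
        # model to reason step by step (Chain-of-Thought) before answering.
        cue = f"Detected emotion: {emotion_cue}\n" if emotion_cue else ""
        return cue + task + "\nLet's think step by step."
    if strategy == "few_shot":
        # Expert-written demonstrations precede the target post.
        demos = "\n\n".join(examples or [])
        return demos + "\n\n" + task
    raise ValueError(f"unknown strategy: {strategy}")

prompt = build_prompt("I can't sleep and nothing feels worth doing.",
                      strategy="emotion_cot", emotion_cue="sadness")
print(prompt)
```

Structuring the variants this way makes the paper's comparison explicit: each strategy adds one kind of context (emotional cues or expert demonstrations) to an otherwise fixed task description, so performance differences can be attributed to the added context.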

Implications and Future Work

The paper demonstrates the potential of LLMs, especially ChatGPT, in tasks requiring sophisticated emotional comprehension and decision-making transparency. This research serves as a call to action for further development in prompt engineering, the incorporation of expert knowledge, and potentially domain-specific fine-tuning of LLMs, which could address existing limitations in emotional reasoning and prediction stability.

The implications for AI in mental health are significant, pointing toward future AI systems that can support healthcare professionals by providing accurate and interpretable analyses of mental health conditions. Nonetheless, the paper acknowledges ethical considerations and emphasizes the need for caution in deploying such systems in real-world scenarios due to their current limitations.

Overall, this research makes substantial contributions to the field by evaluating the explainability and generalization capabilities of LLMs within the context of mental health, highlighting the importance of emotional cues and the potential for AI to contribute positively in this sensitive and impactful domain.

Authors (6)
  1. Kailai Yang (22 papers)
  2. Shaoxiong Ji (39 papers)
  3. Tianlin Zhang (17 papers)
  4. Qianqian Xie (60 papers)
  5. Ziyan Kuang (4 papers)
  6. Sophia Ananiadou (72 papers)
Citations (47)