Exploring The Potential of Large Language Models for Assisting with Mental Health Diagnostic Assessments -- The Depression and Anxiety Case (2501.01305v1)

Published 2 Jan 2025 in cs.CL

Abstract: LLMs are increasingly attracting the attention of healthcare professionals for their potential to assist in diagnostic assessments, which could alleviate the strain on the healthcare system caused by a high patient load and a shortage of providers. For LLMs to be effective in supporting diagnostic assessments, it is essential that they closely replicate the standard diagnostic procedures used by clinicians. In this paper, we specifically examine the diagnostic assessment processes described in the Patient Health Questionnaire-9 (PHQ-9) for major depressive disorder (MDD) and the Generalized Anxiety Disorder-7 (GAD-7) questionnaire for generalized anxiety disorder (GAD). We investigate various prompting and fine-tuning techniques to guide both proprietary and open-source LLMs in adhering to these processes, and we evaluate the agreement between LLM-generated diagnostic outcomes and expert-validated ground truth. For fine-tuning, we utilize the Mentalllama and Llama models, while for prompting, we experiment with proprietary models like GPT-3.5 and GPT-4o, as well as open-source models such as llama-3.1-8b and mixtral-8x7b.

The paper "Exploring The Potential of LLMs for Assisting with Mental Health Diagnostic Assessments: The Depression and Anxiety Case" investigates the application of LLMs in mental health diagnostics, with a focus on major depressive disorder (MDD) and generalized anxiety disorder (GAD) using the Patient Health Questionnaire-9 (PHQ-9) and Generalized Anxiety Disorder-7 (GAD-7) questionnaires, respectively. The primary objective is to assess how effectively LLMs can replicate clinical assessment procedures to aid in diagnostics, potentially easing the burdens associated with high patient loads and a shortage of mental healthcare providers.

The paper employs both prompting and fine-tuning strategies to guide LLMs toward clinically relevant outputs, examining proprietary models such as GPT-3.5 and GPT-4o as well as open-source models including llama-3.1-8b and mixtral-8x7b. For fine-tuning, models such as MentaLLaMA and Llama are used. Prompting methods include naive prompting, exemplar-based prompting, and guidance-based prompting, while fine-tuning draws on supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO).
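
As a rough illustration of what guidance-based prompting for a single PHQ-9 item could look like, here is a minimal sketch assuming an OpenAI-style chat client; the prompt wording and the `build_messages` helper are hypothetical, not the paper's actual templates.

```python
# Hypothetical guidance-based prompt for one PHQ-9 item; the wording
# below is illustrative, not the authors' exact template.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PHQ9_ITEM = "Feeling down, depressed, or hopeless"

def build_messages(post: str) -> list[dict]:
    guidance = (
        "You are assisting with a mental health diagnostic assessment. "
        f"PHQ-9 item: '{PHQ9_ITEM}'. Decide whether the patient's text "
        "provides evidence for this symptom. Answer 'yes' or 'no' and "
        "quote the supporting span."
    )
    return [
        {"role": "system", "content": guidance},
        {"role": "user", "content": post},
    ]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=build_messages("I can't shake this hopeless feeling lately."),
)
print(response.choices[0].message.content)
```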

The authors also introduce a novel model named DiagnosticLlama, fine-tuned to assess questionnaire-specific diagnostic criteria. They build a collection of annotated datasets on top of the PRIMATE dataset, enriched with expert clinician evaluations to provide high-quality ground truth for PHQ-9 and GAD-7 symptom detection.
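
The released annotations are not reproduced here, but a PRIMATE-style record pairing a Reddit post with per-item PHQ-9 evidence labels might look roughly like the following; every field name and value is hypothetical.

```python
# Hypothetical shape of one annotated example; this is NOT the
# dataset's actual schema, only an illustration of the idea.
example = {
    "post": "I haven't slept properly in weeks and nothing feels worth doing.",
    "phq9_labels": {
        "Little interest or pleasure in doing things": True,
        "Trouble falling or staying asleep, or sleeping too much": True,
        "Feeling down, depressed, or hopeless": False,
        # ... remaining PHQ-9 items
    },
    "annotation_source": "expert clinician",
}
```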

The paper makes the following contributions:

  1. Prompting and Fine-tuning Evaluation: The paper evaluates LLM responses using hits@k and standard classification metrics (accuracy, precision, recall, F1-score); a sketch of these metrics follows this list. Both proprietary and open-source LLMs are found to approach the quality of human annotations. Fine-tuning, however, remains complex, requiring significant resources and careful hyperparameter tuning.
  2. DiagnosticLlama Model: Fine-tuned from MentaLLaMA on the PRIMATE dataset, DiagnosticLlama achieves competitive results on the diagnostic criteria assessment task.
  3. Dataset and Artifact Release: The authors release various datasets, annotated outputs, and the DiagnosticLlama model. These are designed to facilitate further research and development of LLM-powered diagnostic assessment.

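A minimal sketch of the metrics named in the first contribution follows, assuming binary per-symptom decisions and, for hits@k, a ranked list of predicted symptoms per post. The toy data and the hits@k definition used here (at least one gold symptom among the top k) are assumptions, not the paper's exact protocol.

```python
# Minimal metric sketch; toy data, and the hits@k definition is an
# assumption (at least one gold symptom among the top-k predictions).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0]  # gold per-symptom decisions
y_pred = [1, 0, 0, 1, 0]  # model decisions

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")

def hits_at_k(gold: list[set], ranked: list[list], k: int) -> float:
    """Fraction of posts with at least one gold symptom in the top-k predictions."""
    hits = sum(bool(g & set(r[:k])) for g, r in zip(gold, ranked))
    return hits / len(gold)

gold_symptoms = [{"anhedonia"}, {"sleep", "fatigue"}]
ranked_preds = [["sleep", "anhedonia"], ["fatigue", "appetite"]]

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
print(f"hits@2={hits_at_k(gold_symptoms, ranked_preds, k=2):.2f}")
```
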
Overall, the paper finds that while current LLMs can produce responses close to human assessments under few-shot prompting or specialized fine-tuning, they still fall short of replicating clinician reasoning. This highlights the potential for LLMs to assist in mental health diagnostics, although more work is needed to ensure safety and reliability in clinical applications. The investigation underscores the importance of grounding LLMs in expert-endorsed knowledge so they can faithfully follow standardized mental health assessment protocols. Future work aims to expand the scope of these models, improve diagnostic precision, and incorporate safety and privacy measures for real-world deployment.

Authors (7)
  1. Kaushik Roy (265 papers)
  2. Harshul Surana (4 papers)
  3. Darssan Eswaramoorthi (2 papers)
  4. Yuxin Zi (8 papers)
  5. Vedant Palit (6 papers)
  6. Ritvik Garimella (3 papers)
  7. Amit Sheth (127 papers)