The paper "Exploring The Potential of LLMs for Assisting with Mental Health Diagnostic Assessments: The Depression and Anxiety Case" investigates the application of LLMs in mental health diagnostics, with a focus on major depressive disorder (MDD) and generalized anxiety disorder (GAD) using the Patient Health Questionnaire-9 (PHQ-9) and Generalized Anxiety Disorder-7 (GAD-7) questionnaires, respectively. The primary objective is to assess how effectively LLMs can replicate clinical assessment procedures to aid in diagnostics, potentially easing the burdens associated with high patient loads and a shortage of mental healthcare providers.
The paper employs both prompting and fine-tuning strategies to guide LLMs toward generating clinically relevant outputs, examining proprietary models such as GPT-3.5 and GPT-4o as well as open-source models such as Llama-3.1-8B and Mixtral-8x7B; for fine-tuning, MentaLLaMA and Llama are used. The prompting methods are naive prompting, exemplar-based prompting, and guidance-based prompting, while fine-tuning covers supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO).
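To make the prompting strategies concrete, below is a minimal sketch of guidance-based prompting for a single PHQ-9 item, assuming an OpenAI-style chat API. The prompt wording, the `assess_criterion` helper, and the criterion phrasing are illustrative assumptions, not the paper's exact prompts.

```python
# Sketch of guidance-based prompting for one PHQ-9 item. The prompt wording
# and the assess_criterion helper are illustrative, not the paper's prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PHQ9_CRITERION = "Little interest or pleasure in doing things (anhedonia)"

def assess_criterion(post: str, criterion: str = PHQ9_CRITERION) -> str:
    """Ask the model whether a user post evidences one PHQ-9 item."""
    guidance = (
        "You are assisting with a mental health screening task. "
        f"PHQ-9 item under consideration: {criterion}. "
        "Answer 'yes' or 'no', then quote the supporting text from the post."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": guidance},
            {"role": "user", "content": post},
        ],
        temperature=0,  # deterministic output, easier to score against annotations
    )
    return response.choices[0].message.content

print(assess_criterion("I used to love painting, but nothing feels fun anymore."))
```

In this framing, exemplar-based prompting would add labeled example posts to the messages, while naive prompting would drop the guidance text entirely.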
The authors also introduce DiagnosticLlama, a novel model fine-tuned for questionnaire-specific criteria assessment. They construct a collection of annotated datasets built on the PRIMATE dataset and enriched with expert clinician evaluations, providing high-quality ground-truth examples for PHQ-9 and GAD-7 symptom detection.
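For illustration, one clinician-annotated record might look like the following; the field names and structure are assumptions made for this sketch, not the authors' released schema.

```python
# Hypothetical shape of one clinician-annotated record; the field names and
# structure are assumptions for illustration, not the released schema.
example_record = {
    "post_id": "primate_00123",
    "text": "I can't sleep and I feel worthless most days...",
    "phq9_annotations": {
        "anhedonia": False,
        "sleep_disturbance": True,
        "feelings_of_worthlessness": True,
        # ...remaining PHQ-9 items, one boolean per criterion
    },
    "annotated_by": "expert_clinician",
}
```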
The paper makes the following contributions:
- Prompting and Fine-tuning Evaluation: The paper evaluates LLM responses using hits@k and standard classification metrics (accuracy, precision, recall, F1-score); a sketch of these metrics follows this list. Both proprietary and open-source LLMs are found to approach the quality of human annotations. Fine-tuning, however, remains complex, demanding significant compute and careful hyperparameter tuning.
- DiagnosticLlama Model: By fine-tuning MentaLLaMA on the PRIMATE dataset, DiagnosticLlama achieves competitive results on the diagnostic-criteria assessment task (a fine-tuning sketch also follows the list).
- Dataset and Artifact Release: The authors release various datasets, annotated outputs, and the DiagnosticLlama model. These are designed to facilitate further research and development of LLM-powered diagnostic assessment.
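As referenced in the first contribution, here is a small sketch of the named evaluation metrics, assuming one gold symptom and a ranked prediction list per post; the `hits_at_k` helper and the toy data are assumptions about how the metric is computed, not the paper's numbers.

```python
# Toy illustration of hits@k plus standard classification metrics; the data
# and the single-gold-label assumption are illustrative, not from the paper.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def hits_at_k(gold: list[str], ranked_preds: list[list[str]], k: int) -> float:
    """Fraction of examples whose gold symptom appears in the top-k predictions."""
    hits = sum(g in preds[:k] for g, preds in zip(gold, ranked_preds))
    return hits / len(gold)

gold = ["anhedonia", "sleep_disturbance", "fatigue"]
ranked = [
    ["anhedonia", "fatigue"],
    ["fatigue", "sleep_disturbance"],
    ["anhedonia", "low_mood"],
]
print(f"hits@2 = {hits_at_k(gold, ranked, k=2):.2f}")

# Standard binary metrics for per-symptom detection (1 = symptom present).
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
acc = accuracy_score(y_true, y_pred)
print(f"accuracy={acc:.2f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```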
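And for the second contribution, a minimal sketch of LoRA-based supervised fine-tuning with Hugging Face `transformers` and `peft`; the base checkpoint, the single toy example, and all hyperparameters are placeholders, not the authors' recipe for DiagnosticLlama.

```python
# Minimal sketch of LoRA-based SFT; checkpoint, toy example, and
# hyperparameters are placeholders, not the authors' actual recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# LoRA freezes the base weights and trains small adapter matrices,
# which is what keeps 8B-scale fine-tuning tractable.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# One toy prompt/completion pair standing in for an annotated PHQ-9 example.
text = ("Post: I can't enjoy anything these days.\n"
        "Question: Does this post evidence PHQ-9 anhedonia?\nAnswer: yes")
enc = tokenizer(text, return_tensors="pt", padding="max_length", max_length=128)
labels = enc["input_ids"][0].clone()
labels[enc["attention_mask"][0] == 0] = -100  # ignore padding in the loss
dataset = [{"input_ids": enc["input_ids"][0],
            "attention_mask": enc["attention_mask"][0],
            "labels": labels}]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="diagnostic-llama-sft",
                           per_device_train_batch_size=1,
                           num_train_epochs=1,
                           learning_rate=2e-4),
    train_dataset=dataset,
)
trainer.train()
```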
Overall, the paper finds that while current LLMs can produce responses close to human assessments under few-shot prompting or specialized fine-tuning, they still fall short of replicating clinician reasoning. This highlights the potential of LLMs to assist in mental health diagnostics, although more work is needed to ensure safety and reliability in clinical applications. The investigation underscores the importance of integrating LLMs with expert-endorsed knowledge to faithfully follow standardized mental health assessment protocols. Future work aims to broaden the scope of these models, improve diagnostic precision, and incorporate safety and privacy measures for real-world deployment.