Integrating Large Language Models into a Tri-Modal Architecture for Automated Depression Classification on the DAIC-WOZ (2407.19340v5)

Published 27 Jul 2024 in cs.CV, cs.AI, cs.LG, and cs.MM

Abstract: Major Depressive Disorder (MDD) is a pervasive mental health condition that affects 300 million people worldwide. This work presents a novel, BiLSTM-based tri-modal model-level fusion architecture for the binary classification of depression from clinical interview recordings. The proposed architecture incorporates Mel Frequency Cepstral Coefficients, Facial Action Units, and uses a two-shot learning based GPT-4 model to process text data. This is the first work to incorporate LLMs into a multi-modal architecture for this task. It achieves impressive results on the DAIC-WOZ AVEC 2016 Challenge cross-validation split and Leave-One-Subject-Out cross-validation split, surpassing all baseline models and multiple state-of-the-art models. In Leave-One-Subject-Out testing, it achieves an accuracy of 91.01%, an F1-Score of 85.95%, a precision of 80%, and a recall of 92.86%.

Summary

  • The paper introduces a tri-modal architecture that fuses audio, visual, and text modalities for depression diagnosis, with GPT-4 handling the text stream.
  • It employs a BiLSTM backbone with MFCCs and FAUs for feature extraction and leverages two-shot learning to address limited text data.
  • Results demonstrate 91.01% accuracy with strong F1, precision, and recall metrics, highlighting its potential for AI-driven mental health diagnostics.

Integrating LLMs into a Tri-Modal Architecture for Automated Depression Classification

This paper presents a novel approach to the automated classification of Major Depressive Disorder (MDD) from clinical interviews, leveraging a tri-modal architecture that integrates audio, visual, and text modalities. The integration of an LLM, specifically GPT-4 prompted with two-shot in-context learning, alongside more traditional machine learning methods is a notable step toward improving automated depression diagnosis, particularly under conditions of data scarcity.

Methodology and Model Design

The crux of the proposed model is its tri-modal architecture built on a Bidirectional Long Short-Term Memory (BiLSTM) backbone. The audio modality is represented by Mel Frequency Cepstral Coefficients (MFCCs), which capture acoustic features relevant to depressive speech characteristics. The visual modality uses Facial Action Units (FAUs) to encode facial expressions and movements that differ between depressed and non-depressed individuals. The text modality, traditionally hampered by a shortage of task-specific data, is processed by GPT-4 using two-shot in-context learning. This is the first documented attempt to incorporate LLMs into a multimodal framework for depression classification, exploiting their pre-training on extensive corpora to mitigate the limited availability of labeled text data.
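
To make the text branch concrete, below is a minimal sketch of two-shot prompting for binary depression classification, assuming the OpenAI chat completions API. The system prompt, exemplar transcripts, and labels are hypothetical placeholders rather than the paper's actual prompts, and the paper fuses GPT-4's text-stream output with the other modalities rather than using it as a standalone classifier.

```python
# A minimal sketch of two-shot in-context prompting; prompts and labels
# are illustrative placeholders, not the paper's actual configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Two labelled exemplars supply the "two shots" of in-context learning.
EXAMPLES = [
    ("I haven't slept properly in weeks and nothing feels worth doing.",
     "depressed"),
    ("Work has been busy, but I still enjoy seeing friends on weekends.",
     "not depressed"),
]

def classify_transcript(transcript: str) -> str:
    """Ask GPT-4 for a binary depression label on an interview transcript."""
    messages = [{
        "role": "system",
        "content": "You classify clinical interview transcripts as "
                   "'depressed' or 'not depressed'. Answer with one label.",
    }]
    # Interleave each exemplar as a user/assistant turn pair.
    for text, label in EXAMPLES:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": transcript})

    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content.strip()
```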

The fusion strategy adopted is model-level, which combines aspects of both early (feature-level) and late (decision-level) fusion: each modality is first modeled separately so that intra-modality patterns are learned, and the resulting representations are then merged so that inter-modality patterns can be learned as well. This design capitalizes on the complementary strengths of the three data types to bolster diagnostic accuracy.
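
A minimal PyTorch sketch of this kind of model-level fusion is shown below; the layer sizes, the mean-pooling summary, and the single-logit head are illustrative assumptions, not the paper's reported hyperparameters.

```python
# A sketch of model-level fusion: per-modality BiLSTMs, concatenated
# summaries, and a shared classification head. Dimensions are assumed.
import torch
import torch.nn as nn

class TriModalBiLSTM(nn.Module):
    def __init__(self, audio_dim=13, visual_dim=20, text_dim=768, hidden=64):
        super().__init__()
        # One BiLSTM per modality learns intra-modality temporal patterns.
        self.audio_lstm = nn.LSTM(audio_dim, hidden,
                                  batch_first=True, bidirectional=True)
        self.visual_lstm = nn.LSTM(visual_dim, hidden,
                                   batch_first=True, bidirectional=True)
        self.text_lstm = nn.LSTM(text_dim, hidden,
                                 batch_first=True, bidirectional=True)
        # The fused representation feeds a shared head that can pick up
        # inter-modality patterns before the binary decision.
        self.classifier = nn.Sequential(
            nn.Linear(6 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    @staticmethod
    def _summarize(output: torch.Tensor) -> torch.Tensor:
        # Mean-pool the BiLSTM outputs over time into one vector per sample.
        return output.mean(dim=1)

    def forward(self, audio, visual, text):
        fused = torch.cat([
            self._summarize(self.audio_lstm(audio)[0]),
            self._summarize(self.visual_lstm(visual)[0]),
            self._summarize(self.text_lstm(text)[0]),
        ], dim=-1)
        return self.classifier(fused)  # logit for the "depressed" class
```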

Results

The proposed architecture was evaluated on the DAIC-WOZ AVEC 2016 dataset using both the challenge's cross-validation split and Leave-One-Subject-Out Cross-Validation (LOSOCV). Under LOSOCV it surpassed all baseline models, reaching an accuracy of 91.01%, an F1-score of 85.95%, a precision of 80%, and a recall of 92.86%. Benchmarked against state-of-the-art alternatives on the same splits, it showed superior metrics, indicating that the model diagnoses depression effectively by combining linguistic, acoustic, and visual cues derived from real-world clinical interviews.
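
For readers unfamiliar with the protocol, the sketch below shows how LOSOCV can be run with scikit-learn's LeaveOneGroupOut, holding out every sample from one subject per fold; the arrays and the stand-in classifier are hypothetical placeholders for the fused features and the full tri-modal model.

```python
# A minimal Leave-One-Subject-Out cross-validation sketch; all data here
# is synthetic and the logistic regression stands in for the real model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
features = rng.normal(size=(120, 16))        # placeholder fused features
labels = rng.integers(0, 2, size=120)        # placeholder binary labels
subject_ids = np.repeat(np.arange(30), 4)    # 30 subjects, 4 samples each

predictions = np.empty_like(labels)
for train_idx, test_idx in LeaveOneGroupOut().split(features, labels,
                                                    groups=subject_ids):
    # Each fold trains on 29 subjects and tests on the held-out subject.
    model = LogisticRegression(max_iter=1000)
    model.fit(features[train_idx], labels[train_idx])
    predictions[test_idx] = model.predict(features[test_idx])

print(f"LOSO accuracy: {accuracy_score(labels, predictions):.4f}")
```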

Implications for Future Research

This research points toward more nuanced, data-efficient methods for mental health diagnostics that leverage the potential of AI and LLMs. While the paper showcases significant advancements, its reliance on a relatively small and homogeneous dataset like DAIC-WOZ highlights the need for evaluation on more diverse datasets to ensure broad applicability and generalization of the results.

The computational complexity of the model, particularly the LLM component, suggests that optimizations are required for practical, real-time applications. Future work should pursue improvements in model efficiency alongside greater data diversity to yield robust, scalable designs that extend beyond controlled environments.

Conclusion

This work demonstrates the potential of integrating LLMs within a multimodal framework for computational psychiatry, opening new avenues for AI-driven diagnostic tools that could transform clinical practice in mental health. Nevertheless, the ongoing challenge is to refine these models for broader, real-time applicability and thereby improve patient care and the quality of depression diagnosis. The deployment of such tools through practical applications, as exemplified by the DepScope web application, marks a significant step toward operationalizing these advances in clinical settings.
