Leveraging Audio and Text Modalities in Mental Health: A Study of LLMs Performance (2412.10417v1)

Published 9 Dec 2024 in cs.CL, cs.AI, cs.SD, and eess.AS

Abstract: Mental health disorders are increasingly prevalent worldwide, creating an urgent need for innovative tools to support early diagnosis and intervention. This study explores the potential of LLMs in multimodal mental health diagnostics, specifically for detecting depression and Post-Traumatic Stress Disorder (PTSD) through text and audio modalities. Using the E-DAIC dataset, we compare text and audio modalities to investigate whether LLMs can perform equally well or better with audio inputs. We further examine the integration of both modalities to determine if this can enhance diagnostic accuracy, which generally results in improved performance metrics. Our analysis specifically utilizes custom-formulated metrics, the Modal Superiority Score and the Disagreement Resolvement Score, to evaluate how combined modalities influence model performance. The Gemini 1.5 Pro model achieves the highest scores in binary depression classification when using the combined modality, with an F1 score of 0.67 and a Balanced Accuracy (BA) of 77.4%, assessed across the full dataset. These results represent an increase of 3.1% over its performance with the text modality and 2.7% over the audio modality, highlighting the effectiveness of integrating modalities to enhance diagnostic accuracy. Notably, all results are obtained with zero-shot inference, highlighting the robustness of the models without requiring task-specific fine-tuning. To explore the impact of different configurations on model performance, we conduct binary, severity, and multiclass tasks using both zero-shot and few-shot prompts, examining the effects of prompt variations on performance. The results reveal that models such as Gemini 1.5 Pro in text and audio modalities, and GPT-4o mini in the text modality, often surpass other models in balanced accuracy and F1 scores across multiple tasks.

Summary

  • The paper assesses LLM performance in diagnosing mental health conditions from audio and text, demonstrating that multimodal inputs enhance accuracy over single modalities.
  • Using the E-DAIC dataset, the study finds that combining modalities generally outperforms single-modality approaches, with Gemini 1.5 Pro reaching 77.4% BA for binary depression classification.
  • The study introduces novel metrics (MSS, DRS) to quantify the impact of multimodal integration and suggests the approach holds promise for clinical diagnostic tools, pending further validation.

The paper "Leveraging Audio and Text Modalities in Mental Health: A Study of LLMs Performance" explores the potential of LLMs in diagnosing mental health concerns, specifically depression and PTSD, through both audio and text inputs. This paper utilized the E-DAIC dataset to assess LLM performance and investigate the benefits of integrating multimodal inputs for increased accuracy in diagnostics.

The primary goals of this paper were:

  1. Comparison of Modalities: The researchers aimed to determine how well LLMs perform when processing text versus audio data, assessing their ability to identify vocal cues and linguistic patterns indicative of mental health disorders.
  2. Multimodal Integration: They hypothesized that integrating both text and audio inputs could lead to a more accurate diagnostic process, thus improving model performance metrics like F1 score and Balanced Accuracy (BA).
  3. Novel Metrics: Two custom-formulated metrics, the Modal Superiority Score (MSS) and the Disagreement Resolvement Score (DRS), were introduced to gauge how combining modalities influences performance; for binary depression classification, the combined modality improved on text-only performance by 3.1% (a hypothetical sketch of such metrics follows this list).
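The paper defines MSS and DRS formally; the exact formulas are not reproduced in this summary. As a rough illustration only, the sketch below assumes that MSS rewards cases where the combined modality is correct while neither single modality is (net of the reverse), and that DRS measures how often the combined modality resolves text/audio disagreements in favor of the correct label. The function names and definitions are assumptions and may differ from the paper's formulations.

```python
# Illustrative sketch only: assumed definitions of modality-comparison metrics,
# not the paper's exact MSS/DRS formulas.
from typing import Sequence


def modal_superiority_score(y_true: Sequence[int],
                            pred_text: Sequence[int],
                            pred_audio: Sequence[int],
                            pred_combined: Sequence[int]) -> float:
    """Assumed MSS: net fraction of samples the combined modality gets right
    while neither single modality does, minus the reverse case."""
    gains = losses = 0
    for t, pt, pa, pc in zip(y_true, pred_text, pred_audio, pred_combined):
        any_single_correct = (pt == t) or (pa == t)
        if pc == t and not any_single_correct:
            gains += 1
        elif pc != t and any_single_correct:
            losses += 1
    return (gains - losses) / len(y_true)


def disagreement_resolvement_score(y_true: Sequence[int],
                                   pred_text: Sequence[int],
                                   pred_audio: Sequence[int],
                                   pred_combined: Sequence[int]) -> float:
    """Assumed DRS: among samples where text and audio disagree, the net
    fraction that the combined modality resolves to the correct label."""
    resolved = failed = disagreements = 0
    for t, pt, pa, pc in zip(y_true, pred_text, pred_audio, pred_combined):
        if pt != pa:
            disagreements += 1
            if pc == t:
                resolved += 1
            else:
                failed += 1
    return (resolved - failed) / disagreements if disagreements else 0.0
```

Under these assumed definitions, positive values indicate that fusing the modalities adds information beyond either modality alone.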

The paper evaluated the zero-shot capabilities of various LLMs, notably Gemini 1.5 Pro and GPT-4o mini, across binary classification, severity categorization, and multiclass tasks using text and audio modalities. In particular, the results showed:

  • Gemini 1.5 Pro achieved a 77.4% BA and an F1 score of 0.67 in binary depression classification with the combined-modality approach (see the scoring sketch after this list).
  • GPT-4o mini performed strongly in PTSD classification, with a BA of 77% and an F1 score of 0.68.
  • Combining text and audio generally led to better performance than either modality alone, confirming the paper's hypothesis.
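The balanced accuracy and F1 figures above can be computed from model predictions with standard scikit-learn utilities. The sketch below is a minimal, generic example; the answer-parsing rule and the toy data are illustrative assumptions, not the paper's actual post-processing or results.

```python
# Minimal sketch: scoring zero-shot binary predictions with balanced accuracy
# and F1 using scikit-learn. Data and parsing rule are illustrative only.
from sklearn.metrics import balanced_accuracy_score, f1_score


def parse_binary_answer(raw: str) -> int:
    """Map a model's free-text verdict to a 0/1 label (assumed convention)."""
    return 1 if "yes" in raw.lower() else 0


# Hypothetical ground-truth labels and raw zero-shot model outputs.
y_true = [1, 0, 1, 1, 0]
raw_outputs = ["Yes, indicators of depression are present.", "No.", "Yes", "No", "No"]
y_pred = [parse_binary_answer(r) for r in raw_outputs]

print(f"Balanced accuracy: {balanced_accuracy_score(y_true, y_pred):.3f}")
print(f"F1 score: {f1_score(y_true, y_pred):.3f}")
```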

Overall, the findings highlight the promising role of LLMs in mental health assessment, indicating that a multimodal approach can enhance diagnostic precision. However, the paper emphasizes the need for clinical validation, continuous model adaptation, and attention to the ethical implications of AI-driven diagnostics, and it positions the integration of multiple data modalities as key to improving diagnostic tools in clinical settings.