- The paper introduces a dual-feature fusion model combining self-supervised speech representations and articulatory features to assess schizophrenia severity.
- The methodology employs a multi-head attention mechanism and an autoencoder framework, achieving reductions of roughly 9% in MAE and RMSE relative to prior methods that combined speech and video.
- The results underscore the potential of advanced speech analysis in improving mental health diagnostics and informing clinical practices.
An Overview of Speech-Based Estimation of Schizophrenia Severity Using Feature Fusion
The paper "Speech-Based Estimation of Schizophrenia Severity Using Feature Fusion" presents a sophisticated approach to assess schizophrenia severity through audio-based features. This paper introduces a dual-factor model that integrates articulatory features with self-supervised speech representations obtained from advanced pre-trained audio models, operationalized through a deep learning framework enhanced by feature fusion. The proposed methodology addresses a significant gap in mental health diagnostics by leveraging multimodal audio data, showcasing a notable advancement in the analysis of speech biomarkers for schizophrenia.
Methodology and Model Architecture
The core of the research is a feature fusion model that uses multi-head attention (MHA) to blend self-supervised speech representations with articulatory features. The proposed model surpasses previous techniques, reducing Mean Absolute Error (MAE) by 9.18% and Root Mean Squared Error (RMSE) by 9.36% relative to methods that combined speech and video inputs.
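To make the fusion step concrete, below is a minimal sketch (not the authors' code) of how multi-head attention can combine a self-supervised feature stream with an articulatory feature stream before regressing a severity score. It assumes PyTorch; the dimensions, projection layers, and mean pooling are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Illustrative cross-attention fusion of two speech feature streams."""
    def __init__(self, ssl_dim=768, art_dim=64, fused_dim=256, num_heads=4):
        super().__init__()
        # Project both streams into a shared space before attending.
        self.ssl_proj = nn.Linear(ssl_dim, fused_dim)
        self.art_proj = nn.Linear(art_dim, fused_dim)
        # Self-supervised frames (queries) attend to articulatory frames (keys/values).
        self.mha = nn.MultiheadAttention(fused_dim, num_heads, batch_first=True)
        self.regressor = nn.Sequential(nn.Linear(fused_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, ssl_feats, art_feats):
        # ssl_feats: (batch, T_ssl, ssl_dim); art_feats: (batch, T_art, art_dim)
        q = self.ssl_proj(ssl_feats)
        kv = self.art_proj(art_feats)
        fused, _ = self.mha(q, kv, kv)              # (batch, T_ssl, fused_dim)
        pooled = fused.mean(dim=1)                  # average over time
        return self.regressor(pooled).squeeze(-1)   # predicted severity score

model = AttentionFusion()
score = model(torch.randn(2, 100, 768), torch.randn(2, 100, 64))
```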
An autoencoder-based framework extracts compact articulatory embeddings from speech. The paper builds on self-supervised learning models such as Wav2Vec2.0 and WavLM, which provide generalized speech representations learned from large speech corpora and improve robustness across recording conditions and speakers. The architecture uses two CNN branches, one for the self-supervised representations and one for the articulatory representations, and fuses their outputs with multi-head attention to predict a severity score.
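The sketch below illustrates the autoencoder idea for obtaining compact articulatory embeddings: frame-level articulatory features are compressed through a bottleneck and reconstructed, and the bottleneck activations serve as the embedding. Layer sizes and the input dimensionality are assumptions for illustration, not the paper's reported settings.

```python
import torch
import torch.nn as nn

class ArticulatoryAutoencoder(nn.Module):
    """Toy autoencoder producing compact embeddings of articulatory frames."""
    def __init__(self, feat_dim=24, bottleneck=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 16), nn.ReLU(), nn.Linear(16, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 16), nn.ReLU(), nn.Linear(16, feat_dim))

    def forward(self, x):
        z = self.encoder(x)          # compact articulatory embedding
        return self.decoder(z), z    # reconstruction plus embedding

model = ArticulatoryAutoencoder()
frames = torch.randn(32, 24)                    # a batch of articulatory feature frames
recon, embedding = model(frames)
loss = nn.functional.mse_loss(recon, frames)    # reconstruction objective used for training
```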
Dataset and Evaluation
The dataset used in this paper originated from an interdisciplinary project and includes audio recordings from individuals with schizophrenia, individuals with depression, and healthy controls. The recordings were analyzed for vocal characteristics related to mental state, with symptom severity rated on the Brief Psychiatric Rating Scale (BPRS). Through data preprocessing and feature extraction, the authors separated speaker segments and obtained self-supervised and articulatory features for analysis.
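As a hedged illustration of the self-supervised feature extraction step, the snippet below pulls frame-level representations from a pre-trained Wav2Vec2 model via Hugging Face Transformers and mean-pools them into one vector per speech segment. The checkpoint name and pooling choice are assumptions for demonstration, not details taken from the paper.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

waveform = torch.randn(16000)  # placeholder: one second of 16 kHz audio from a speaker segment
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, num_frames, 768) frame-level features
segment_embedding = hidden.mean(dim=1)           # one pooled vector per segment
```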
The models were trained and evaluated using a split into training, validation, and test sets. The feature fusion model outperformed both unimodal approaches and prior multimodal methods that incorporated video inputs. This advantage is evident across evaluation metrics, including MAE, RMSE, and Spearman's rank correlation coefficient, confirming the feature fusion model's strength on schizophrenia severity estimation.
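For clarity, the reported metrics can be computed as in the short sketch below; the function and the toy scores are illustrative only.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(y_true, y_pred):
    """MAE, RMSE, and Spearman's rank correlation between true and predicted severity."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae = np.mean(np.abs(y_true - y_pred))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    rho, _ = spearmanr(y_true, y_pred)
    return {"MAE": mae, "RMSE": rmse, "Spearman": rho}

print(evaluate([30, 42, 55], [33, 40, 52]))   # toy BPRS-style severity scores
```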
Implications and Future Directions
This paper's approach to integrating articulatory and self-supervised features suggests significant potential for enhancing the assessment of mental health disorders. The improvements in accuracy and consistency of severity estimation underscore the effectiveness of the proposed fusion strategy. Moving forward, the application of similar fusion techniques to other mental health diagnostics could be transformative.
The research opens pathways for further exploration of multimodal fusion, possibly integrating textual or additional physiological data to build more comprehensive models. Future iterations might also explore real-time applications of such models in clinical settings, providing valuable tools for mental health practitioners. Overall, this paper adds a critical piece to the ongoing development of AI-driven diagnostic tools, emphasizing the role of speech-based assessment in mental health evaluation.