
CAtCh: Cognitive Assessment through Cookie Thief

Published 7 Jun 2025 in cs.LG, cs.AI, cs.SD, and eess.AS (arXiv:2506.06603v1)

Abstract: Several machine learning algorithms have been developed for the prediction of Alzheimer's disease and related dementia (ADRD) from spontaneous speech. However, none of these algorithms have been translated for the prediction of broader cognitive impairment (CI), which in some cases is a precursor and risk factor of ADRD. In this paper, we evaluated several speech-based open-source methods originally proposed for the prediction of ADRD, as well as methods from multimodal sentiment analysis for the task of predicting CI from patient audio recordings. Results demonstrated that multimodal methods outperformed unimodal ones for CI prediction, and that acoustics-based approaches performed better than linguistics-based ones. Specifically, interpretable acoustic features relating to affect and prosody were found to significantly outperform BERT-based linguistic features and interpretable linguistic features, respectively. All the code developed for this study is available at https://github.com/JTColonel/catch.

Summary

An Evaluation of Speech-Based Methods for Cognitive Impairment Prediction

The paper "CAtCh: Cognitive Assessment through Cookie Thief" explores machine learning (ML) approaches for predicting cognitive impairment (CI) using speech data derived from the Cookie Thief Test (CTT). This evaluation focuses on the adaptation of algorithms initially designed for Alzheimer's disease and related dementias (ADRD) to the broader prediction of CI, and investigates whether multimodal approaches leveraging speech and linguistic features provide effective predictive capabilities.

Methodological Overview

The authors systematically evaluate and contrast several open-source ML methods originating from ADRD prediction, alongside approaches from multimodal sentiment analysis (MSA), in predicting CI. These methods include:

  1. Heitz et al.: Employs automatic speech recognition (ASR) for deriving linguistic features, followed by a random forest classifier.
  2. Chen et al.: Utilizes pre-trained HuBERT models for acoustic feature extraction, adopting a supervised fine-tuning approach.
  3. Ying et al.: Incorporates Wav2Vec2.0 (W2V2) and BERT for unimodal embeddings, supplemented by IS10 acoustic features for a multimodal setup.
  4. Farzana et al.: Leverages both linguistic and prosodic features in a multimodal framework, employing an SVM for classification (a minimal fusion sketch in this spirit follows the list).
  5. MISA and MFN: MSA architectures adapted to CI prediction, employing text and audio modalities without video.
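
To make the multimodal setups concrete, here is a minimal feature-level fusion sketch in the spirit of Farzana et al.'s linguistic-plus-prosodic SVM pipeline. The random feature matrices are placeholders for illustration only; in the actual pipelines these would come from an acoustic toolbox (e.g., DisVoice prosody or IS10) and a text encoder, and the dimensions shown are assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder feature matrices standing in for real extracted features.
rng = np.random.default_rng(0)
n_samples = 120
acoustic = rng.normal(size=(n_samples, 38))    # e.g., prosodic descriptors
linguistic = rng.normal(size=(n_samples, 64))  # e.g., pooled text features
y = rng.integers(0, 2, size=n_samples)         # 1 = cognitively impaired

# Feature-level fusion: concatenate modalities, then classify with an SVM.
fused = np.concatenate([acoustic, linguistic], axis=1)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
clf.fit(fused, y)
scores = clf.predict_proba(fused)[:, 1]  # per-recording CI probability
```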

To assess these methods, the authors apply a robust evaluation protocol comprising 100 stratified train-test splits, with performance measured on each split via AUC and Fmax (the maximum F1 score over decision thresholds).
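
The protocol can be sketched as follows. The synthetic data, classifier choice, and 80/20 split ratio are assumptions for illustration; the structure of 100 stratified splits scored by AUC and Fmax follows the paper's description.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

def fmax_score(y_true, y_score):
    """Fmax: the maximum F1 score over all decision thresholds."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return float(np.max(f1))

# Synthetic stand-in for the real feature matrix and CI labels.
X, y = make_classification(n_samples=200, n_features=40, random_state=0)

splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
aucs, fmaxes = [], []
for train_idx, test_idx in splitter.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))
    fmaxes.append(fmax_score(y[test_idx], scores))

print(f"AUC  {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
print(f"Fmax {np.mean(fmaxes):.3f} +/- {np.std(fmaxes):.3f}")
```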

Key Findings

Performance Highlights

  • Superiority of Multimodal Methods: Consistent with expectations, multimodal methods generally outperform unimodal ones, although a notable exception is Farzana et al.'s method, where an acoustics-only configuration proved superior.
  • Superiority of Acoustic Features: Acoustic features, especially interpretable ones such as prosody descriptors from the DisVoice toolbox and IS10 features, outperformed linguistic features in several configurations. This suggests that acoustic signatures may generalize better than linguistic structures for detecting CI across diverse populations (a minimal extraction sketch follows).
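
As a concrete example of the kind of interpretable acoustic features involved, the sketch below extracts utterance-level prosody descriptors with the DisVoice toolbox named above. The audio path is a placeholder, the call follows the pip-installable `disvoice` package's documented interface, and whether the study invoked the toolbox exactly this way is an assumption.

```python
# Minimal sketch assuming the documented interface of the `disvoice`
# package; "patient_recording.wav" is a placeholder file path.
from disvoice.prosody import Prosody

prosody = Prosody()
# static=True yields one utterance-level vector of interpretable
# descriptors (F0 statistics, energy, voiced/unvoiced durations, etc.).
features = prosody.extract_features_file(
    "patient_recording.wav", static=True, plots=False, fmt="npy"
)
print(features.shape)  # one row of prosodic features per recording
```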

Specific Method Insights

  • MISA and MFN: These MSA architectures, while showing potential, did not outperform Ying et al.'s configuration with statistical significance. The study suggests future improvements may come from larger datasets and the inclusion of a third modality, potentially video.
  • Ying et al.: Demonstrated that keeping the upstream pretrained models frozen, rather than fine-tuning them, yields slightly better outcomes by mitigating overfitting, a vital insight given the limited dataset size (see the sketch below).
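
The frozen-encoder setup can be sketched as follows with the HuggingFace transformers Wav2Vec2 interface. The checkpoint name, mean pooling, and random waveform are illustrative assumptions, not details taken from the paper.

```python
# Sketch of frozen-encoder feature extraction, assuming the HuggingFace
# transformers Wav2Vec2 API; checkpoint and pooling are illustrative.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
encoder.eval()
for p in encoder.parameters():     # freeze the upstream model
    p.requires_grad = False

waveform = torch.randn(16000 * 5)  # placeholder: 5 s of 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, frames, 768)
embedding = hidden.mean(dim=1)     # utterance-level embedding

# `embedding` would then feed a small trainable classifier head,
# leaving the pretrained encoder untouched during training.
```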

Broader Implications

The findings suggest strong potential for automated, non-invasive CI screening in clinical settings. The demonstrated superiority of acoustic features over linguistic ones in CI prediction argues for prioritizing them in future models, especially across multilingual cohorts. Furthermore, the paper highlights the prospect of improving MSA models by incorporating additional modalities and tailoring training regimes to limited data.

Conclusion and Future Work

The study paves the way for refining speech-based CI prediction, emphasizing the promise of interpretable acoustic features and the pitfalls of fine-tuning large models on limited data. Future research could expand datasets, explore additional modalities, and develop more sophisticated multimodal fusion techniques. The repository released by the authors is a useful resource for the community, facilitating continued progress in this domain.
