
Inappropriate Pause Detection In Dysarthric Speech Using Large-Scale Speech Recognition (2402.18923v1)

Published 29 Feb 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Dysarthria, a common issue among stroke patients, severely impacts speech intelligibility. Inappropriate pauses are crucial indicators in severity assessment and speech-language therapy. We propose to extend a large-scale speech recognition model for inappropriate pause detection in dysarthric speech. To this end, we propose task design, labeling strategy, and a speech recognition model with an inappropriate pause prediction layer. First, we treat pause detection as speech recognition, using an automatic speech recognition (ASR) model to convert speech into text with pause tags. According to the newly designed task, we label pause locations at the text level and their appropriateness. We collaborate with speech-language pathologists to establish labeling criteria, ensuring high-quality annotated data. Finally, we extend the ASR model with an inappropriate pause prediction layer for end-to-end inappropriate pause detection. Moreover, we propose a task-tailored metric for evaluating inappropriate pause detection independent of ASR performance. Our experiments show that the proposed method better detects inappropriate pauses in dysarthric speech than baselines. (Inappropriate Pause Error Rate: 14.47%)


Summary

  • The paper introduces an ASR-based method for detecting inappropriate pauses in dysarthric speech, reaching an Inappropriate Pause Error Rate of 14.47%.
  • The methodology integrates a task-specific layer and expert labeling from speech-language pathologists to improve pause detection precision.
  • Experimental results show robust detection performance across varying dysarthria severities while simultaneously enhancing overall ASR accuracy.

Enhancing Dysarthric Speech Analysis: Inappropriate Pause Detection Using Advanced ASR Models

Introduction to Inappropriate Pause Detection in Dysarthric Speech

Dysarthria, a common consequence of stroke, impairs control of the muscles used for speech and severely reduces speech intelligibility. Inappropriate pauses are an important indicator in severity assessment and speech-language therapy. The paper proposes detecting and assessing them automatically by extending a large-scale speech recognition model with a task-specific layer that predicts whether each pause is inappropriate.

Methodology Overview

Traditional methods typically detect pauses using amplitude thresholds or forced alignment. This paper instead treats pause detection as a speech recognition problem: an automatic speech recognition (ASR) model transcribes speech into text in which pauses appear as distinct tokens (pause tags). Key steps in the approach include:

  • Utilizing an ASR Model for Pause Detection: Speech is fed to the ASR model, which outputs text that includes pause tags, so pause detection becomes part of the speech-to-text conversion itself.
  • Labeling Strategy and Model Architecture: Pause locations and their appropriateness are labeled at the text level, following criteria established together with speech-language pathologists to ensure high-quality annotation. An inappropriate pause prediction layer is added to the ASR model so that inappropriate pauses are detected end to end.
  • Introduction of a Novel Evaluation Metric: A task-tailored metric evaluates inappropriate pause detection independently of ASR performance, so that recognition errors on words do not obscure how well the model handles pauses.
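
The text-level labeling idea can be illustrated with a minimal sketch. The tag names `<pause>` and `<pause_inapp>` and the helper `parse_tagged_transcript` are hypothetical, chosen here for illustration; the source says only that the ASR output carries pause tags and that each pause is labeled for appropriateness.

```python
from typing import List, Tuple

# Hypothetical tag names (an assumption, not taken from the paper).
PAUSE_OK = "<pause>"          # pause judged appropriate
PAUSE_BAD = "<pause_inapp>"   # pause judged inappropriate


def parse_tagged_transcript(text: str) -> Tuple[List[str], List[Tuple[int, bool]]]:
    """Split a tagged transcript into words and pause annotations.

    Returns (words, pauses). Each pause is (index, inappropriate), where
    index is the number of words that precede the pause.
    """
    words: List[str] = []
    pauses: List[Tuple[int, bool]] = []
    for token in text.split():
        if token == PAUSE_OK:
            pauses.append((len(words), False))
        elif token == PAUSE_BAD:
            pauses.append((len(words), True))
        else:
            words.append(token)
    return words, pauses


words, pauses = parse_tagged_transcript(
    "the weather <pause> is very <pause_inapp> nice today"
)
print(words)   # the plain word sequence, tags removed
print(pauses)  # pause positions with their appropriateness label
```

Keeping pauses as ordinary tokens in the transcript means that a single sequence carries both the recognized words and the pause labels, which is what lets one model produce them together.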

Experimental Insights

The experiments favor incorporating pause detection directly into the ASR model and highlight three outcomes:

  • Performance Superiority: The proposed method detects inappropriate pauses in dysarthric speech better than the baseline methods, reaching an Inappropriate Pause Error Rate of 14.47%.
  • Severity Robustness: Detection performance remains consistent across levels of dysarthria severity, which matters for clinical use, where diagnostics and feedback must cover the full severity spectrum.
  • ASR Performance Improvement: Incorporating pause detection into the ASR model also improves overall ASR performance, suggesting that modeling a specific characteristic of dysarthric speech, such as inappropriate pauses, benefits the broader recognition task as well.
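
To make the idea of a metric that is independent of ASR performance concrete, here is a rough sketch. It is not the paper's definition of the Inappropriate Pause Error Rate, which this summary does not give. As a simplifying assumption, it discards all words, keeps only the pause tags, and divides the edit distance between the reference and hypothesis tag sequences by the number of reference pauses. The tag names and the function `pause_only_error_rate` are hypothetical.

```python
from typing import List, Set


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pause_only_error_rate(ref: List[str], hyp: List[str], tags: Set[str]) -> float:
    """Error rate over pause tags only; word tokens are ignored entirely."""
    ref_pauses = [t for t in ref if t in tags]
    hyp_pauses = [t for t in hyp if t in tags]
    if not ref_pauses:
        raise ValueError("reference contains no pause tags")
    return edit_distance(ref_pauses, hyp_pauses) / len(ref_pauses)


# Hypothetical tag names (an assumption, not taken from the paper).
TAGS = {"<pause>", "<pause_inapp>"}
ref = "the weather <pause> is very <pause_inapp> nice today".split()
hyp = "the feather <pause> is very <pause> nice today".split()
rate = pause_only_error_rate(ref, hyp, TAGS)
print(rate)
```

In this example the misrecognized word ("feather") does not affect the score; only the mislabeled second pause does. A real metric would likely also account for where each pause falls relative to the words, which this simplification ignores.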

Theoretical and Practical Implications

From a theoretical standpoint, this paper proposes an innovative approach to understanding and analyzing dysarthric speech, spotlighting the integration of pause detection within the ASR framework rather than treating it as a separate or subsequent analysis phase. Practically, it provides a scalable and efficient methodology for enhancing speech-language therapy for dysarthric speakers, with the potential for application across different languages and dialects.

Future Directions in AI and Speech Language Pathology

Looking ahead, extending and refining the architecture to accommodate various decoding strategies beyond the specific models tested, such as Whisper, could broaden the applicability of this method. Closer collaboration between AI research and speech-language pathology could also yield more nuanced and effective tools for diagnosing and treating speech disorders, improving therapeutic outcomes for individuals with dysarthria.

In conclusion, the presented paper offers a substantial leap towards integrating automatic speech recognition technologies with speech disorder therapy, enhancing our capability to detect and assess inappropriate pauses in dysarthric speech efficiently. This advancement stands to significantly bolster the toolkit available for speech-language pathologists, offering a data-driven approach to therapy that is both precise and tailored to the individual needs of patients across the severity spectrum of dysarthria.