
Towards Hierarchical Spoken Language Dysfluency Modeling (2401.10015v2)

Published 18 Jan 2024 in cs.CL and eess.AS

Abstract: Speech disfluency modeling is the bottleneck for both speech therapy and language learning. However, there is no effective AI solution to systematically tackle this problem. We solidify the concept of disfluent speech and disfluent speech modeling. We then present the Hierarchical Unconstrained Dysfluency Modeling (H-UDM) approach, a hierarchical extension of UDM that addresses both disfluency transcription and detection to eliminate the need for extensive manual annotation. Our experimental findings serve as clear evidence of the effectiveness and reliability of the methods we have introduced, encompassing both transcription and detection tasks.
