Unconstrained Dysfluency Modeling for Dysfluent Speech Transcription and Detection (2312.12810v1)

Published 20 Dec 2023 in eess.AS and cs.SD

Abstract: Dysfluent speech modeling requires time-accurate and silence-aware transcription at both the word and phonetic levels. However, current research in dysfluency modeling focuses primarily on either transcription or detection, and performance on each task remains limited. In this work, we present an unconstrained dysfluency modeling (UDM) approach that addresses both transcription and detection in an automatic and hierarchical manner. By providing a comprehensive solution, UDM eliminates the need for extensive manual annotation. Furthermore, we introduce VCTK++, a simulated dysfluent dataset, to enhance UDM's phonetic transcription capabilities. Our experimental results demonstrate the effectiveness and robustness of the proposed methods on both transcription and detection tasks.
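The abstract's mention of a simulated dysfluent dataset (VCTK++) suggests waveform-level edits such as repeating or prolonging aligned phone segments. Below is a minimal, hypothetical sketch of that idea; the function names, segment boundaries, and naive linear resampling are illustrative assumptions, not the paper's actual pipeline (a real prolongation would more plausibly use a pitch-preserving time-scale modification method such as WSOLA).

```python
# Hypothetical sketch: inject simulated dysfluencies into a clean waveform.
# Assumes 1-D float audio and sample-index phone boundaries from an aligner.
import numpy as np

def repeat_segment(wav: np.ndarray, start: int, end: int, times: int = 2) -> np.ndarray:
    """Simulate a sound repetition by duplicating wav[start:end] `times` times."""
    segment = wav[start:end]
    return np.concatenate([wav[:end]] + [segment] * (times - 1) + [wav[end:]])

def prolong_segment(wav: np.ndarray, start: int, end: int, rate: float = 0.5) -> np.ndarray:
    """Simulate a prolongation by stretching wav[start:end]; rate < 1 slows it down.
    Naive linear resampling (shifts pitch); a WSOLA-style method would not."""
    segment = wav[start:end]
    n_out = int(round(len(segment) / rate))
    positions = np.linspace(0, len(segment) - 1, n_out)  # fractional input indices
    stretched = np.interp(positions, np.arange(len(segment)), segment)
    return np.concatenate([wav[:start], stretched, wav[end:]])

# Example: triple a phone spanning samples 1600-3200 of a 16 kHz waveform.
wav = np.random.randn(16000).astype(np.float32)  # stand-in for real speech
dysfluent = repeat_segment(wav, 1600, 3200, times=3)
```

In practice the segment boundaries would come from a forced alignment rather than being chosen by hand, and each injected edit would be recorded as a ground-truth dysfluency label for training the detector.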
