
An Analysis of Personalized Speech Recognition System Development for the Deaf and Hard-of-Hearing (2306.13953v1)

Published 24 Jun 2023 in cs.SD and eess.AS

Abstract: Deaf or hard-of-hearing (DHH) speakers typically have atypical speech caused by deafness. With the growing support of speech-based devices and software applications, more work needs to be done to make these devices inclusive of everyone. To do so, we analyze the use of openly available automatic speech recognition (ASR) tools on a DHH Japanese speaker dataset. As these out-of-the-box ASR models typically do not perform well on DHH speech, we provide a thorough analysis of creating personalized ASR systems. We collected a large DHH speaker dataset of four speakers totaling around 28.05 hours and thoroughly analyzed the performance of different training frameworks by varying the training data size. Our findings show that 1000 utterances (or 1-2 hours) from a target speaker can already significantly improve model performance with a minimal amount of work, so we recommend that researchers collect at least 1000 utterances to build an efficient personalized ASR system. In cases where 1000 utterances are difficult to collect, we also observe significant improvements from previously proposed techniques such as intermediate fine-tuning when only 200 utterances are available.
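To make the personalization idea concrete, the sketch below shows a minimal speaker-specific fine-tuning loop for a pretrained CTC ASR model. This is an illustrative assumption, not the paper's exact recipe: the checkpoint name, the `speaker_corpus` variable, and the hyperparameters are placeholders, and the paper targets Japanese DHH speech with its own model stack.

```python
# Minimal sketch of speaker-level ASR personalization (illustrative only).
# Assumptions: a Hugging Face CTC model, 16 kHz audio, and a small
# single-speaker corpus (~1000 utterances per the paper's recommendation).
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_NAME = "facebook/wav2vec2-base-960h"  # placeholder checkpoint; pick one
                                            # matching the target language

processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME)
model.freeze_feature_encoder()  # freezing the CNN front end is common with small corpora
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Hypothetical corpus: fill with (waveform, transcript) pairs for the target
# speaker, where waveform is a 1-D float array sampled at 16 kHz.
speaker_corpus = []

for epoch in range(30):  # small corpora typically need many epochs
    for waveform, transcript in speaker_corpus:
        inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
        labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
        loss = model(input_values=inputs.input_values, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

In the low-resource setting the abstract mentions (around 200 utterances), intermediate fine-tuning amounts to running a loop like this twice: first on a larger related corpus, then on the target speaker's data.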
