Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments (2407.16840v1)
Abstract: One of the challenges in developing a high-quality custom keyword spotting (KWS) model is the lengthy and expensive process of collecting training data that covers a wide range of languages, phrases, and speaking styles. We introduce Synth4Kws, a framework for leveraging Text-to-Speech (TTS) synthesized data for custom KWS under different resource settings. With no real data, we found that increasing TTS phrase diversity and utterance sampling monotonically improves model performance, as evaluated by EER and AUC metrics over 11k utterances of the Speech Commands dataset. In low-resource settings, with 50k real utterances as a baseline, we found that using an optimal amount of TTS data can improve EER by 30.1% and AUC by 46.7%. Furthermore, we mix TTS data with varying amounts of real data and interpolate the amount of real data needed to reach various quality targets. Our experiments are based on English, single-word utterances, but the findings generalize to i18n languages and other keyword types.
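The abstract reports keyword-spotting quality in terms of EER and AUC computed over detection scores. The snippet below is a minimal sketch, not taken from the paper, of how these two metrics are commonly computed from binary labels and scores with scikit-learn; the function name `eer_and_auc` and the synthetic example data are illustrative assumptions.

```python
# Sketch: compute EER and AUC for a keyword detector.
# Assumes binary labels (1 = keyword present, 0 = absent) and
# scores where higher means "more likely keyword".
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def eer_and_auc(labels, scores):
    """Return (EER, AUC) for a set of detection scores."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    # EER is the operating point where false-accept and false-reject rates cross.
    idx = np.nanargmin(np.abs(fnr - fpr))
    eer = (fpr[idx] + fnr[idx]) / 2.0
    auc = roc_auc_score(labels, scores)
    return eer, auc

# Illustrative example with synthetic scores (not the paper's data).
rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
scores = np.concatenate([rng.normal(1.0, 1.0, 1000), rng.normal(-1.0, 1.0, 1000)])
print(eer_and_auc(labels, scores))
```

Sweeping such a metric over models trained with different TTS/real data mixes is one way to reproduce the kind of quality-versus-data-amount comparison the abstract describes.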