ChildAugment: Data Augmentation Methods for Zero-Resource Children's Speaker Verification (2402.15214v1)

Published 23 Feb 2024 in eess.AS and cs.SD

Abstract: The accuracy of modern automatic speaker verification (ASV) systems, when trained exclusively on adult data, drops substantially when applied to children's speech. The scarcity of children's speech corpora hinders fine-tuning ASV systems for children's speech. Hence, there is a timely need to explore more effective ways of reusing adults' speech data. One promising approach is to align vocal-tract parameters between adults and children through children-specific data augmentation, referred to here as ChildAugment. Specifically, we modify the formant frequencies and formant bandwidths of adult speech to emulate children's speech. The modified spectra are used to train an ECAPA-TDNN (emphasized channel attention, propagation, and aggregation in time-delay neural network) recognizer for children. We compare ChildAugment against various state-of-the-art data augmentation techniques for children's ASV, and we extensively compare different scoring methods, including cosine scoring, PLDA (probabilistic linear discriminant analysis), and NPLDA (neural PLDA). We also propose a low-complexity weighted cosine score for extremely low-resource children's ASV. Our findings on the CSLU Kids corpus indicate that ChildAugment holds promise as a simple, acoustics-motivated approach for improving state-of-the-art deep-learning-based ASV for children. We achieve up to 12.45% (boys) and 11.96% (girls) relative improvement over the baseline.
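The abstract does not spell out the exact modification recipe, so the following is a minimal, illustrative Python sketch of the general idea: manipulate the LPC poles of adult speech, raising pole angles (which track formant frequencies) and pulling poles toward the origin (which widens formant bandwidths), then resynthesize. The function name, the whole-utterance (rather than frame-wise) processing, and the parameter values are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of formant-based augmentation in the spirit of ChildAugment.
import numpy as np
import librosa
from scipy.signal import lfilter

def childaugment_sketch(y, order=16, freq_scale=1.2, bw_widen=1.1):
    """Raise formant frequencies by `freq_scale` and widen formant
    bandwidths by `bw_widen` via LPC pole manipulation (illustrative)."""
    # All-pole analysis: a[0] == 1 and A(z) is the prediction-error filter.
    a = librosa.lpc(y, order=order)
    residual = lfilter(a, [1.0], y)  # inverse-filter to get the excitation

    new_poles = []
    for p in np.roots(a):
        r, theta = np.abs(p), np.angle(p)
        if abs(p.imag) > 1e-8:  # modify complex (formant) poles only
            # Pole angle ~ formant frequency: scale it up, avoiding wrap-around.
            theta = np.sign(theta) * min(abs(theta) * freq_scale, np.pi - 1e-3)
            # Bandwidth B is proportional to -ln(r): shrinking r widens the formant.
            r = r ** bw_widen
        new_poles.append(r * np.exp(1j * theta))

    a_new = np.poly(new_poles).real  # conjugate symmetry keeps coefficients real
    return lfilter([1.0], a_new, residual)  # resynthesize with modified envelope
```

For example, `childaugment_sketch(y, freq_scale=1.25)` on 16 kHz speech loaded with `librosa.load` would shift the spectral envelope upward; a frame-wise overlap-add version would be closer to practical use.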

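The weighted cosine score is likewise only named in the abstract. Below is a minimal sketch of one plausible form: per-dimension weights applied inside a cosine similarity, with weights estimated as inverse within-speaker variance on a small development set. Both the weighting scheme and the function names are illustrative assumptions, not necessarily the paper's estimator.

```python
# Hedged sketch of a per-dimension weighted cosine score for ASV embeddings.
import numpy as np

def weighted_cosine(e1, e2, w):
    """Cosine similarity after scaling each embedding dimension by sqrt(w),
    so the inner product becomes sum_i w_i * e1_i * e2_i before normalization."""
    e1w, e2w = e1 * np.sqrt(w), e2 * np.sqrt(w)
    return float(e1w @ e2w / (np.linalg.norm(e1w) * np.linalg.norm(e2w)))

def estimate_weights(dev_embeddings, dev_labels, eps=1e-6):
    """Illustrative weight choice: inverse of the average within-speaker
    variance per dimension, computed on a small children's dev set."""
    labels = np.asarray(dev_labels)
    within_var = np.mean(
        [np.var(dev_embeddings[labels == s], axis=0) for s in np.unique(labels)],
        axis=0,
    )
    return 1.0 / (within_var + eps)
```

Down-weighting high-variance dimensions is one simple, training-free way to adapt cosine scoring when too little children's data is available to fit PLDA or NPLDA, which matches the low-complexity motivation stated in the abstract.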
Authors (3)
  1. Vishwanath Pratap Singh (12 papers)
  2. Tomi Kinnunen (76 papers)
  3. Md Sahidullah (78 papers)
Citations (1)
