Analysis of Self-Supervised Speech Models on Children's Speech and Infant Vocalizations (2402.06888v2)
Abstract: To understand why self-supervised learning (SSL) models have empirically achieved strong performance on several downstream speech-processing tasks, numerous studies have analyzed the information encoded in SSL layer representations of adult speech. Far less work has investigated how pre-training and fine-tuning affect the way SSL models encode children's speech and vocalizations. In this study, we aim to bridge this gap by probing SSL models on two relevant downstream tasks: (1) phoneme recognition (PR) on the speech of adults, older children (8-10 years old), and younger children (1-4 years old), and (2) vocalization classification (VC) distinguishing cry, fuss, and babble for infants under 14 months old. For younger children's PR, we find that the superiority of fine-tuned SSL models is largely due to their ability to learn features that represent older children's speech and then adapt those features to the speech of younger children. For infant VC, SSL models pre-trained on large-scale home recordings learn to leverage phonetic representations in their middle layers, thereby improving performance on this task.
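To make the layer-wise probing idea concrete, below is a minimal sketch of how one might extract per-layer representations from a pre-trained SSL model and fit a simple linear probe for cry/fuss/babble classification. The checkpoint name (`facebook/wav2vec2-base`), mean pooling over frames, and the logistic-regression probe are illustrative assumptions, not the paper's exact experimental setup.

```python
# A minimal layer-wise probing sketch (illustrative; not the paper's exact pipeline).
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "facebook/wav2vec2-base"  # assumed checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
model = Wav2Vec2Model.from_pretrained(MODEL_NAME).eval()

def layer_embeddings(waveform_16k: np.ndarray) -> list[np.ndarray]:
    """Return one mean-pooled embedding per transformer layer for a 16 kHz clip."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: (num_layers + 1) tensors of shape (1, frames, dim)
    return [h.squeeze(0).mean(dim=0).numpy() for h in out.hidden_states]

def probe_each_layer(clips):
    """clips: list of (waveform, label) pairs with labels in {'cry','fuss','babble'}
    (hypothetical data). Returns a rough per-layer linear-probe accuracy."""
    per_layer, labels = None, [y for _, y in clips]
    for wav, _ in clips:
        embs = layer_embeddings(wav)
        if per_layer is None:
            per_layer = [[] for _ in embs]
        for i, e in enumerate(embs):
            per_layer[i].append(e)
    # Fit one linear probe per layer; training accuracy is only a crude proxy
    # for how linearly separable the classes are in that layer's representation.
    return [
        LogisticRegression(max_iter=1000).fit(X, labels).score(X, labels)
        for X in map(np.array, per_layer)
    ]
```

A curve of probe accuracy across layers is one way to see which depths carry the most task-relevant information, e.g. whether middle layers are most useful for infant vocalization classification.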
Authors: Jialu Li, Mark Hasegawa-Johnson, Nancy L. McElwain