End-to-End Real Time Tracking of Children's Reading with Pointer Network
Abstract: In this work, we explore how a real-time reading tracker can be built efficiently for children's voices. While previously proposed reading trackers focused on ASR-based cascaded approaches, we propose a fully end-to-end model, which makes it less prone to lag in voice tracking. We employ a pointer network that directly learns to predict positions in the ground truth text conditioned on the streaming speech. To train this pointer network, we generate ground truth training signals by running forced alignment between the read speech and the text being read on the training set. Exploring different forced alignment models, we find that a neural attention-based model is at least comparable in alignment accuracy to the Montreal Forced Aligner and, surprisingly, provides a better training signal for the pointer network. Our results are reported on one adult speech dataset (TIMIT) and two children's speech datasets (CMU Kids and Reading Races). Our best model tracks adult speech with 87.8% accuracy, and the much harder, disfluent children's speech with 77.1% accuracy on CMU Kids and 65.3% accuracy on Reading Races.
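The training setup described in the abstract, where forced alignment supplies the positions the pointer network learns to predict, can be illustrated with a minimal sketch. It assumes that the aligner yields one start time per word, that the target for each audio frame is the index of the most recently started word, and that tracking accuracy is the fraction of frames with an exact position match. None of these details are stated in the abstract, and the names `frame_pointer_targets`, `tracking_accuracy`, and `hop_s` are hypothetical.

```python
from bisect import bisect_right


def frame_pointer_targets(word_starts, num_frames, hop_s):
    """Turn forced-alignment word start times into one text position per frame.

    word_starts: word start times in seconds, in reading order (assumed sorted).
    num_frames:  number of audio frames in the utterance.
    hop_s:       frame hop in seconds (hypothetical parameter).
    """
    if not word_starts:
        raise ValueError("need at least one aligned word")
    targets = []
    for frame in range(num_frames):
        t = frame * hop_s
        # Number of words whose start time is <= t.
        started = bisect_right(word_starts, t)
        # Before the first word begins, point at word 0 (an assumption).
        targets.append(max(started - 1, 0))
    return targets


def tracking_accuracy(predicted, target):
    """Fraction of frames where the predicted position equals the target."""
    if len(predicted) != len(target) or not target:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(p == t for p, t in zip(predicted, target)) / len(target)


# Three words starting at 0.0 s, 1.0 s and 2.5 s; six frames, 0.5 s apart.
targets = frame_pointer_targets([0.0, 1.0, 2.5], num_frames=6, hop_s=0.5)
print(targets)  # [0, 0, 1, 1, 1, 2]
```

Under these assumptions, a tracker that jumps to the third word one frame early would be scored by calling `tracking_accuracy([0, 0, 1, 1, 2, 2], targets)`, which counts five matching frames out of six.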