Acoustic modeling for Overlapping Speech Recognition: JHU Chime-5 Challenge System (2405.11078v1)
Published 17 May 2024 in eess.AS
Abstract: This paper summarizes our acoustic modeling efforts in the Johns Hopkins University speech recognition system for the CHiME-5 challenge to recognize highly overlapped dinner party speech recorded by multiple microphone arrays. We explore data augmentation approaches, neural network architectures, front-end speech dereverberation, beamforming and robust i-vector extraction, with comparisons of our in-house implementations and publicly available tools. We finally achieved a word error rate of 69.4% on the development set, which is an 11.7% absolute improvement over the previous baseline of 81.1%, and release this improved baseline with refined techniques/tools as an advanced CHiME-5 recipe.
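The abstract mentions beamforming over multiple microphone arrays as one front-end component. As a rough illustration only, the sketch below shows a basic delay-and-sum beamformer with GCC-PHAT time-delay estimation, a common baseline-style approach for array front ends; it is not the authors' implementation, and the function names (`gcc_phat_delay`, `delay_and_sum`) and parameter choices are ours.

```python
import numpy as np

def gcc_phat_delay(ref, sig, max_delay):
    """Estimate the integer-sample delay of `sig` relative to `ref`
    using GCC-PHAT (phase-transform-weighted cross-correlation)."""
    n = len(ref) + len(sig)
    R = np.fft.rfft(ref, n=n) * np.conj(np.fft.rfft(sig, n=n))
    R /= np.abs(R) + 1e-12                       # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    cc = np.concatenate((cc[-max_delay:], cc[:max_delay + 1]))
    return int(np.argmax(np.abs(cc))) - max_delay  # lag in samples

def delay_and_sum(channels, max_delay=160):
    """Time-align every channel to channel 0 and average them.
    `channels`: array-like of equal-length 1-D signals, one per microphone."""
    ref = np.asarray(channels[0], dtype=np.float64)
    out = np.zeros_like(ref)
    for ch in channels:
        ch = np.asarray(ch, dtype=np.float64)
        d = gcc_phat_delay(ref, ch, max_delay)
        out += np.roll(ch, d)                    # crude integer-sample alignment
    return out / len(channels)

# Usage sketch: x has shape (num_mics, num_samples), e.g. one 4-mic array channel set
# y = delay_and_sum(x)
```

The real system additionally applies WPE dereverberation before beamforming and uses more elaborate weighting than plain averaging; this sketch only conveys the basic align-and-average idea.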