
Acoustic modeling for Overlapping Speech Recognition: JHU Chime-5 Challenge System (2405.11078v1)

Published 17 May 2024 in eess.AS

Abstract: This paper summarizes our acoustic modeling efforts in the Johns Hopkins University speech recognition system for the CHiME-5 challenge to recognize highly overlapped dinner party speech recorded by multiple microphone arrays. We explore data augmentation approaches, neural network architectures, front-end speech dereverberation, beamforming and robust i-vector extraction, comparing our in-house implementations against publicly available tools. We finally achieved a word error rate of 69.4% on the development set, an 11.7% absolute improvement over the previous baseline of 81.1%, and release this improved baseline with refined techniques/tools as an advanced CHiME-5 recipe.
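Among the front-end techniques the abstract mentions is beamforming across the challenge's multiple microphone arrays. As a minimal illustrative sketch only (not the authors' actual implementation, and with delays given directly rather than estimated, e.g. via GCC-PHAT, as a real system would), here is a delay-and-sum beamformer in NumPy:

```python
import numpy as np

def delay_and_sum(channels, delays, fs):
    """Minimal delay-and-sum beamformer (illustrative sketch).

    channels: (n_mics, n_samples) array of time-domain signals
    delays:   per-microphone delays in seconds (assumed known here;
              in practice estimated from cross-correlation)
    fs:       sampling rate in Hz
    """
    n_mics, n_samples = channels.shape
    out = np.zeros(n_samples)
    for sig, d in zip(channels, delays):
        shift = int(round(d * fs))       # integer-sample alignment
        out += np.roll(sig, -shift)      # undo the propagation delay
    return out / n_mics                  # average the aligned channels

# Toy example: the same sine wave arrives at each mic with a different delay.
fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)
delays = [0.0, 0.001, 0.002]
mics = np.stack([np.roll(clean, int(d * fs)) for d in delays])
enhanced = delay_and_sum(mics, delays, fs)
```

After alignment, coherent signal components add constructively while uncorrelated noise averages down; production systems instead use adaptive beamformers (e.g. MVDR or the BeamformIt weighted delay-and-sum tool) that also estimate the steering delays.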
