Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems (2206.11596v1)
Abstract: Fundamental modelling differences between hybrid and end-to-end (E2E) automatic speech recognition (ASR) systems create large diversity and complementarity between them. This paper investigates multi-pass rescoring and cross-adaptation based system combination approaches for hybrid TDNN and Conformer E2E ASR systems. In multi-pass rescoring, a state-of-the-art hybrid LF-MMI trained CNN-TDNN system featuring speed perturbation, SpecAugment and Bayesian learning hidden unit contributions (LHUC) speaker adaptation was used to produce initial N-best outputs, which were then rescored by the speaker-adapted Conformer system using a two-way cross-system score interpolation. In cross adaptation, the hybrid CNN-TDNN system was adapted to the 1-best output of the Conformer system, or vice versa. Experiments on the 300-hour Switchboard corpus suggest that the combined systems derived using either of the two system combination approaches outperformed the individual systems. The best combined system, obtained using multi-pass rescoring, produced statistically significant word error rate (WER) reductions of 2.5% to 3.9% absolute (22.5% to 28.9% relative) over the stand-alone Conformer system on the NIST Hub5'00, Rt03 and Rt02 evaluation data.
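The two-way cross-system score interpolation described above can be sketched as a log-linear combination of the two systems' hypothesis scores over a shared N-best list. This is a minimal illustration, not the paper's implementation: the interpolation weight `lam`, the toy hypotheses and the per-system log-probabilities are all assumptions made for the example.

```python
# Sketch of 2-way cross-system score interpolation for N-best rescoring.
# Each N-best entry carries the hypothesis text and the (log-domain) scores
# assigned by the hybrid TDNN system and the Conformer system.
# All scores and hypotheses below are illustrative, not from the paper.

def rescore_nbest(nbest, lam=0.5):
    """Return the hypothesis maximizing the interpolated score
    lam * hybrid_logprob + (1 - lam) * conformer_logprob.

    nbest: list of (hypothesis, hybrid_logprob, conformer_logprob) tuples.
    lam:   interpolation weight on the hybrid system score (assumed here;
           in practice it would be tuned on held-out data).
    """
    best = max(nbest, key=lambda h: lam * h[1] + (1.0 - lam) * h[2])
    return best[0]

nbest = [
    ("i want to fly to boston", -12.3, -10.1),
    ("i want a fly to boston",  -11.8, -14.6),
    ("i want to fly to austin", -13.0, -11.9),
]
print(rescore_nbest(nbest, lam=0.5))  # prints "i want to fly to boston"
```

With `lam=1.0` the combination degenerates to the hybrid system's own 1-best; with `lam=0.0` it reduces to pure Conformer rescoring of the hybrid N-best list, so the weight controls how much each system's view of the hypothesis space is trusted.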
- Mingyu Cui
- Jiajun Deng
- Shoukang Hu
- Xurong Xie
- Tianzi Wang
- Shujie Hu
- Mengzhe Geng
- Boyang Xue
- Xunying Liu
- Helen Meng