
Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems (2206.11596v1)

Published 23 Jun 2022 in eess.AS and cs.AI

Abstract: Fundamental modelling differences between hybrid and end-to-end (E2E) automatic speech recognition (ASR) systems create large diversity and complementarity among them. This paper investigates multi-pass rescoring and cross-adaptation based system combination approaches for hybrid TDNN and Conformer E2E ASR systems. In multi-pass rescoring, a state-of-the-art hybrid LF-MMI trained CNN-TDNN system featuring speed perturbation, SpecAugment and Bayesian learning of hidden unit contributions (LHUC) speaker adaptation was used to produce initial N-best outputs, which were then rescored by the speaker-adapted Conformer system using a 2-way cross-system score interpolation. In cross adaptation, the hybrid CNN-TDNN system was adapted to the 1-best output of the Conformer system, or vice versa. Experiments on the 300-hour Switchboard corpus suggest that the combined systems derived using either of the two system combination approaches outperformed the individual systems. The best combined system, obtained using multi-pass rescoring, produced statistically significant word error rate (WER) reductions of 2.5% to 3.9% absolute (22.5% to 28.9% relative) over the stand-alone Conformer system on the NIST Hub5'00, Rt03 and Rt02 evaluation data.
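
The two-pass rescoring described in the abstract reduces to a log-linear combination of the two systems' scores over the hybrid system's N-best list. Below is a minimal Python sketch of that idea, assuming an interpolation weight lam and a hypothetical conformer_score callable; the function names, interface, and weight value are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of 2-way cross-system score interpolation for N-best
# rescoring. The weight `lam` and the `conformer_score` interface are
# illustrative assumptions, not values or APIs from the paper.

def rescore_nbest(nbest, conformer_score, lam=0.5):
    """Rescore first-pass hybrid N-best hypotheses with a second system.

    nbest           : list of (hypothesis_text, hybrid_log_score) pairs
                      from the first-pass hybrid CNN-TDNN system.
    conformer_score : callable mapping a hypothesis string to the
                      Conformer's log-score for it (hypothetical).
    lam             : interpolation weight between the two systems.
    """
    rescored = []
    for hyp, hybrid_score in nbest:
        # Log-linear interpolation of the two systems' scores.
        combined = lam * hybrid_score + (1.0 - lam) * conformer_score(hyp)
        rescored.append((hyp, combined))
    # Return hypotheses sorted by interpolated score, best first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Toy usage with stand-in scores:
if __name__ == "__main__":
    nbest = [("yes we can", -12.3), ("yes we ken", -11.9)]
    fake_conformer = {"yes we can": -10.1, "yes we ken": -14.7}.get
    best_hyp, best_score = rescore_nbest(nbest, fake_conformer)[0]
    print(best_hyp, best_score)
```

Cross adaptation works in the opposite direction: rather than interpolating scores, one system's 1-best output serves as the adaptation target (e.g. for LHUC speaker adaptation) of the other system, so no joint scoring pass is needed.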

Authors (10)
  1. Mingyu Cui (31 papers)
  2. Jiajun Deng (75 papers)
  3. Shoukang Hu (38 papers)
  4. Xurong Xie (38 papers)
  5. Tianzi Wang (37 papers)
  6. Shujie Hu (36 papers)
  7. Mengzhe Geng (42 papers)
  8. Boyang Xue (23 papers)
  9. Xunying Liu (92 papers)
  10. Helen Meng (204 papers)
Citations (9)