UCorrect: An Unsupervised Framework for Automatic Speech Recognition Error Correction (2401.05689v1)
Abstract: Error correction techniques have been used to refine the output sentences of automatic speech recognition (ASR) models and achieve a lower word error rate (WER). Previous works usually adopt end-to-end models and have a strong dependency on Pseudo Paired Data and Original Paired Data. However, when pre-trained only on Pseudo Paired Data, previous models can even degrade correction performance, and fine-tuning on Original Paired Data requires the source-side data to be transcribed by a well-trained ASR model, which is time-consuming and not universal. In this paper, we propose UCorrect, an unsupervised Detector-Generator-Selector framework for ASR error correction that has no dependency on the training data mentioned above. The procedure first detects whether each character is erroneous, then generates candidate characters, and finally selects the most confident one to replace the erroneous character. Experiments on the public AISHELL-1 and WenetSpeech datasets show the effectiveness of UCorrect for ASR error correction: 1) it achieves significant WER reduction, 6.83% even without fine-tuning and 14.29% after fine-tuning; 2) it outperforms popular NAR correction models by a large margin with competitively low latency; and 3) it is a universal method, as it reduces the WER of ASR models with different decoding strategies and of ASR models trained on datasets of different scales.
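The Detector-Generator-Selector loop described in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hypothetical implementation assuming a BERT-style masked language model (here `bert-base-chinese` from HuggingFace Transformers) serves as both detector and generator; the probability threshold, the top-k value, and the omission of the paper's character-similarity filtering are illustrative assumptions, not the exact UCorrect configuration.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Hypothetical checkpoint and hyperparameters chosen for illustration only;
# UCorrect builds on a pretrained masked language model, but its exact setup
# (thresholds, candidate filtering, fine-tuning) differs from this sketch.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese").eval()


def ucorrect_sketch(sentence: str, threshold: float = 0.01, top_k: int = 5) -> str:
    chars = list(sentence)
    for i in range(len(chars)):
        # Detector: mask position i and score how plausible the ASR character is.
        masked = chars[:i] + [tokenizer.mask_token] + chars[i + 1:]
        inputs = tokenizer("".join(masked), return_tensors="pt")
        mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()
        with torch.no_grad():
            probs = model(**inputs).logits[0, mask_pos].softmax(dim=-1)
        orig_id = tokenizer.convert_tokens_to_ids(chars[i])
        if probs[orig_id] >= threshold:
            continue  # character judged correct; keep it unchanged
        # Generator: propose the top-k candidate characters for this position.
        cand_ids = probs.topk(top_k).indices.tolist()
        # Selector: keep the most confident candidate (the paper additionally
        # restricts candidates, e.g. to phonologically similar characters,
        # which this sketch omits).
        chars[i] = tokenizer.convert_ids_to_tokens(cand_ids[0])
    return "".join(chars)
```

Applied to an ASR hypothesis string, `ucorrect_sketch` returns the sentence with low-confidence characters replaced by the masked language model's most confident candidates; no paired ASR transcription data is needed at any point, which is the sense in which the framework is unsupervised.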
- “Conformer: Convolution-augmented transformer for speech recognition,” in Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, Helen Meng, Bo Xu, and Thomas Fang Zheng, Eds. 2020, pp. 5036–5040, ISCA.
- “Unified streaming and non-streaming two-pass end-to-end model for speech recognition,” CoRR, vol. abs/2012.05481, 2020.
- “U2++: Unified two-pass bidirectional end-to-end model for speech recognition,” CoRR, vol. abs/2106.05642, 2021.
- “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, Eds., 2017, pp. 5998–6008.
- “Levenshtein transformer,” in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, Eds., 2019, pp. 11179–11189.
- “FastCorrect: Fast error correction with edit alignment for automatic speech recognition,” in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, Eds., 2021, pp. 21708–21719.
- “AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline,” CoRR, vol. abs/1709.05522, 2017.
- “WenetSpeech: A 10000+ hours multi-domain Mandarin corpus for speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022. 2022, pp. 6182–6186, IEEE.
- “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio, Eds. 2019, pp. 4171–4186, Association for Computational Linguistics.
- “Visually and phonologically similar characters in incorrect simplified Chinese words,” in COLING 2010, 23rd International Conference on Computational Linguistics, Posters Volume, 23-27 August 2010, Beijing, China, Chu-Ren Huang and Dan Jurafsky, Eds. 2010, pp. 739–747, Chinese Information Processing Society of China.
- “ESPnet: End-to-end speech processing toolkit,” in Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, B. Yegnanarayana, Ed. 2018, pp. 2207–2211, ISCA.
- “WeNet 2.0: More productive end-to-end speech recognition toolkit,” in Interspeech 2022, 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, 18-22 September 2022, Hanseok Ko and John H. L. Hansen, Eds. 2022, pp. 1661–1665, ISCA.
- Jiaxin Guo
- Minghan Wang
- Xiaosong Qiao
- Daimeng Wei
- Hengchao Shang
- Zongyao Li
- Zhengzhe Yu
- Yinglu Li
- Chang Su
- Min Zhang
- Shimin Tao
- Hao Yang