FastCorrect: Non-Autoregressive Error Correction for ASR
The paper "FastCorrect: Fast Error Correction with Edit Alignment for Automatic Speech Recognition" introduces a novel approach for error correction in ASR outputs, highlighting a distinct departure from traditional autoregressive methods. Employing non-autoregressive (NAR) techniques, FastCorrect aims to mitigate latency issues while preserving accuracy, leveraging the natural monotonic alignment of source and target sentences in ASR tasks.
Methodology
FastCorrect is built on edit alignment over the three primary edit operations: insertion, deletion, and substitution. During training, each token in the ASR output is aligned with tokens in the ground-truth transcript via edit distance. The alignment yields, for each source token, the number of target tokens it should map to (0 indicates deletion, 1 a kept or substituted token, and more than 1 an insertion), and these counts supervise a token-level length predictor. During inference, the predictor's output is used to delete or duplicate source tokens, after which the decoder generates the entire target sequence in parallel.
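To make the alignment concrete, here is a minimal Python sketch (assuming word-level tokens) that fills a standard edit-distance table and backtraces one optimal path to count how many target tokens each source token aligns to. The paper additionally disambiguates between equal-cost alignment paths using n-gram statistics, which this simplified version omits.

```python
def edit_alignment(source, target):
    """Return, for each source token, the number of target tokens it aligns to:
    0 = delete, 1 = keep or substitute, >1 = an insertion attached to this token."""
    m, n = len(source), len(target)
    # Standard edit-distance dynamic-programming table.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete a source token
                           dp[i][j - 1] + 1,         # insert a target token
                           dp[i - 1][j - 1] + cost)  # match or substitute
    # Backtrace one optimal path and count target tokens per source token.
    counts = [0] * m
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (source[i - 1] != target[j - 1]):
            counts[i - 1] += 1          # match / substitution
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            i -= 1                      # deletion: this source token maps to nothing
        else:
            if m > 0:                   # insertion: attach the extra target token to a neighbour
                counts[max(i - 1, 0)] += 1
            j -= 1
    return counts

# Hypothetical example: the ASR output misrecognizes one word and drops another.
hyp = ["I", "red", "a", "book"]
ref = ["I", "read", "a", "new", "book"]
print(edit_alignment(hyp, ref))         # -> [1, 1, 2, 1]
```

In this toy example, "red" aligns to one target token (a substitution) and "a" aligns to two ("a new"), which is exactly the supervision signal the length predictor is trained on.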
The model itself is a Transformer-based NAR encoder-decoder augmented with this length predictor. The predictor bridges the length mismatch between the ASR hypothesis and the corrected output, so the decoder input can be formed by deleting or duplicating source tokens before decoding all positions in parallel.
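A minimal PyTorch sketch of such an architecture is given below. The module layout, dimensions, and the rounding-based duplication/deletion step are illustrative assumptions rather than the authors' implementation; in particular, during training the decoder input would be built from the ground-truth counts produced by edit alignment, while the predicted durations are used only at inference.

```python
import torch
import torch.nn as nn

class NARCorrector(nn.Module):
    """Toy NAR correction model: encoder + token-level length predictor + parallel decoder.
    Names and sizes are illustrative assumptions, not the paper's implementation."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Length predictor: one target-token count per source token.
        self.length_head = nn.Linear(d_model, 1)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_tokens):
        src_emb = self.embed(src_tokens)               # (B, S, D)
        enc = self.encoder(src_emb)                    # (B, S, D)
        durations = self.length_head(enc).squeeze(-1)  # (B, S) predicted counts
        counts = durations.round().clamp(min=0).long()
        # Adjust the source: drop tokens with count 0, duplicate tokens with count > 1.
        adjusted = [emb.repeat_interleave(c, dim=0) for emb, c in zip(src_emb, counts)]
        adjusted = nn.utils.rnn.pad_sequence(adjusted, batch_first=True)  # (B, T', D)
        # No causal mask is passed, so all output positions are decoded in parallel.
        dec = self.decoder(tgt=adjusted, memory=enc)
        return self.out(dec), durations                # token logits + durations for training
```

Because the decoder self-attention is not causally masked, the whole corrected sequence is produced in a single forward pass, which is the source of the latency reduction discussed in the experiments below.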
Experimental Evaluation
The performance of FastCorrect is validated through experiments on the public AISHELL-1 dataset and a large internal dataset. The results show a 6-9 times inference speedup over the autoregressive correction baseline while retaining an 8-14% reduction in word error rate (WER) relative to the uncorrected ASR output, an accuracy nearly on par with the autoregressive model. Comparisons with other NAR text-editing models, such as LevT and FELIX, further underscore FastCorrect's advantage in both correction accuracy and speed.
Implications and Future Developments
FastCorrect's design also addresses the scarcity of paired training data for ASR correction by constructing a pseudo dataset for pre-training, which improves the model's robustness. Separating error detection (via the length predictor) from error correction (via the decoder) allows the model to handle insertions, deletions, and complex substitutions effectively.
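As an illustration of the pseudo-data idea, the sketch below corrupts clean text with random deletions, insertions, and substitutions to form (noisy, clean) training pairs. The error probabilities and uniform substitution choices are assumptions for illustration; the paper draws substitutions from pronunciation-similar candidates rather than uniformly at random.

```python
import random

def corrupt(tokens, vocab, p_del=0.05, p_ins=0.05, p_sub=0.10):
    """Inject ASR-like errors into a clean token sequence (illustrative rates)."""
    noisy = []
    for tok in tokens:
        r = random.random()
        if r < p_del:
            continue                                # deletion error
        if r < p_del + p_sub:
            noisy.append(random.choice(vocab))      # substitution error
        else:
            noisy.append(tok)
        if random.random() < p_ins:
            noisy.append(random.choice(vocab))      # insertion error
    return noisy

# Usage: pair each corrupted sentence with its clean original for pre-training.
clean = "the quick brown fox jumps over the lazy dog".split()
vocab = list(set(clean))                            # toy vocabulary
pseudo_pair = (corrupt(clean, vocab), clean)
```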
Practically, FastCorrect offers a feasible route to real-time ASR applications, showing that high-speed error correction can be integrated without compromising recognition quality. Theoretically, the model challenges the prevailing reliance on autoregressive approaches for sequence-to-sequence tasks in ASR and opens a dialogue for broader use of NAR strategies in other domains.
Potential future developments could explore expansions of FastCorrect’s methodology to other sequence transformation tasks, such as machine translation or text correction, where similar alignment strategies might yield efficiency benefits. Additionally, investigations into leveraging multiple hypotheses from ASR output could further refine error correction outcomes.
Conclusion
The introduction of FastCorrect marks a meaningful advance toward efficient and accurate ASR error correction. By combining edit alignment with a non-autoregressive framework, FastCorrect improves computational efficiency without sacrificing correction quality, and it makes a case for broader adoption of NAR frameworks in sequence-to-sequence learning.