Analysis of FastCorrect 2: Fast Error Correction on Multiple Candidates for Automatic Speech Recognition
The paper "FastCorrect 2: Fast Error Correction on Multiple Candidates for Automatic Speech Recognition" presents a model that improves error correction in automatic speech recognition (ASR) by exploiting the multiple candidates produced by beam search. FastCorrect 2 combines non-autoregressive generation, for fast inference, with multiple hypotheses, for higher correction accuracy. The work is significant because it reduces word error rate (WER) without substantially increasing latency, making it feasible for industrial deployment.
Key Contributions
The primary contribution of this research is the development of a method to utilize multiple ASR candidates for error correction, which takes into account the "voting effect" among these candidates. This approach contrasts with traditional error correction models that process one sentence at a time, thus missing the potential consensus derived from multiple candidate sentences.
The authors devise a specialized alignment algorithm to support this multi-candidate approach. Because beam-search candidates differ in length, the algorithm pads them to a common length, choosing the alignment that maximizes both token matching and pronunciation similarity across candidates. Aligning tokens position by position lets the model exploit the consensus among candidates in subsequent processing.
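The idea of length-mismatched alignment can be sketched with a standard dynamic-programming aligner. This is an illustration, not the paper's exact algorithm: exact token matches score highest, pronunciation-similar tokens score lower (here approximated by a toy first-letter check standing in for real phoneme similarity), and gaps are filled with a `<pad>` token so both sequences end up the same length. The function names and the scoring constants are assumptions for this sketch.

```python
PAD = "<pad>"

def token_score(a, b, pron_sim):
    if a == b:
        return 2.0          # identical token
    if pron_sim(a, b):
        return 1.0          # pronunciation-similar token
    return 0.0              # unrelated token

def align_pair(src, cand, pron_sim):
    """Needleman-Wunsch-style alignment of two token lists."""
    n, m = len(src), len(cand)
    gap = -0.5              # penalty for inserting a pad
    # dp[i][j] = best score aligning src[:i] with cand[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + token_score(src[i - 1], cand[j - 1], pron_sim),
                dp[i - 1][j] + gap,   # pad inserted into cand
                dp[i][j - 1] + gap,   # pad inserted into src
            )
    # Trace back to recover the padded sequences.
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1]
                + token_score(src[i - 1], cand[j - 1], pron_sim)):
            a.append(src[i - 1]); b.append(cand[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            a.append(src[i - 1]); b.append(PAD); i -= 1
        else:
            a.append(PAD); b.append(cand[j - 1]); j -= 1
    return a[::-1], b[::-1]

# Toy pronunciation similarity: tokens share a leading character.
sim = lambda x, y: x[0] == y[0]

a, b = align_pair(["i", "half", "cat"], ["i", "have", "a", "cat"], sim)
# "half" aligns to the similar-sounding "have"; a pad absorbs the extra "a".
```

In the paper, this pairwise idea is extended to align all beam candidates jointly, so that every position holds one token (or pad) per candidate.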
FastCorrect 2 enhances the original FastCorrect model by integrating several new components:
- Encoder with Pre-Net: The encoder concatenates the embeddings of all candidate tokens position-wise and adjusts them using a linear layer, thus capturing combined contextual information.
- Duration Predictor: This component estimates the number of target tokens aligned to each source token per candidate, allowing the model to adjust inputs dynamically based on predicted durations.
- Candidate Predictor: Selects the candidate most amenable to correction, i.e., the one predicted to incur the lowest correction loss when fed to the decoder.
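The Pre-Net step above can be sketched in a few lines. This is a minimal plain-Python illustration, assuming k aligned candidates of equal length L and embedding size d: at each position, the k candidate-token embeddings are concatenated into a k*d vector and projected back to d dimensions by a shared linear layer. The names and shapes here are assumptions for this sketch, not the paper's code.

```python
import random

random.seed(0)

def linear(x, weight, bias):
    """y = W x + b with a plain-Python weight matrix (one row per output)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b_i
            for row, b_i in zip(weight, bias)]

def pre_net(candidate_embeddings, weight, bias):
    """candidate_embeddings: k candidates, each L vectors of size d.
    Returns L fused vectors of size d (one per aligned position)."""
    length = len(candidate_embeddings[0])
    fused = []
    for pos in range(length):
        # Concatenate the k embeddings at this position -> size k*d.
        concat = [v for cand in candidate_embeddings for v in cand[pos]]
        fused.append(linear(concat, weight, bias))
    return fused

# Toy setup: k=3 candidates, length L=4, embedding size d=2.
k, L, d = 3, 4, 2
cands = [[[random.random() for _ in range(d)] for _ in range(L)]
         for _ in range(k)]
W = [[random.random() for _ in range(k * d)] for _ in range(d)]
b = [0.0] * d

out = pre_net(cands, W, b)
assert len(out) == L and all(len(v) == d for v in out)
```

In a real implementation this projection would be a learned `nn.Linear` inside the encoder; the point of the sketch is only the position-wise concatenate-then-project data flow.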
Empirical Results
FastCorrect 2 was evaluated on the public AISHELL-1 benchmark and an internal dataset, where it reduced WER by 3.2% and 2.6%, respectively, over previous models. It outperformed both single-sentence correction models and cascaded methods that combine re-scoring and correction pipelines, in accuracy and efficiency alike. Furthermore, FastCorrect 2 is approximately 5 times faster than its autoregressive counterparts, making it well suited to real-time applications.
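For reference, the WER metric reported above is the word-level edit distance (substitutions + insertions + deletions) between hypothesis and reference, divided by the reference length. A minimal implementation:

```python
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sit"))  # 1 substitution over 3 words
```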
Implications and Future Directions
From a theoretical perspective, FastCorrect 2 offers a clear demonstration of how candidate diversity can be harnessed for improved error detection and correction. Practically, the model serves as a more efficient alternative to traditional error correction and re-scoring pipelines, merging their functionalities into a single, faster system.
Potential future developments could involve expanding the model's applicability to different languages or adapting it for more diverse ASR systems. Investigating how this model performs with larger beam sizes and exploring its integration with end-to-end ASR frameworks might also yield promising results.
Overall, the paper presents a structured approach to improving ASR error correction using multiple hypotheses, setting a foundation for further exploration in multi-candidate processing techniques.