- The paper presents soft and hard blank-regularization techniques that reduce redundant non-blank predictions in CTC models.
- It achieves over 75% frame reduction and a fourfold speed increase during inference on the LibriSpeech corpus with stable word error rates.
- The findings offer practical strategies for integrating efficient frame skipping into real-time ASR systems and other neural transducer architectures.
Overview of Blank-Regularized CTC for Frame Skipping in Neural Transducers
The paper "Blank-regularized CTC for Frame Skipping in Neural Transducer" by Yifan Yang et al. presents two innovative methods aimed at enhancing the efficiency of Neural Transducer models in automatic speech recognition (ASR) tasks. Neural Transducers and Connectionist Temporal Classification (CTC) are widely used in end-to-end ASR systems, with both employing blank symbols to handle the mismatch in sequence length between input frames and output tokens. However, this often results in computational inefficiencies due to redundant calculations.
Key Contributions
The researchers propose two regularization techniques to encourage a higher ratio of blank symbol predictions in the CTC model, with the objective of optimizing frame skipping:
- Soft Restriction: This method applies a penalty to the self-loop of non-blank symbols in the CTC topology, encouraging the model to produce more blank frames by discouraging consecutively repeated non-blank symbols.
- Hard Restriction: This approach directly limits the maximum number of consecutively repeated non-blank labels during training, pruning redundant paths from the CTC topology (see the sketch after this list).
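To ground these two restrictions, below is a minimal single-utterance sketch of the standard CTC forward (alpha) recursion with the soft restriction added: a penalty λ is subtracted in log-space whenever a path stays on the same non-blank symbol across consecutive frames. The function name and the `penalty` argument are illustrative assumptions, not the authors' released code; setting `penalty=float("inf")` removes non-blank self-loops entirely, which matches the hard restriction in its strictest setting (no consecutive repeats allowed).

```python
import torch

def ctc_loss_soft_blank_reg(log_probs, targets, penalty=0.5, blank=0):
    # log_probs: (T, V) log-softmax outputs for one utterance
    # targets:   (U,) target token ids, blank excluded, U >= 1
    T, V = log_probs.shape
    U = targets.numel()
    # Extended label sequence: blank, y1, blank, ..., yU, blank (S = 2U + 1)
    S = 2 * U + 1
    ext = torch.full((S,), blank, dtype=torch.long)
    ext[1::2] = targets

    neg_inf = float("-inf")
    alpha = torch.full((S,), neg_inf)
    alpha[0] = log_probs[0, blank]   # start in the leading blank
    alpha[1] = log_probs[0, ext[1]]  # or in the first label

    for t in range(1, T):
        prev, alpha = alpha, torch.full((S,), neg_inf)
        for s in range(S):
            # Self-loop: stay on the same extended state. Penalizing this
            # transition for non-blank states discourages consecutively
            # repeated non-blank predictions (the soft restriction).
            stay = prev[s] - (penalty if ext[s] != blank else 0.0)
            cands = [stay]
            if s >= 1:
                cands.append(prev[s - 1])  # advance by one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(prev[s - 2])  # skip the blank between labels
            alpha[s] = torch.logsumexp(torch.stack(cands), 0) + log_probs[t, ext[s]]

    # Valid paths end in the last label or the trailing blank.
    return -torch.logsumexp(torch.stack([alpha[-1], alpha[-2]]), 0)

# Toy usage: 6 frames, vocab {blank, a, b}, target "a b"
lp = torch.log_softmax(torch.randn(6, 3), dim=-1)
loss = ctc_loss_soft_blank_reg(lp, torch.tensor([1, 2]), penalty=0.5)
```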
By applying these strategies, the authors achieved a substantial increase in frame reduction ratios without a corresponding degradation in performance.
Experimental Results
Experiments conducted on the LibriSpeech corpus demonstrated significant improvements. The frame reduction ratio achieved with blank regularization almost reached the theoretical boundary, resulting in a fourfold speed increase during inference. Notably, these methods surpassed existing techniques in balancing word error rate (WER) and inference speed.
The results are compelling: some configurations achieve over 75% frame reduction while keeping WERs comparable to a baseline model that does not use frame skipping. This indicates that the skipped frames, identified as redundant by the CTC head, contribute little to decoding accuracy.
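For context on how the CTC head drives skipping at inference, here is a hedged sketch under our own assumptions about tensor shapes and the threshold value, not the paper's exact interface: frames whose blank posterior exceeds a threshold are dropped before the transducer decoder and joiner ever see them.

```python
import torch

def skip_blank_frames(encoder_out, ctc_log_probs, blank=0, threshold=0.95):
    # encoder_out:   (T, D) encoder frames for one utterance
    # ctc_log_probs: (T, V) log-softmax output of the auxiliary CTC head
    blank_prob = ctc_log_probs[:, blank].exp()
    keep = blank_prob < threshold  # keep frames likely to carry a token
    return encoder_out[keep]       # (T', D), T' <= T, fed to the joiner
```

With blank-regularized training pushing the blank ratio toward the theoretical boundary, the number of retained frames approaches roughly one frame per output token, which is where the reported fourfold inference speedup comes from.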
Applications and Implications
The implications of these findings are substantial for real-time ASR applications where processing speed is critical. The proposed methods can be integrated into existing Neural Transducer architectures to enhance efficiency without necessitating complex architectural changes.
Future Directions
This work invites further exploration in:
- Generalization: Testing the proposed blank regularization techniques across different datasets and ASR architectures to assess their robustness and adaptability.
- Parameter Optimization: Investigating alternative methods and criteria for setting the penalty and repetition limits that may yield better trade-offs in different scenarios.
- Integration with External LLMs: The more streamlined frame processing could leave room for more effective integration with external LLMs, providing further accuracy improvements.
In summary, the paper presents a detailed study of optimizing CTC-guided frame skipping in Neural Transducers, offering a practical approach to significantly increasing ASR efficiency.