CTC Blank Triggered Dynamic Layer-Skipping for Efficient CTC-based Speech Recognition (2401.02046v1)
Abstract: Deploying end-to-end speech recognition models on limited computing resources remains challenging despite their impressive performance. Given the steady growth in model size and the wide range of deployment scenarios, selectively executing model components for different inputs to improve inference efficiency is of great interest. In this paper, we propose a dynamic layer-skipping method that leverages the CTC blank output from intermediate layers to trigger the skipping of the last few encoder layers for frames with high blank probabilities. Furthermore, we factorize the CTC output distribution and perform knowledge distillation on intermediate layers to reduce computation and improve recognition accuracy. Experimental results show that by exploiting the CTC blank, the encoder depth can be adjusted dynamically at inference time, yielding a 29% speedup of CTC model inference with only minor performance degradation.
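To make the mechanism concrete, below is a minimal inference-time sketch in PyTorch of blank-triggered layer skipping. It is a sketch under stated assumptions, not the paper's implementation: the class name `BlankSkipEncoder`, the `inter_ctc_head` projection, the threshold value, and the per-utterance gathering loop are all illustrative choices. The core idea follows the abstract: an intermediate CTC head scores each frame, and frames whose blank probability is already high exit early, while the remaining frames continue through the last few encoder layers.

```python
import torch
import torch.nn as nn

class BlankSkipEncoder(nn.Module):
    """Inference-time sketch: run all frames through the shared lower
    layers, then forward only low-blank-probability frames through the
    last few (skippable) encoder layers. Names and defaults here are
    assumptions for illustration, not the paper's implementation."""

    def __init__(self, layers, inter_ctc_head, blank_id=0,
                 skip_threshold=0.95, n_skippable=4):
        super().__init__()
        self.layers = nn.ModuleList(layers)    # full encoder stack
        self.inter_ctc_head = inter_ctc_head   # linear proj -> vocab logits
        self.blank_id = blank_id
        self.threshold = skip_threshold
        self.n_skippable = n_skippable         # trailing layers that may be skipped

    @torch.no_grad()
    def forward(self, x):
        # x: (batch, time, dim), batch-first encoder layers assumed
        n_shared = len(self.layers) - self.n_skippable
        for layer in self.layers[:n_shared]:
            x = layer(x)

        # Intermediate CTC posteriors decide which frames continue.
        probs = self.inter_ctc_head(x).softmax(dim=-1)
        keep = probs[..., self.blank_id] < self.threshold  # (batch, time) bool

        # Blank-dominated frames keep their intermediate states; the rest
        # are gathered and refined by the remaining layers. This simple
        # sketch loops per utterance; a real system would batch/pad.
        x = x.clone()
        for b in range(x.size(0)):
            idx = keep[b].nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            h = x[b, idx].unsqueeze(0)         # (1, n_kept, dim)
            for layer in self.layers[n_shared:]:
                h = layer(h)
            x[b, idx] = h.squeeze(0)
        return x

# Illustrative usage: wrap any stack of batch-first encoder layers.
layers = [nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
          for _ in range(12)]
head = nn.Linear(256, 5000)  # hypothetical vocab; blank assumed at index 0
enc = BlankSkipEncoder(layers, head)
out = enc(torch.randn(2, 100, 256))  # (batch=2, time=100, dim=256)
```

Note that in this sketch the upper layers attend only over the kept frames, so the saved compute grows with the fraction of confidently blank frames; whether skipped frames are dropped from or re-inserted into the attention context is a design choice the paper's factorized blank distribution and intermediate distillation are meant to make robust.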