LV-CTC: Non-autoregressive ASR with CTC and latent variable models (2403.19207v1)
Abstract: Non-autoregressive (NAR) models for automatic speech recognition (ASR) aim to achieve high accuracy and fast inference by simplifying the autoregressive (AR) generation process of conventional models. Connectionist temporal classification (CTC) is one of the key techniques used in NAR ASR models. In this paper, we propose a new model that combines CTC with a latent variable model, one of the state-of-the-art approaches in the neural machine translation field. We introduce a new neural network architecture and a formulation specialized for the ASR application. In the proposed model, the CTC alignment is assumed to depend on latent variables that are expected to capture dependencies between tokens. Experimental results on a 100-hour subset of the LibriSpeech corpus show that the proposed model achieves the best recognition accuracy among CTC-based NAR models. On the TED-LIUM2 corpus, it achieves the best recognition accuracy among all compared models, including AR end-to-end (E2E) models, with faster inference speed.
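As a rough sketch of the formulation the abstract describes: the model introduces latent variables between the encoder and the CTC alignment and trains them with a variational objective. The notation below is ours for illustration, not taken from the paper: $X$ is the input speech, $Y$ the output token sequence, $A$ a frame-level CTC alignment, $\beta$ the CTC collapsing function (removing blanks and repeats), and $z$ the latent variables expected to capture inter-token dependencies.

```latex
% Illustrative sketch (assumed notation, not the paper's exact formulation):
% a CTC model whose alignment is conditioned on latent variables z,
% trained with an evidence lower bound (ELBO) as in variational autoencoders.
\begin{align}
  % marginalize over the latents that carry token dependencies
  P(Y \mid X) &= \int P(Y \mid z, X)\, P(z \mid X)\, \mathrm{d}z \\
  % CTC marginalization: sum over all alignments A that collapse to Y
  P(Y \mid z, X) &= \sum_{A \in \beta^{-1}(Y)} P(A \mid z, X) \\
  % variational lower bound with an approximate posterior q(z | X, Y)
  \log P(Y \mid X) &\ge \mathbb{E}_{q(z \mid X, Y)}\!\bigl[\log P(Y \mid z, X)\bigr]
    - \mathrm{KL}\bigl(q(z \mid X, Y) \,\Vert\, P(z \mid X)\bigr)
\end{align}
```

At inference, latent-variable NAR models of this kind typically draw $z$ from the prior (or a deterministic approximation of its mode) in a single forward pass and then decode the CTC output in parallel, which is what makes decoding faster than AR beam search; whether LV-CTC follows exactly this recipe is an assumption here.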
Authors: Yuya Fujita, Shinji Watanabe, Xuankai Chang, Takashi Maekaku