LV-CTC: Non-autoregressive ASR with CTC and latent variable models (2403.19207v1)

Published 28 Mar 2024 in eess.AS

Abstract: Non-autoregressive (NAR) models for automatic speech recognition (ASR) aim to achieve high accuracy and fast inference by simplifying the autoregressive (AR) generation process of conventional models. Connectionist temporal classification (CTC) is one of the key techniques used in NAR ASR models. In this paper, we propose a new model combining CTC with a latent variable model, an approach behind some of the state-of-the-art models in neural machine translation. A new neural network architecture and a formulation specialized for the ASR application are introduced. In the proposed model, the CTC alignment is assumed to depend on latent variables that are expected to capture dependencies between tokens. Experimental results on a 100-hour subset of the LibriSpeech corpus showed the best recognition accuracy among CTC-based NAR models. On the TED-LIUM2 corpus, the proposed model achieved the best recognition accuracy even when AR E2E models are included in the comparison, while offering faster inference.
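
To make the core idea concrete, here is a minimal PyTorch sketch of conditioning CTC logits on a VAE-style latent variable. This is an illustration of the general technique, not the authors' architecture: the module name `LatentCTCHead`, the Gaussian posterior, the standard-normal prior, the dimensions, and the KL weight are all assumptions made for the example.

```python
# Sketch (not the paper's implementation): frame-level CTC logits
# conditioned on a latent z intended to capture token dependencies.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentCTCHead(nn.Module):
    def __init__(self, d_model: int, d_latent: int, vocab_size: int):
        super().__init__()
        # VAE-style Gaussian posterior over z (assumption for this sketch)
        self.to_mu = nn.Linear(d_model, d_latent)
        self.to_logvar = nn.Linear(d_model, d_latent)
        # CTC classifier conditioned on both the encoder state and z
        self.out = nn.Linear(d_model + d_latent, vocab_size)

    def forward(self, enc: torch.Tensor):
        # enc: (T, B, d_model) frame-level encoder output
        mu, logvar = self.to_mu(enc), self.to_logvar(enc)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        log_probs = F.log_softmax(self.out(torch.cat([enc, z], dim=-1)), dim=-1)
        # KL divergence against a standard normal prior, as in a VAE
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return log_probs, kl

# Usage: CTC loss over the z-conditioned logits plus the KL regularizer.
T, B, V, S = 100, 4, 32, 20
head = LatentCTCHead(d_model=256, d_latent=64, vocab_size=V)
enc = torch.randn(T, B, 256)                      # stand-in encoder output
log_probs, kl = head(enc)
targets = torch.randint(1, V, (B, S))             # 0 is reserved for blank
loss = F.ctc_loss(log_probs, targets,
                  input_lengths=torch.full((B,), T, dtype=torch.long),
                  target_lengths=torch.full((B,), S, dtype=torch.long),
                  blank=0) + 0.1 * kl              # KL weight is illustrative
loss.backward()
```

Because the classifier sees z alongside each encoder frame, the per-frame CTC outputs are no longer conditionally independent given the encoder states alone, which is the independence assumption the latent variable is meant to relax.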

Authors (4)
  1. Yuya Fujita (16 papers)
  2. Shinji Watanabe (416 papers)
  3. Xuankai Chang (61 papers)
  4. Takashi Maekaku (9 papers)