Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition (2312.17279v3)
Abstract: In this paper, we propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture. We adapt FastConformer for streaming applications by (1) constraining both the look-ahead and past context in the encoder, and (2) introducing an activation caching mechanism that lets the non-autoregressive encoder operate autoregressively during inference. The model is designed to eliminate the accuracy disparity between training and inference that is common in many streaming models. Furthermore, the proposed encoder works with various decoder configurations, including Connectionist Temporal Classification (CTC) and RNN-Transducer (RNNT) decoders. Additionally, we introduce a hybrid CTC/RNNT architecture that uses a shared encoder with both a CTC and an RNNT decoder to boost accuracy and save computation. We evaluate the proposed model on the LibriSpeech dataset and a large-scale multi-domain dataset and show that it achieves better accuracy with lower latency and inference time than a conventional buffered streaming baseline. We also show that training a model with multiple latencies yields better accuracy than single-latency models while allowing a single model to support multiple latencies. Our experiments further show that the hybrid architecture not only speeds up convergence of the CTC decoder but also improves the accuracy of streaming models compared to single-decoder models.
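The caching idea described in the abstract can be sketched as a streaming loop that keeps a fixed-size buffer of past encoder activations, so each new chunk is processed with bounded left context and constant memory. This is a toy illustration under stated assumptions: the function name `stream_encoder_step` and the moving-average stand-in for a real encoder layer are hypothetical, not the actual FastConformer or NeMo implementation.

```python
import numpy as np

def stream_encoder_step(chunk, cache, left_context=16):
    """Process one chunk of feature frames, attending only to a
    bounded window of cached past activations (limited left context)."""
    # Prepend cached frames so each new frame sees a fixed history.
    context = np.concatenate([cache, chunk], axis=0)
    # Stand-in for a real encoder layer: a causal moving average
    # over at most `left_context` past frames plus the current one.
    out = np.stack([
        context[max(0, i - left_context): i + 1].mean(axis=0)
        for i in range(len(cache), len(context))
    ])
    # Keep only the most recent frames as the new cache, so memory
    # and compute stay constant regardless of stream length.
    new_cache = context[-left_context:]
    return out, new_cache

# Simulated stream: 4 chunks of 8 frames, 10-dim features each.
cache = np.zeros((0, 10))
outputs = []
for _ in range(4):
    chunk = np.random.randn(8, 10)
    out, cache = stream_encoder_step(chunk, cache)
    outputs.append(out)

full = np.concatenate(outputs)
assert full.shape == (32, 10)   # one output frame per input frame
assert cache.shape == (16, 10)  # cache bounded by left_context
```

Because the cache carries exactly the context a full-utterance pass with the same limited context would see, the streamed outputs match offline outputs, which is how the train/inference disparity is avoided.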
- “Streaming end-to-end speech recognition for mobile devices,” in ICASSP, 2019.
- “Fast Conformer with linearly scalable attention for efficient speech recognition,” in ASRU, 2023.
- “Conformer: Convolution-augmented transformer for speech recognition,” in Interspeech, 2020.
- “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in ICML, 2006.
- Alex Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
- “NeMo: A toolkit for building AI applications using neural modules,” arXiv preprint arXiv:1909.09577, 2019.
- “Transformer Transducer: A streamable speech recognition model with Transformer encoders and RNN-T loss,” in ICASSP, 2020.
- “Streaming automatic speech recognition with the transformer model,” in ICASSP, 2020.
- “Developing real-time streaming transformer transducer for speech recognition on large-scale dataset,” in ICASSP, 2021.
- “Streaming transformer ASR with blockwise synchronous beam search,” in Spoken Language Technology Workshop (SLT), 2021.
- “Streaming transformer-based acoustic models using self-attention with augmented memory,” in Interspeech, 2020.
- “Enhancing monotonic multihead attention for streaming ASR,” in Interspeech, 2020.
- “A better and faster end-to-end model for streaming ASR,” in ICASSP, 2021.
- “Dual-mode ASR: Unify and improve streaming ASR with full-context modeling,” in ICLR, 2021.
- “WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit,” arXiv preprint arXiv:2102.01547, 2021.
- “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015.
- “Layer normalization,” in NIPS, 2016.
- “Multi-mode Transformer Transducer with stochastic future context,” in Interspeech, 2021.
- “Low latency end-to-end streaming speech recognition with a scout network,” in Interspeech, 2020.
- “Synchronous transformers for end-to-end speech recognition,” in ICASSP, 2020.
- “LibriSpeech: An ASR corpus based on public domain audio books,” in ICASSP, 2015.
- “SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition,” in Interspeech, 2021.
- “Earnings-22: A practical benchmark for accents in the wild,” arXiv preprint arXiv:2203.15591, 2022.
- “GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio,” in Interspeech, 2021.
- “TED-LIUM: An automatic speech recognition dedicated corpus,” in LREC, 2012.
- “Common Voice: A massively-multilingual speech corpus,” in LREC, 2020.
- “VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” in ACL, 2021.
- “The AMI meeting corpus,” in International Conference on Methods and Techniques in Behavioral Research, 2005.
- “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in EMNLP, 2018.
- “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
- “Attention is all you need,” in NeurIPS, 2017.
- “Mixed precision training,” in ICLR, 2018.
- “Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition,” in ICASSP, 2021.
- “FastEmit: Low-latency streaming ASR with sequence-level emission regularization,” in ICASSP, 2021.
- NVIDIA NeMo, “FastConformer Hybrid Large Streaming Multi (en-US),” https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_large_streaming_multi, 2023.
- Vahid Noroozi
- Somshubra Majumdar
- Ankur Kumar
- Jagadeesh Balam
- Boris Ginsburg