Investigating End-to-End ASR Architectures for Long Form Audio Transcription (2309.09950v2)
Abstract: This paper presents an overview and evaluation of several end-to-end ASR models on long-form audio. We study three categories of Automatic Speech Recognition (ASR) models based on their core architecture: (1) convolutional, (2) convolutional with squeeze-and-excitation, and (3) convolutional with attention. We selected one ASR model from each category and evaluated its word error rate, maximum audio length, and real-time factor on a variety of long-form audio benchmarks: Earnings-21 and 22, CORAAL, and TED-LIUM3. The model from the attention-based category, which uses self-attention with local attention and a global token, achieves the best accuracy compared to the other architectures. We also compared models with CTC and RNNT decoders and showed that CTC-based models are more robust and efficient than RNNT on long-form audio.
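As a rough illustration (not the paper's actual evaluation code), the sketch below shows how the two headline metrics from the abstract, word error rate and real-time factor, are typically computed; the function names and the example strings are assumptions made purely for this example.

```python
# Minimal sketch of the two evaluation metrics named in the abstract: word error rate (WER)
# and real-time factor (RTF). Illustrative only; names and sample data are hypothetical.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed here via a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock decoding time divided by audio duration (lower is faster)."""
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    ref = "the quarterly earnings call lasted ninety minutes"
    hyp = "the quarterly earning call lasted ninety minutes"
    print(f"WER: {word_error_rate(ref, hyp):.3f}")        # 1 substitution over 7 words
    print(f"RTF: {real_time_factor(120.0, 3600.0):.3f}")  # 2 min to decode 1 h of audio
```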
- “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Interspeech, 2020.
- “NeMo: a toolkit for Conversational AI and Large Language Models.”
- “E2E segmentation in a two-pass cascaded encoder ASR model,” in ICASSP, 2023.
- “E2E Segmenter: Joint segmenting and decoding for long-form ASR,” arXiv:2204.10749, 2022.
- “A comparison of end-to-end models for long-form speech recognition,” in ASRU, 2019, pp. 889–896.
- NVIDIA NeMo, “Streaming / Buffered ASR,” https://github.com/NVIDIA/NeMo/tree/main/examples/asr/asr_chunked_inference, 2022.
- “QuartzNet: Deep automatic speech recognition with 1d time-channel separable convolutions,” in ICASSP, 2020.
- “ContextNet: Improving convolutional neural networks for automatic speech recognition with global context,” in Interspeech, 2020.
- “Citrinet: Closing the gap between non-autoregressive and autoregressive end-to-end models for automatic speech recognition,” arXiv:2104.01721, 2021.
- “Fast conformer with linearly scalable attention for efficient speech recognition,” arXiv:2305.05084, 2023.
- “Recognizing long-form speech using streaming end-to-end models,” in ASRU, 2019, pp. 920–927.
- “A better and faster end-to-end model for streaming ASR,” in ICASSP, 2021.
- “Dual-mode ASR: Unify and improve streaming ASR with full-context modeling,” in ICLR, 2021.
- “Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit,” in Interspeech, 2021.
- “Context-aware end-to-end ASR using self-attentive embedding and tensor fusion,” in ICASSP, 2023, pp. 1–5.
- “Advanced long-context end-to-end speech recognition using context-expanded transformers,” 2021.
- “Wav2letter: an end-to-end convnet-based speech recognition system,” in ICLR, 2016.
- “Jasper: An End-to-End Convolutional Neural Acoustic Model,” in Interspeech, 2019.
- Alex Graves, “Sequence transduction with recurrent neural networks,” in ICML, 2012.
- “Searching for activation functions,” in ICLR, 2017.
- “Librispeech: an ASR corpus based on public domain audio books,” in ICASSP, 2015.
- “MLS: A large-scale multilingual dataset for speech research,” in Interspeech, 2020.
- “Mozilla: A journey to less than 10% word error rate,” https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error-rate/, Accessed: 2018-04-06.
- “The design for the Wall Street Journal based CSR corpus,” in Proc. of the workshop on Speech and Natural Language. ACL, 1992.
- “Fisher English Training Speech Part 1 Transcripts,” Philadelphia: Linguistic Data Consortium, 2004.
- “Switchboard-1 Release 2,” Philadelphia: Linguistic Data Consortium, 1997.
- “Building the Singapore English National Speech Corpus,” in Interspeech, 2019.
- “VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” in ACL, 2021.
- “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2019.
- “Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization,” in Interspeech, 2021.
- “The People’s Speech: A large-scale diverse English speech recognition dataset for commercial usage,” arXiv:2111.09344, 2021.
- “TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation,” in SPECOM, 2018.
- “Earnings-21: A practical benchmark for ASR in the wild,” arXiv:2104.11348, 2021.
- revdotcom, “speech-datasets,” June 2022.
- “The Corpus of Regional African American Language,” 2021.
- “Robust speech recognition via large-scale weak supervision,” in ICML, 2023.
- “Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-Transducer,” in ASRU, 2017.
- “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in ICML, 2006.