Investigating End-to-End ASR Architectures for Long Form Audio Transcription (2309.09950v2)

Published 18 Sep 2023 in eess.AS and cs.SD

Abstract: This paper presents an overview and evaluation of several end-to-end ASR models on long-form audio. We study three categories of Automatic Speech Recognition (ASR) models based on their core architecture: (1) convolutional, (2) convolutional with squeeze-and-excitation, and (3) convolutional with attention. We selected one ASR model from each category and evaluated word error rate, maximum audio length, and real-time factor for each model on a variety of long-form audio benchmarks: Earnings-21, Earnings-22, CORAAL, and TED-LIUM 3. The model from the self-attention category, using local attention with a global token, achieves the best accuracy compared to the other architectures. We also compared models with CTC and RNNT decoders and show that CTC-based models are more robust and efficient than RNNT on long-form audio.
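The two efficiency and accuracy metrics the abstract mentions can be made concrete with a short sketch. This is not code from the paper; the function names are illustrative, and WER is computed here with a plain Levenshtein distance over words.

```python
# Hypothetical sketch of the metrics used in the evaluation: word error rate
# (WER) and real-time factor (RTF). Not taken from the paper's codebase.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution,      # substitute (or match)
                           dp[i - 1][j] + 1,  # delete from reference
                           dp[i][j - 1] + 1)  # insert from hypothesis
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; values below 1 are faster than real time."""
    return processing_seconds / audio_seconds
```

For example, `wer("the cat sat", "the bat sat")` yields 1/3 (one substitution over three reference words), and transcribing 60 s of audio in 30 s gives an RTF of 0.5.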

