Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition (2305.05084v6)
Abstract: Conformer-based models have become the dominant end-to-end architecture for speech processing tasks. To make the Conformer architecture more efficient for training and inference, we carefully redesigned it with a novel downsampling schema. The resulting model, named Fast Conformer (FC), is 2.8x faster than the original Conformer, scales to a billion parameters without any changes to the core architecture, and achieves state-of-the-art accuracy on automatic speech recognition benchmarks. To enable transcription of long-form speech up to 11 hours, we replaced global attention with limited-context attention after training, and further improved accuracy by fine-tuning with the addition of a global token. Combined with a Transformer decoder, Fast Conformer also outperforms the original Conformer in both accuracy and speed on speech translation and spoken language understanding.
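The efficiency gain comes chiefly from the redesigned downsampling front-end: Fast Conformer increases temporal reduction from the usual 4x to 8x before the Conformer blocks, using depthwise-separable convolutions to keep the front-end cheap. Below is a minimal PyTorch sketch of that idea; the channel count, kernel size, and exact layer layout are illustrative assumptions rather than the paper's precise configuration.

```python
# Sketch of an 8x convolutional subsampling front-end in the spirit of
# Fast Conformer's redesigned downsampler. Hyperparameters (256 channels,
# kernel size 9) are illustrative assumptions.
import torch
import torch.nn as nn

class DepthwiseSeparableSubsampling(nn.Module):
    """Three stride-2 conv stages -> 8x reduction in time resolution."""
    def __init__(self, feat_in: int = 80, d_model: int = 256, kernel_size: int = 9):
        super().__init__()
        layers = []
        for i in range(3):  # 2 * 2 * 2 = 8x total downsampling
            if i == 0:
                # First stage: a regular conv lifts the single input channel.
                layers += [nn.Conv2d(1, d_model, kernel_size, stride=2,
                                     padding=kernel_size // 2), nn.ReLU()]
            else:
                # Later stages: depthwise + pointwise (separable) convs keep cost low.
                layers += [
                    nn.Conv2d(d_model, d_model, kernel_size, stride=2,
                              padding=kernel_size // 2, groups=d_model),
                    nn.Conv2d(d_model, d_model, kernel_size=1),
                    nn.ReLU(),
                ]
        self.conv = nn.Sequential(*layers)
        self.out = nn.Linear(d_model * (feat_in // 8), d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_in) log-mel features
        x = self.conv(x.unsqueeze(1))   # (batch, d_model, time/8, feat_in/8)
        b, c, t, f = x.shape
        return self.out(x.transpose(1, 2).reshape(b, t, c * f))
```

Because the Conformer blocks then operate on a sequence one eighth the original length, the quadratic attention cost drops by roughly 64x relative to frame-rate attention, which is where most of the 2.8x end-to-end speedup plausibly comes from.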
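The long-form recipe swaps global self-attention for limited-context (sliding-window) attention plus a global token, in the style of Longformer. The sketch below builds the corresponding boolean attention mask; the window size and number of global tokens are illustrative assumptions, and a dense mask like this only demonstrates the pattern, since a practical implementation would use banded attention kernels to get O(n·w) rather than O(n²) memory.

```python
import torch

def limited_context_mask(seq_len: int, window: int = 128,
                         num_global: int = 1) -> torch.Tensor:
    """Boolean mask (True = may attend) for sliding-window attention
    with `num_global` prepended global tokens, Longformer-style."""
    n = num_global + seq_len
    idx = torch.arange(n)
    # Local band: token i attends to tokens within +/- window // 2.
    mask = (idx[None, :] - idx[:, None]).abs() <= window // 2
    # Global tokens attend everywhere and are attended to by everyone.
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    return mask

# Usage: pass as the attention mask of a standard attention call, e.g.
# torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)
mask = limited_context_mask(seq_len=4096, window=128)
```

Because each non-global token attends to a fixed-width window, attention cost grows linearly with sequence length, which is what makes transcribing hours-long audio in a single pass feasible.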
Authors: Dima Rekesh, Nithin Rao Koluguri, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Huang, Oleksii Hrinchuk, Krishna Puvvada, Ankur Kumar, Jagadeesh Balam, Boris Ginsburg