Zipformer: A faster and better encoder for automatic speech recognition (2310.11230v4)
Abstract: The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure where the middle stacks operate at lower frame rates; 2) a reorganized block structure with more modules, within which we re-use attention weights for efficiency; 3) a modified form of LayerNorm, called BiasNorm, that allows us to retain some length information; 4) new activation functions, SwooshR and SwooshL, that work better than Swish. We also propose a new optimizer, called ScaledAdam, which scales the update by each tensor's current scale to keep the relative change about the same, and also explicitly learns the parameter scale. It achieves faster convergence and better performance than Adam. Extensive experiments on the LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of our proposed Zipformer over other state-of-the-art ASR models. Our code is publicly available at https://github.com/k2-fsa/icefall.
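To make the abstract's components concrete, below is a minimal PyTorch sketch of three of the named ideas. The Swoosh constants and the BiasNorm formula follow the definitions in the paper body (SwooshR(x) = log(1 + e^(x-1)) - 0.08x - 0.313261687, SwooshL(x) = log(1 + e^(x-4)) - 0.08x - 0.035, BiasNorm(x) = x / RMS(x - b) * exp(gamma) with a learned per-channel bias b and scalar gamma); the class and function names here are illustrative, not the icefall API, and the ScaledAdam step shows only the relative-change scaling rule described in the abstract, not the full optimizer.

```python
# Illustrative sketch only: names are not the icefall API, and
# scaled_adam_style_step is a simplified caricature of ScaledAdam.
import torch
import torch.nn as nn
import torch.nn.functional as F


def swoosh_r(x: torch.Tensor) -> torch.Tensor:
    # SwooshR(x) = log(1 + exp(x - 1)) - 0.08x - 0.313261687.
    # The offset is log(1 + exp(-1)), so SwooshR(0) = 0.
    return F.softplus(x - 1.0) - 0.08 * x - 0.313261687


def swoosh_l(x: torch.Tensor) -> torch.Tensor:
    # SwooshL(x) = log(1 + exp(x - 4)) - 0.08x - 0.035.
    # The larger left shift keeps the output slightly negative near zero.
    return F.softplus(x - 4.0) - 0.08 * x - 0.035


class BiasNorm(nn.Module):
    """BiasNorm(x) = x / RMS(x - b) * exp(gamma).

    Unlike LayerNorm, it neither subtracts the mean nor forces every frame
    to exactly unit scale, so some length (magnitude) information survives.
    """

    def __init__(self, num_channels: int):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(num_channels))  # b, per channel
        self.log_scale = nn.Parameter(torch.zeros(()))       # gamma, scalar

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS over the channel dimension of (x - b), then rescale by exp(gamma).
        rms = ((x - self.bias) ** 2).mean(dim=-1, keepdim=True).sqrt().clamp(min=1e-8)
        return x / rms * self.log_scale.exp()


@torch.no_grad()
def scaled_adam_style_step(param: torch.Tensor, direction: torch.Tensor, lr: float):
    # Simplified version of ScaledAdam's key rule: scale the (Adam-normalized)
    # update direction by the tensor's current RMS, so each step changes every
    # tensor by roughly the same *relative* amount. The real optimizer also
    # explicitly learns each parameter's scale; that part is omitted here.
    param_rms = param.pow(2).mean().sqrt().clamp(min=1e-8)
    param.add_(direction, alpha=-lr * float(param_rms))
```

As a quick sanity check of the definitions, swoosh_r(torch.zeros(1)) is approximately 0, while swoosh_l(torch.zeros(1)) is about -0.017, reflecting SwooshL's left-shifted zero crossing.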
- Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Aishell-1: An open-source Mandarin speech corpus and a speech recognition baseline. In 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pp. 1–5, 2017.
- Listen, attend and spell. arXiv preprint arXiv:1508.01211, 2015.
- Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884–5888, 2018.
- Alex Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.
- Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376, 2006.
- Conformer: Convolution-augmented Transformer for Speech Recognition. In Proc. Interspeech 2020, pp. 5036–5040, 2020.
- ContextNet: Improving convolutional neural networks for automatic speech recognition with global context. arXiv preprint arXiv:2005.03191, 2020.
- MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141, 2018.
- Nextformer: A ConvNeXt augmented Conformer for end-to-end speech recognition. arXiv preprint arXiv:2206.14747, 2022.
- Fast and parallel decoding for transducer. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5, 2023.
- A comparative study on Transformer vs RNN in speech applications. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 449–456, 2019.
- E-Branchformer: Branchformer with enhanced merging for speech recognition. In IEEE Spoken Language Technology Workshop (SLT), pp. 84–91, 2023.
- Squeezeformer: An efficient Transformer for automatic speech recognition. Advances in Neural Information Processing Systems, 35:9361–9373, 2022.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Audio augmentation for speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
- QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6124–6128, 2020.
- Pruned RNN-T for fast, memory-efficient ASR training. arXiv preprint arXiv:2206.13236, 2022.
- NeMo: A toolkit for building AI applications using neural modules. arXiv preprint arXiv:1909.09577, 2019.
- Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288, 2019.
- A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986, 2022.
- Alignment restricted streaming recurrent neural network transducer. In IEEE Spoken Language Technology Workshop (SLT), 2021.
- Structured state space decoder for speech recognition and synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5, 2023.
- LibriSpeech: An ASR corpus based on public domain audio books. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210, 2015.
- SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.
- Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding. In International Conference on Machine Learning, pp. 17627–17643. PMLR, 2022.
- Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
- DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3505–3506, 2020.
- U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241, 2015.
- Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in Neural Information Processing Systems, 29, 2016.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Transformer-based acoustic modeling for hybrid speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6874–6878, 2020.
- Accelerating RNN-T training and inference using CTC guidance. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
- Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8):1240–1253, 2017.
- ESPnet: End-to-end speech processing toolkit. In Proceedings of Interspeech, pp. 2207–2211, 2018.
- WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit. In Proc. Interspeech, pp. 4054–4058, 2021.
- 3M: Multi-loss, multi-path and multi-level neural networks for speech recognition. In 13th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 170–174, 2022.
- Lhotse: A speech data representation library for the modern deep learning ecosystem. arXiv preprint arXiv:2110.12561, 2021.
- WenetSpeech: A 10000+ hours multi-domain Mandarin corpus for speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6182–6186, 2022.
- WeNet 2.0: More productive end-to-end speech recognition toolkit. arXiv preprint arXiv:2203.15455, 2022.
- Faster, simpler and more accurate hybrid ASR systems using wordpieces. arXiv preprint arXiv:2005.09150, 2020.
- Transformer Transducer: A streamable speech recognition model with Transformer encoders and RNN-T loss. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7829–7833, 2020.
- Towards end-to-end speech recognition with deep convolutional neural networks. arXiv preprint arXiv:1701.02720, 2017.