
Zipformer: A faster and better encoder for automatic speech recognition (2310.11230v4)

Published 17 Oct 2023 in eess.AS, cs.LG, and cs.SD

Abstract: The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure where middle stacks operate at lower frame rates; 2) reorganized block structure with more modules, within which we re-use attention weights for efficiency; 3) a modified form of LayerNorm called BiasNorm allows us to retain some length information; 4) new activation functions SwooshR and SwooshL work better than Swish. We also propose a new optimizer, called ScaledAdam, which scales the update by each tensor's current scale to keep the relative change about the same, and also explicitly learns the parameter scale. It achieves faster convergence and better performance than Adam. Extensive experiments on LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of our proposed Zipformer over other state-of-the-art ASR models. Our code is publicly available at https://github.com/k2-fsa/icefall.
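For reference, the SwooshR and SwooshL activations mentioned in the abstract are shifted, tilted variants of softplus. The sketch below reflects our reading of the paper's definitions; the constants are taken from the paper but should be verified against the released code.

```python
import torch
import torch.nn.functional as F

def swoosh_r(x: torch.Tensor) -> torch.Tensor:
    # SwooshR(x) = log(1 + exp(x - 1)) - 0.08x - 0.313261687
    # (the constant makes SwooshR(0) = 0)
    return F.softplus(x - 1.0) - 0.08 * x - 0.313261687

def swoosh_l(x: torch.Tensor) -> torch.Tensor:
    # SwooshL(x) = log(1 + exp(x - 4)) - 0.08x - 0.035
    return F.softplus(x - 4.0) - 0.08 * x - 0.035
```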


Summary

  • The paper introduces Zipformer, an encoder whose U-Net-like structure runs its middle stacks at reduced frame rates, cutting computational load while preserving the temporal resolution of the encoder output.
  • It redesigns the encoder block, adding a non-linear attention module that reuses attention weights and bypass modules that improve global information capture and training stability.
  • It proposes the ScaledAdam optimizer and the BiasNorm normalization layer, which yield faster convergence and lower error rates on standard ASR benchmarks.

Zipformer: A Faster and Better Encoder for Automatic Speech Recognition

The paper introduces Zipformer, a novel encoder architecture for automatic speech recognition (ASR) designed as an advancement over the widely used Conformer model. Zipformer addresses both performance and efficiency through several structural and algorithmic enhancements.

Encoder Structure and Efficiency Improvements

Zipformer adopts a U-Net-like encoder structure that operates at multiple frame rates. Whereas Conformer processes the whole sequence at a constant frame rate, Zipformer cascades six stacks that progressively downsample the sequence, so the most computation-heavy stacks run at the lowest frame rates before the resolution is restored. This reduces training time and memory use significantly while preserving the temporal resolution of the encoder output. A rough sketch of the idea follows.
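The PyTorch-style sketch below is only meant to evoke the cascade of stacks at varying frame rates. The downsampling factors (1, 2, 4, 8, 4, 2), the averaging-based downsampling, and the repetition-based upsampling are illustrative stand-ins; the paper uses learned downsampling/upsampling modules, per-stack depths and dimensions, and Zipformer blocks rather than vanilla transformer layers.

```python
import torch
import torch.nn as nn

class DownsampledStack(nn.Module):
    """One encoder stack that runs its layers at a reduced frame rate."""

    def __init__(self, dim: int, num_layers: int, factor: int):
        super().__init__()
        self.factor = factor
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C)
        T = x.shape[1]
        if self.factor > 1:
            pad = (-T) % self.factor
            x = nn.functional.pad(x, (0, 0, 0, pad))
            # Average groups of `factor` frames -> lower frame rate.
            x = x.view(x.shape[0], -1, self.factor, x.shape[2]).mean(dim=2)
        for layer in self.layers:
            x = layer(x)
        if self.factor > 1:
            # Repeat frames to restore the original frame rate.
            x = x.repeat_interleave(self.factor, dim=1)[:, :T]
        return x

# Six cascaded stacks; middle stacks run at progressively lower frame rates.
encoder = nn.Sequential(
    *(DownsampledStack(dim=256, num_layers=2, factor=f)
      for f in (1, 2, 4, 8, 4, 2))
)
out = encoder(torch.randn(2, 100, 256))  # (2, 100, 256)
```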

Architectural Enhancements

The authors also redesign the Conformer block, yielding the Zipformer block. Its distinguishing components include:

  1. Non-Linear Attention (NLA): a module that reuses the attention weights already computed within the block, capturing additional global information without recomputing attention.
  2. Bypass Modules: learned channel-wise combinations of each block's input and output that improve training stability (see the sketch below).

Together, these adaptations let each block perform more modeling work while keeping memory and computational demands in check.
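A minimal sketch of a bypass module, assuming the channel-wise interpolation form out = (1 - c) * x + c * y with a learned per-channel c; the clamp range used here is illustrative (the paper constrains c during training):

```python
import torch
import torch.nn as nn

class Bypass(nn.Module):
    """Learned channel-wise interpolation between block input and output.

    out = (1 - c) * x + c * y, written as x + c * (y - x).
    """

    def __init__(self, dim: int, min_val: float = 0.2):
        super().__init__()
        self.scale = nn.Parameter(torch.full((dim,), 0.5))  # c, per channel
        self.min_val = min_val

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        c = self.scale.clamp(min=self.min_val, max=1.0)
        return x + c * (y - x)

bypass = Bypass(dim=256)
x = torch.randn(2, 10, 256)   # block input
y = x + torch.randn_like(x)   # block output after its modules
out = bypass(x, y)            # channel-wise blend of x and y
```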

Optimization and Training Innovations

A distinguishing feature of the Zipformer recipe is ScaledAdam, a parameter-scale-invariant optimizer. ScaledAdam scales each update by the root-mean-square value of the corresponding parameter tensor, keeping the relative change per step roughly constant, and it also learns each tensor's overall scale explicitly. This yields faster convergence and better final performance than standard Adam. The learning rate follows the proposed Eden schedule, which avoids long warm-up periods and further shortens training. Together, these choices address the parameter-scaling and divergence issues often encountered when training large neural networks. A simplified sketch of the core update follows.
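As a rough illustration (not the authors' implementation), a single simplified ScaledAdam-style step might look like this. The per-tensor RMS scaling is the key idea; the explicit parameter-scale learning and the Eden schedule are omitted, and the hyperparameter values are illustrative.

```python
import torch

def scaled_adam_step(param, grad, state, lr=0.045, betas=(0.9, 0.98), eps=1e-8):
    """One simplified ScaledAdam-style update for a single tensor.

    Rescales the Adam direction by the parameter's current RMS, so the
    relative change per step stays roughly constant across tensors of
    different magnitudes.
    """
    b1, b2 = betas
    state["step"] += 1
    t = state["step"]
    state["m"].mul_(b1).add_(grad, alpha=1 - b1)             # first moment
    state["v"].mul_(b2).addcmul_(grad, grad, value=1 - b2)   # second moment
    m_hat = state["m"] / (1 - b1 ** t)                       # bias-corrected
    v_hat = state["v"] / (1 - b2 ** t)
    param_rms = param.pow(2).mean().sqrt().clamp(min=1e-5)   # current scale
    param.add_(-lr * param_rms * m_hat / (v_hat.sqrt() + eps))

# Toy usage on a single tensor.
p = torch.randn(4, 4) * 0.1
st = {"step": 0, "m": torch.zeros_like(p), "v": torch.zeros_like(p)}
scaled_adam_step(p, torch.randn_like(p), st)
```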

The authors also replace LayerNorm with BiasNorm. Instead of subtracting the mean, BiasNorm subtracts a learned per-channel bias before computing the normalizing statistic, so some of the vector's length information survives normalization; the authors find this avoids the degenerate behaviors and "dead" modules they observe with LayerNorm. A minimal sketch follows.
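A minimal sketch of BiasNorm following the paper's formula y = x / RMS(x - b) * exp(gamma), where b is a learned per-channel bias and gamma a learned scalar; the epsilon term is our addition for numerical safety:

```python
import torch
import torch.nn as nn

class BiasNorm(nn.Module):
    """BiasNorm: y = x / RMS(x - b) * exp(gamma).

    Unlike LayerNorm there is no mean subtraction of x itself, so some
    length information survives normalization.
    """

    def __init__(self, dim: int, eps: float = 1e-8):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(dim))      # b
        self.log_scale = nn.Parameter(torch.zeros(()))  # gamma
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (..., dim)
        rms = (x - self.bias).pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.log_scale.exp()

norm = BiasNorm(dim=256)
y = norm(torch.randn(2, 10, 256))
```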

Experiments and Results

Comprehensive experiments on LibriSpeech, Aishell-1, and WenetSpeech demonstrate Zipformer's advantage over existing state-of-the-art ASR models. Notably, Zipformer-M and Zipformer-L achieve lower word error rates (WERs) than Conformer and Squeezeformer at reduced FLOPs, a substantial efficiency gain. The results hold across multiple configurations (e.g., CTC and pruned transducer) and model scales.

Implications and Future Directions

The development of Zipformer signifies an important step towards optimizing ASR models for real-world applications, where computational efficiency is as critical as accuracy. The introduction of scalable architectures and adaptive optimization strategies could inspire future AI research, particularly in domains demanding high processing speeds and low resource consumption.

Theoretical and empirical insights from Zipformer could inform future deep learning models in various domains, promoting further innovations in encoder-decoder architectures. Future work might explore extending Zipformer’s architecture to other sequence modeling tasks, further validating its versatility and adaptability across different AI challenges.

Overall, this work contributes meaningful techniques to the ASR field and provides pathways for further exploration in enhancing model efficiency and capability.
