Zipformer: A faster and better encoder for automatic speech recognition (2310.11230v4)
Abstract: The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure where the middle stacks operate at lower frame rates; 2) a reorganized block structure with more modules, within which we re-use attention weights for efficiency; 3) a modified form of LayerNorm, called BiasNorm, that allows us to retain some length information; 4) new activation functions, SwooshR and SwooshL, that work better than Swish. We also propose a new optimizer, called ScaledAdam, which scales the update by each tensor's current scale to keep the relative change about the same, and also explicitly learns the parameter scale. It achieves faster convergence and better performance than Adam. Extensive experiments on the LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of our proposed Zipformer over other state-of-the-art ASR models. Our code is publicly available at https://github.com/k2-fsa/icefall.
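To make the abstract's components concrete, below is a minimal PyTorch sketch of three of the named ideas. The Swoosh constants and the BiasNorm formula follow the definitions in the paper body (SwooshR(x) = log(1 + e^(x-1)) - 0.08x - 0.313261687, SwooshL(x) = log(1 + e^(x-4)) - 0.08x - 0.035, BiasNorm(x) = x / RMS(x - b) * exp(gamma) with a learned per-channel bias b and scalar gamma); the class and function names here are illustrative, not the icefall API, and the ScaledAdam step shows only the relative-change scaling rule described in the abstract, not the full optimizer.

```python
# Illustrative sketch only: names are not the icefall API, and
# scaled_adam_style_step is a simplified caricature of ScaledAdam.
import torch
import torch.nn as nn
import torch.nn.functional as F


def swoosh_r(x: torch.Tensor) -> torch.Tensor:
    # SwooshR(x) = log(1 + exp(x - 1)) - 0.08x - 0.313261687.
    # The offset is log(1 + exp(-1)), so SwooshR(0) = 0.
    return F.softplus(x - 1.0) - 0.08 * x - 0.313261687


def swoosh_l(x: torch.Tensor) -> torch.Tensor:
    # SwooshL(x) = log(1 + exp(x - 4)) - 0.08x - 0.035.
    # The larger left shift keeps the output slightly negative near zero.
    return F.softplus(x - 4.0) - 0.08 * x - 0.035


class BiasNorm(nn.Module):
    """BiasNorm(x) = x / RMS(x - b) * exp(gamma).

    Unlike LayerNorm, it neither subtracts the mean nor forces every frame
    to exactly unit scale, so some length (magnitude) information survives.
    """

    def __init__(self, num_channels: int):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(num_channels))  # b, per channel
        self.log_scale = nn.Parameter(torch.zeros(()))       # gamma, scalar

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS over the channel dimension of (x - b), then rescale by exp(gamma).
        rms = ((x - self.bias) ** 2).mean(dim=-1, keepdim=True).sqrt().clamp(min=1e-8)
        return x / rms * self.log_scale.exp()


@torch.no_grad()
def scaled_adam_style_step(param: torch.Tensor, direction: torch.Tensor, lr: float):
    # Simplified version of ScaledAdam's key rule: scale the (Adam-normalized)
    # update direction by the tensor's current RMS, so each step changes every
    # tensor by roughly the same *relative* amount. The real optimizer also
    # explicitly learns each parameter's scale; that part is omitted here.
    param_rms = param.pow(2).mean().sqrt().clamp(min=1e-8)
    param.add_(direction, alpha=-lr * float(param_rms))
```

As a quick sanity check of the definitions, swoosh_r(torch.zeros(1)) is approximately 0, while swoosh_l(torch.zeros(1)) is about -0.017, reflecting SwooshL's left-shifted zero crossing.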
- Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Aishell-1: An open-source Mandarin speech corpus and a speech recognition baseline. In 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pp. 1–5, 2017.
- Listen, attend and spell. arXiv preprint arXiv:1508.01211, 2015.
- Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884–5888, 2018.
- Alex Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.
- Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376, 2006.
- Conformer: Convolution-augmented Transformer for Speech Recognition. In Proc. Interspeech 2020, pp. 5036–5040, 2020.
- ContextNet: Improving convolutional neural networks for automatic speech recognition with global context. arXiv preprint arXiv:2005.03191, 2020.
- MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141, 2018.
- Nextformer: A ConvNeXt augmented Conformer for end-to-end speech recognition. arXiv preprint arXiv:2206.14747, 2022.
- Fast and parallel decoding for transducer. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5, 2023.
- A comparative study on Transformer vs RNN in speech applications. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 449–456, 2019.
- E-Branchformer: Branchformer with enhanced merging for speech recognition. In IEEE Spoken Language Technology Workshop (SLT), pp. 84–91, 2023.
- Squeezeformer: An efficient Transformer for automatic speech recognition. Advances in Neural Information Processing Systems, 35:9361–9373, 2022.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Audio augmentation for speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
- QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6124–6128, 2020.
- Pruned RNN-T for fast, memory-efficient ASR training. arXiv preprint arXiv:2206.13236, 2022.
- NeMo: A toolkit for building AI applications using neural modules. arXiv preprint arXiv:1909.09577, 2019.
- Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288, 2019.
- A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986, 2022.
- Alignment restricted streaming recurrent neural network transducer. In IEEE Spoken Language Technology Workshop (SLT), 2021.
- Structured state space decoder for speech recognition and synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5, 2023.
- LibriSpeech: An ASR corpus based on public domain audio books. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210, 2015.
- SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.
- Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding. In International Conference on Machine Learning, pp. 17627–17643. PMLR, 2022.
- Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
- DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3505–3506, 2020.
- U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241, 2015.
- Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in Neural Information Processing Systems, 29, 2016.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Transformer-based acoustic modeling for hybrid speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6874–6878, 2020.
- Accelerating RNN-T training and inference using CTC guidance. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
- Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8):1240–1253, 2017.
- ESPnet: End-to-end speech processing toolkit. In Proceedings of Interspeech, pp. 2207–2211, 2018.
- WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit. In Proc. Interspeech, pp. 4054–4058, 2021.
- 3M: Multi-loss, multi-path and multi-level neural networks for speech recognition. In 13th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 170–174, 2022.
- Lhotse: A speech data representation library for the modern deep learning ecosystem. arXiv preprint arXiv:2110.12561, 2021.
- WenetSpeech: A 10000+ hours multi-domain Mandarin corpus for speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6182–6186, 2022.
- WeNet 2.0: More productive end-to-end speech recognition toolkit. arXiv preprint arXiv:2203.15455, 2022.
- Faster, simpler and more accurate hybrid ASR systems using wordpieces. arXiv preprint arXiv:2005.09150, 2020.
- Transformer Transducer: A streamable speech recognition model with Transformer encoders and RNN-T loss. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7829–7833, 2020.
- Towards end-to-end speech recognition with deep convolutional neural networks. arXiv preprint arXiv:1701.02720, 2017.