Emformer: Efficient Memory Transformer for Low Latency Streaming Speech Recognition
This paper presents Emformer, an efficient memory transformer for low-latency streaming speech recognition. Building on the augmented memory transformer (AM-TRF), Emformer distills long-range history context into an augmented memory bank, which sharply reduces the cost of self-attention over the full history. On top of this, a cache mechanism stores the key and value computations for the left context from previous segments, eliminating redundant work and improving efficiency.
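To make the core idea concrete, here is a minimal NumPy sketch of block processing with an augmented memory bank. It is an illustration under simplifying assumptions (single attention head, no learned projections, mean pooling as the chunk summary), not the paper's implementation; all identifiers are invented:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product attention: single head, no learned projections.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def stream_with_memory(chunks, left_context=4, memory_size=3):
    # Process chunks left to right; each chunk attends to a small memory
    # bank plus a short left context instead of the full history.
    dim = chunks[0].shape[1]
    memory_bank = np.zeros((0, dim))  # distilled history, one row per past chunk
    left = np.zeros((0, dim))         # raw frames carried over as left context
    outputs = []
    for chunk in chunks:
        context = np.concatenate([memory_bank, left, chunk], axis=0)
        outputs.append(attend(chunk, context, context))
        # A mean-pooled summary distills the whole chunk into one memory slot,
        # so the attention context stays bounded however long the audio gets.
        summary = chunk.mean(axis=0, keepdims=True)
        memory_bank = np.concatenate([memory_bank, summary], axis=0)[-memory_size:]
        left = np.concatenate([left, chunk], axis=0)[-left_context:]
    return np.concatenate(outputs, axis=0)

# Five chunks of four 8-dim frames: the attention context never exceeds
# memory_size + left_context + chunk length, instead of growing with history.
out = stream_with_memory([np.random.randn(4, 8) for _ in range(5)])
```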
Key Aspects of Emformer
The Emformer architecture incorporates several enhancements over AM-TRF:
- Caching Strategy: Emformer caches the key and value projections of the left-context frames from previous segments instead of recomputing them as AM-TRF does. This removes duplicated computation and yields savings of up to 91% in low latency scenarios (a sketch of this caching appears after this list).
- Parallelized Training: Emformer processes all segments in parallel during training, removing AM-TRF's sequential block-by-block dependency. This is essential for efficient GPU training, especially in low latency configurations where short chunks make the sequential loop long.
- Memory Carryover from Lower Layers: Rather than taking a layer's memory vector from that same layer's output on the previous segment, as AM-TRF does, Emformer carries the memory vector over from the layer below. Because the lower layer's output is already available for every segment, this removes the cross-segment dependency, enabling parallel processing and helping to stabilize training (see the second sketch after this list).
- Improved Attention Mechanism: Emformer disables the attention between each chunk's summary vector and the memory bank. In AM-TRF this attention forms a recurrence-like path across memory vectors, exposing training to the gradient vanishing and explosion problems familiar from recurrent neural networks; removing it improves training stability and recognition accuracy.
- Elimination of Right Context Leaking: With naive parallel training, look-ahead frames can leak information from future chunks. Emformer avoids this by making a hard copy of each chunk's look-ahead context and placing it at the start of the input sequence during training, ensuring every frame attends only to its own chunk and its designated right-context frames.
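The caching strategy from the first bullet can be sketched in a few lines. This is a single-head illustration with invented identifiers, assuming that each new segment's left context consists of the trailing frames of the previous segment, so their key/value projections can simply be kept around:

```python
import numpy as np

class CachedAttentionLayer:
    # Illustrative single-head attention with Emformer-style left-context
    # caching: key/value projections are never recomputed for frames that
    # reappear as the next segment's left context.
    def __init__(self, dim, left_context, seed=0):
        rng = np.random.default_rng(seed)
        self.Wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.Wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.Wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.left_context = left_context
        self.k_cache = np.zeros((0, dim))  # cached left-context keys
        self.v_cache = np.zeros((0, dim))  # cached left-context values

    def forward(self, chunk):
        q = chunk @ self.Wq
        k_new = chunk @ self.Wk
        v_new = chunk @ self.Wv
        # Reuse the cached left-context projections instead of re-projecting.
        k = np.concatenate([self.k_cache, k_new], axis=0)
        v = np.concatenate([self.v_cache, v_new], axis=0)
        scores = q @ k.T / np.sqrt(q.shape[-1])
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        # The chunk's trailing frames become the next segment's left context.
        self.k_cache = k_new[-self.left_context:]
        self.v_cache = v_new[-self.left_context:]
        return weights @ v

layer = CachedAttentionLayer(dim=8, left_context=4)
for chunk in (np.random.randn(4, 8) for _ in range(5)):
    out = layer.forward(chunk)
```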
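The memory-carryover change is what makes parallel training possible: if layer n takes its memory vectors from layer n-1's output, those vectors exist for every segment before layer n runs, so nothing forces a sequential loop over segments. A hypothetical sketch of that pooling step:

```python
import numpy as np

def memory_from_lower_layer(lower_output, chunk_size):
    # Pool one memory vector per segment from the lower layer's full-sequence
    # output. Since that output is complete before the current layer runs,
    # every segment's memory input is available up front, and the current
    # layer can process all segments in parallel. (AM-TRF instead took the
    # memory from the same layer's previous segment, forcing a serial loop.)
    n_frames, dim = lower_output.shape
    n_chunks = n_frames // chunk_size
    chunks = lower_output[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, dim)
    return chunks.mean(axis=1)  # shape: (n_chunks, dim)

x = np.random.randn(20, 8)                      # lower layer output: 5 chunks of 4 frames
mem = memory_from_lower_layer(x, chunk_size=4)  # (5, 8): one memory vector per segment
```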
Experimental Results and Performance
The Emformer architecture was evaluated on the LibriSpeech corpus, a standard benchmark for speech recognition. Key outcomes include:
- Comparison with AM-TRF: Emformer trained 4.6 times faster and reduced the decoding real-time factor (RTF) by 18%. It also cut word error rate (WER), with relative reductions of 17% on test-clean and 9% on test-other.
- Low Latency Scenario: With an average latency of 80 ms, Emformer reached WERs of 3.01% on test-clean and 7.09% on test-other, demonstrating its suitability for low latency streaming recognition (a note on how such latency figures are typically computed follows this list).
- Hybrid Systems: In a hybrid setup, Emformer outperformed conventional LSTM baselines in both RTF and WER, reaching 2.50% on test-clean and 5.62% on test-other under comparable average-latency constraints.
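On the latency figures: chunk-based streaming models are commonly credited with an average algorithmic latency of half the chunk length plus any look-ahead, since a frame in the middle of a chunk waits half a chunk before the chunk can be processed. The concrete 160 ms / 0 ms split below is an illustrative assumption, not a configuration quoted from the paper:

```python
def average_latency_ms(chunk_ms, lookahead_ms):
    # A frame waits, on average, half the chunk before the chunk is complete,
    # plus any right context (look-ahead) required before emission.
    return chunk_ms / 2 + lookahead_ms

print(average_latency_ms(160, 0))  # 80.0, consistent with the 80 ms setting above
```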
Implications and Future Directions
The paper makes measurable progress on the computational-efficiency and latency challenges of streaming speech recognition. Emformer's gains over prior models are a promising step for ASR applications that demand real-time processing. The work also invites further exploration of transformer-based streaming models, for example expanded memory mechanisms or attention strategies that balance accuracy against resource use, as well as applications beyond spoken language recognition in domains that require fast, reliable streaming analysis.