Emformer: Efficient Memory Transformer for Low Latency Streaming Speech Recognition
This paper presents Emformer, an efficient memory transformer for low-latency streaming speech recognition. Building on the augmented memory transformer (AM-TRF), Emformer distills long-range history context into an augmented memory bank, which sharply reduces the cost of self-attention over the full history. On top of this, a cache mechanism stores the key and value computations for the left context from previous segments, eliminating redundant work and improving efficiency.
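To make the core idea concrete, here is a minimal NumPy sketch of block processing with an augmented memory bank. It is an illustration under simplifying assumptions (single attention head, no learned projections, mean pooling as the chunk summary), not the paper's implementation; all identifiers are invented:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product attention: single head, no learned projections.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def stream_with_memory(chunks, left_context=4, memory_size=3):
    # Process chunks left to right; each chunk attends to a small memory
    # bank plus a short left context instead of the full history.
    dim = chunks[0].shape[1]
    memory_bank = np.zeros((0, dim))  # distilled history, one row per past chunk
    left = np.zeros((0, dim))         # raw frames carried over as left context
    outputs = []
    for chunk in chunks:
        context = np.concatenate([memory_bank, left, chunk], axis=0)
        outputs.append(attend(chunk, context, context))
        # A mean-pooled summary distills the whole chunk into one memory slot,
        # so the attention context stays bounded however long the audio gets.
        summary = chunk.mean(axis=0, keepdims=True)
        memory_bank = np.concatenate([memory_bank, summary], axis=0)[-memory_size:]
        left = np.concatenate([left, chunk], axis=0)[-left_context:]
    return np.concatenate(outputs, axis=0)

# Five chunks of four 8-dim frames: the attention context never exceeds
# memory_size + left_context + chunk length, instead of growing with history.
out = stream_with_memory([np.random.randn(4, 8) for _ in range(5)])
```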
Key Aspects of Emformer
The Emformer architecture incorporates several enhancements over AM-TRF:
- Caching Strategy: Emformer caches the key and value projections of the left-context frames from previous segments instead of recomputing them as AM-TRF does. This removes duplicated computation and yields savings of up to 91% in low latency scenarios (a sketch of this caching appears after this list).
- Parallelized Training: Emformer processes all segments in parallel during training, removing AM-TRF's sequential block-by-block dependency. This is essential for efficient GPU training, especially in low latency configurations where short chunks make the sequential loop long.
- Memory Carryover from Lower Layers: Rather than taking a layer's memory vector from that same layer's output on the previous segment, as AM-TRF does, Emformer carries the memory vector over from the layer below. Because the lower layer's output is already available for every segment, this removes the cross-segment dependency, enabling parallel processing and helping to stabilize training (see the second sketch after this list).
- Improved Attention Mechanism: Emformer disables the attention between each chunk's summary vector and the memory bank. In AM-TRF this attention forms a recurrence-like path across memory vectors, exposing training to the gradient vanishing and explosion problems familiar from recurrent neural networks; removing it improves training stability and recognition accuracy.
- Elimination of Right Context Leaking: With naive parallel training, look-ahead frames can leak information from future chunks. Emformer avoids this by making a hard copy of each chunk's look-ahead context and placing it at the start of the input sequence during training, ensuring every frame attends only to its own chunk and its designated right-context frames.
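The caching strategy from the first bullet can be sketched in a few lines. This is a single-head illustration with invented identifiers, assuming that each new segment's left context consists of the trailing frames of the previous segment, so their key/value projections can simply be kept around:

```python
import numpy as np

class CachedAttentionLayer:
    # Illustrative single-head attention with Emformer-style left-context
    # caching: key/value projections are never recomputed for frames that
    # reappear as the next segment's left context.
    def __init__(self, dim, left_context, seed=0):
        rng = np.random.default_rng(seed)
        self.Wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.Wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.Wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.left_context = left_context
        self.k_cache = np.zeros((0, dim))  # cached left-context keys
        self.v_cache = np.zeros((0, dim))  # cached left-context values

    def forward(self, chunk):
        q = chunk @ self.Wq
        k_new = chunk @ self.Wk
        v_new = chunk @ self.Wv
        # Reuse the cached left-context projections instead of re-projecting.
        k = np.concatenate([self.k_cache, k_new], axis=0)
        v = np.concatenate([self.v_cache, v_new], axis=0)
        scores = q @ k.T / np.sqrt(q.shape[-1])
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        # The chunk's trailing frames become the next segment's left context.
        self.k_cache = k_new[-self.left_context:]
        self.v_cache = v_new[-self.left_context:]
        return weights @ v

layer = CachedAttentionLayer(dim=8, left_context=4)
for chunk in (np.random.randn(4, 8) for _ in range(5)):
    out = layer.forward(chunk)
```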
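The memory-carryover change is what makes parallel training possible: if layer n takes its memory vectors from layer n-1's output, those vectors exist for every segment before layer n runs, so nothing forces a sequential loop over segments. A hypothetical sketch of that pooling step:

```python
import numpy as np

def memory_from_lower_layer(lower_output, chunk_size):
    # Pool one memory vector per segment from the lower layer's full-sequence
    # output. Since that output is complete before the current layer runs,
    # every segment's memory input is available up front, and the current
    # layer can process all segments in parallel. (AM-TRF instead took the
    # memory from the same layer's previous segment, forcing a serial loop.)
    n_frames, dim = lower_output.shape
    n_chunks = n_frames // chunk_size
    chunks = lower_output[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, dim)
    return chunks.mean(axis=1)  # shape: (n_chunks, dim)

x = np.random.randn(20, 8)                      # lower layer output: 5 chunks of 4 frames
mem = memory_from_lower_layer(x, chunk_size=4)  # (5, 8): one memory vector per segment
```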
Experimental Results and Performance
The Emformer architecture was evaluated on the LibriSpeech corpus, a standard benchmark for speech recognition. Key outcomes include:
- Comparison with AM-TRF: Emformer trained 4.6 times faster and reduced the decoding real-time factor (RTF) by 18%. It also cut word error rate (WER), with relative reductions of 17% on test-clean and 9% on test-other.
- Low Latency Scenario: With an average latency of 80 ms, Emformer reached WERs of 3.01% on test-clean and 7.09% on test-other, demonstrating its suitability for low latency streaming recognition (a note on how such latency figures are typically computed follows this list).
- Hybrid Systems: In a hybrid setup, Emformer outperformed conventional LSTM baselines in both RTF and WER, reaching 2.50% on test-clean and 5.62% on test-other under comparable average-latency constraints.
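On the latency figures: chunk-based streaming models are commonly credited with an average algorithmic latency of half the chunk length plus any look-ahead, since a frame in the middle of a chunk waits half a chunk before the chunk can be processed. The concrete 160 ms / 0 ms split below is an illustrative assumption, not a configuration quoted from the paper:

```python
def average_latency_ms(chunk_ms, lookahead_ms):
    # A frame waits, on average, half the chunk before the chunk is complete,
    # plus any right context (look-ahead) required before emission.
    return chunk_ms / 2 + lookahead_ms

print(average_latency_ms(160, 0))  # 80.0, consistent with the 80 ms setting above
```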
Implications and Future Directions
The paper makes measurable progress on the computational-efficiency and latency challenges of streaming speech recognition. Emformer's gains over prior models are a promising step for ASR applications that demand real-time processing. The work also invites further exploration of transformer-based streaming models, for example expanded memory mechanisms or attention strategies that balance accuracy against resource use, as well as applications beyond spoken language recognition in domains that require fast, reliable streaming analysis.