Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss
The paper introduces the Transformer Transducer (T-T), an end-to-end speech recognition model that replaces the recurrent encoders of the RNN-T (Recurrent Neural Network Transducer) architecture with Transformer encoders while keeping the RNN-T loss, a training objective that is well suited to streaming recognition.
Overview
Traditionally, RNNs have been the preferred choice for end-to-end automatic speech recognition (ASR) because they capture temporal dependencies in audio features effectively. However, their strictly sequential computation limits parallelism and makes training slow. This paper substitutes Transformer encoders for the RNNs, so the self-attention layers can be computed in parallel across time, substantially reducing training time.
The proposed model encodes the audio and label sequences independently with stacks of Transformer layers. Because unrestricted self-attention would make per-frame computation grow with utterance length, a typical difficulty with Transformer models, the attention in each layer is limited to a fixed context window. This keeps computation per frame constant, which is critical for real-time deployment.
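A minimal sketch of how such a limited-context attention mask could be constructed is shown below; the function name `build_context_mask` and the NumPy implementation are illustrative assumptions, not code from the paper.

```python
import numpy as np

def build_context_mask(num_frames: int, left_context: int, right_context: int) -> np.ndarray:
    """Boolean mask where mask[t, s] is True if frame t may attend to frame s.

    Bounding left_context and right_context keeps the attention cost per
    frame constant, independent of utterance length.
    """
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    for t in range(num_frames):
        start = max(0, t - left_context)
        end = min(num_frames, t + right_context + 1)
        mask[t, start:end] = True
    return mask

# Example: 10 frames, each attending to 3 past frames, itself, and 1 future frame.
print(build_context_mask(10, left_context=3, right_context=1).astype(int))
```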
Model Architecture
The T-T model keeps the overall structure of RNN-T: an audio encoder and a label encoder process their sequences independently through self-attention based Transformer layers, and a joint feed-forward network combines the two representations to predict the next output label. Training uses the RNN-T loss, which marginalizes over all possible alignments between audio frames and labels and is naturally suited to streaming recognition.
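For reference, the transducer formulation the model builds on can be written as below. The specific joint-network parameterization (the projections W_a, W_l, W_o and the tanh nonlinearity) is the common RNN-T form, stated here as an assumption rather than a detail confirmed by the paper.

```latex
% Output distribution at audio frame t, after u labels have been emitted:
% a feed-forward joint network combines the two encoder states.
P(\hat{y} \mid t, u) = \operatorname{Softmax}\!\big(W_o \tanh\!\big(W_a\,\mathrm{AudioEnc}(x)_t + W_l\,\mathrm{LabelEnc}(y_{1:u})\big)\big)

% RNN-T loss: negative log-likelihood of the target sequence y given audio x,
% marginalizing over all blank-augmented alignments z that collapse to y.
\mathcal{L}_{\mathrm{RNN\text{-}T}} = -\log P(y \mid x)
  = -\log \sum_{z \in \mathcal{A}(x,\, y)} \prod_{i} P(z_i \mid t_i, u_i)
```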
Key Components:
- Audio and Label Encoders: Utilize Transformer layers, replacing LSTMs traditionally used in RNN-T systems.
- Joint Network: A feed-forward network that combines the audio and label encoder outputs into a probability distribution over output labels plus the blank symbol (see the sketch after this list).
- Self-Attention Masking: Attention is restricted to a bounded window of past (and optionally a few future) states to meet streaming requirements while maintaining model efficiency.
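The sketch below shows how these components might fit together numerically. The function name `joint_network`, the projection matrices `W_a`, `W_l`, `W_o`, and the toy dimensions are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_network(audio_enc, label_enc, W_a, W_l, W_o):
    """Combine audio-encoder outputs (T, D_a) and label-encoder outputs (U+1, D_l)
    into a (T, U+1, V) grid of label distributions; V includes the blank symbol."""
    a = audio_enc @ W_a                               # (T, H)
    l = label_enc @ W_l                               # (U+1, H)
    hidden = np.tanh(a[:, None, :] + l[None, :, :])   # (T, U+1, H)
    return softmax(hidden @ W_o)                      # (T, U+1, V)

# Toy shapes: 50 audio frames, 8 target labels, hidden size 16, vocab 30 + blank.
T, U, D_a, D_l, H, V = 50, 8, 32, 32, 16, 31
probs = joint_network(rng.normal(size=(T, D_a)), rng.normal(size=(U + 1, D_l)),
                      rng.normal(size=(D_a, H)), rng.normal(size=(D_l, H)),
                      rng.normal(size=(H, V)))
print(probs.shape)  # (50, 9, 31)
```

The RNN-T loss then sums, over this (T, U+1) grid, the probabilities of every blank-augmented path that emits the target label sequence.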
Experimental Results
The experimental evaluation on the LibriSpeech dataset reveals several key findings:
- Performance: The full attention variant of the T-T model surpasses state-of-the-art results on the LibriSpeech benchmarks.
- Training Efficiency: Transformer Transducers train significantly faster than LSTM-based RNN-T models, reaching competitive accuracy with substantially less training time.
- Streamability: The paper introduces a limited context attention mechanism, ensuring constant time complexity for frame processing, crucial for streamable systems.
The reported word error rates (WERs) show that the T-T model remains strong even with restricted attention: allowing each layer to attend to a small number of future frames recovers much of the accuracy of full attention, providing an explicit knob for trading latency against accuracy.
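Because look-ahead compounds across layers, the total algorithmic latency of a limited-future-context model is roughly the per-layer right context multiplied by the number of layers. The sketch below makes this arithmetic explicit; the layer count and frame duration are illustrative values, not the paper's exact configuration.

```python
def total_lookahead_ms(num_layers: int, right_context_frames: int, frame_ms: float) -> float:
    """Total algorithmic look-ahead when each of num_layers self-attention layers
    peeks right_context_frames frames into the future: the top layer's view of
    frame t depends on inputs up to t + num_layers * right_context_frames."""
    return num_layers * right_context_frames * frame_ms

# Example: 18 encoder layers, 1 future frame per layer, 30 ms per frame (illustrative).
print(total_lookahead_ms(18, 1, 30.0))  # 540.0 ms
```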
Implications and Future Directions
The implications of this research are profound, providing a framework that successfully integrates Transformer architectures in streaming applications. By demonstrating the ability to manage computational efficiency while retaining high accuracy, this paper introduces a flexible and powerful tool for advancing ASR technologies.
Theoretical implications involve the prospects of further integrating attention mechanisms with loss frameworks conducive to sequence alignments. Practically, the ability to optimize streaming ASR models with reduced latency impacts a range of applications, from real-time voice assistants to mobile device speech recognition.
Future work may explore:
- Layer-specific Context Masking: Experimentation with different contexts for various layers.
- Extension to Other Sequence Tasks: Applying the architecture to other sequence transduction problems, such as machine translation or language modeling, might yield further insights.
In conclusion, the Transformer Transducer model represents a significant step in evolving ASR technologies, merging end-to-end learning paradigms with efficient, real-time processing capabilities.