
Developing Real-time Streaming Transformer Transducer for Speech Recognition on Large-scale Dataset (2010.11395v3)

Published 22 Oct 2020 in cs.CL and eess.AS

Abstract: Recently, Transformer based end-to-end models have achieved great success in many areas including speech recognition. However, compared to LSTM models, the heavy computational cost of the Transformer during inference is a key issue preventing their application. In this work, we explored the potential of Transformer Transducer (T-T) models for first-pass decoding with low latency and fast speed on a large-scale dataset. We combine the idea of Transformer-XL and chunk-wise streaming processing to design a streamable Transformer Transducer model. We demonstrate that T-T outperforms the hybrid model, RNN Transducer (RNN-T), and streamable Transformer attention-based encoder-decoder model in the streaming scenario. Furthermore, the runtime cost and latency can be optimized with a relatively small look-ahead.

Real-Time Streaming Transformer Transducer for Speech Recognition

The paper "Developing Real-Time Streaming Transformer Transducer for Speech Recognition on Large-Scale Dataset" introduces a novel approach to automatic speech recognition (ASR) using Transformer-based end-to-end (E2E) models, focusing on a streaming inference scenario. The research addresses significant challenges in deploying ASR models, such as latency, computational cost, and overall performance, by innovatively integrating a Transformer Transducer (T-T) model with efficient streaming capabilities.

The authors investigate the limitations of Recurrent Neural Network Transducer (RNN-T) models in real-time applications and present the Transformer Transducer (T-T) model as an effective alternative that surpasses RNN-T, the hybrid model, and the streamable Transformer attention-based encoder-decoder model in streaming scenarios. The paper leverages a large dataset of 65,000 hours of anonymized training data to benchmark the performance of the proposed model configurations.

A core contribution of the work is the design of a streamable Transformer Transducer model, which combines elements from Transformer-XL and chunk-wise streaming processing. The research introduces an optimized masking strategy that permits truncated history while allowing limited lookahead, maintaining computational efficiency and reducing latency. This matters because self-attention over full sequences has quadratic computational complexity, which is impractical for real-time recognition tasks.
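The masking idea described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the function name, chunk layout, and parameterization (`chunk_size` frames per chunk, `left_chunks` of retained history) are assumptions; frames within a chunk attend to each other, which supplies the limited lookahead.

```python
import numpy as np

def chunkwise_attention_mask(num_frames, chunk_size, left_chunks):
    """Boolean attention mask for chunk-wise streaming (illustrative sketch).

    Each frame may attend to all frames in its own chunk (providing limited
    lookahead within the chunk) plus `left_chunks` preceding chunks of
    truncated history. True means "may attend".
    """
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    for t in range(num_frames):
        chunk_idx = t // chunk_size
        start = max(0, (chunk_idx - left_chunks) * chunk_size)  # truncated history
        end = min(num_frames, (chunk_idx + 1) * chunk_size)     # end of own chunk
        mask[t, start:end] = True
    return mask

# 8 frames, chunks of 2, one chunk of retained history per frame.
mask = chunkwise_attention_mask(num_frames=8, chunk_size=2, left_chunks=1)
```

Because the history window is fixed, per-frame attention cost is constant rather than growing with the utterance length.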

Model Architecture

The proposed Transducer model consists of an acoustic encoder network using the Transformer architecture, a label predictor network using an LSTM architecture, and a joint network that consolidates outputs from both the encoder and predictor components. By adopting a Transformer encoder, the model capitalizes on multi-head self-attention mechanisms, which provide flexible context handling through an adaptive attention mask design.
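A minimal sketch of the joint network's role, under the common transducer formulation (projected encoder and predictor states combined additively, a nonlinearity, then vocabulary logits); the function name, dimensions, and tanh choice are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np

def joint_network(enc, pred, W_enc, W_pred, b, W_out):
    """Combine encoder states (T, d_enc) and predictor states (U, d_pred)
    into per-(frame, label) vocabulary logits of shape (T, U, vocab)."""
    e = enc @ W_enc                                  # (T, d_joint)
    p = pred @ W_pred                                # (U, d_joint)
    h = np.tanh(e[:, None, :] + p[None, :, :] + b)   # broadcast to (T, U, d_joint)
    return h @ W_out                                 # (T, U, vocab)

rng = np.random.default_rng(0)
enc = rng.standard_normal((3, 4))    # 3 acoustic frames
pred = rng.standard_normal((2, 5))   # 2 predictor (label-history) states
logits = joint_network(enc, pred,
                       rng.standard_normal((4, 6)),
                       rng.standard_normal((5, 6)),
                       rng.standard_normal(6),
                       rng.standard_normal((6, 10)))  # 10-symbol vocabulary
```

The (T, U) grid of logits is what transducer training and beam search decode over, with a blank symbol governing frame advancement.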

To further optimize runtime cost and mitigate latency, the authors implement a caching mechanism that stores intermediate key-value pairs. This avoids redundant computation across chunks and enables faster inference through multi-frame parallelization.
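The caching idea can be sketched as a per-layer key/value buffer; this is a hypothetical illustration under the truncated-history design, not the paper's implementation (the class name, shapes, and trimming policy are assumptions):

```python
import numpy as np

class KVCache:
    """Per-layer key/value cache for streaming self-attention (sketch).

    Each new chunk appends its projected keys and values; frames older than
    `max_history` are dropped, so a streaming step only computes projections
    for the new frames and attends against a bounded cached context."""

    def __init__(self, max_history):
        self.max_history = max_history
        self.k = None
        self.v = None

    def append(self, k_new, v_new):
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = np.concatenate([self.k, k_new], axis=0)
            self.v = np.concatenate([self.v, v_new], axis=0)
        # Truncate to bound memory and per-step attention cost.
        self.k = self.k[-self.max_history:]
        self.v = self.v[-self.max_history:]
        return self.k, self.v

cache = KVCache(max_history=4)
for _ in range(3):          # three 2-frame chunks arrive
    k, v = cache.append(np.ones((2, 8)), np.ones((2, 8)))
```

With the cache in place, per-chunk cost depends only on the chunk and window sizes, not on how long the stream has been running.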

Results and Evaluation

The paper reports a notable 10% relative improvement in word error rate (WER) over RNN-T and competing streaming models, with T-T achieving 8.88% WER in zero-lookahead tests. Runtime efficiency is also highlighted: T-T models achieve a real-time factor (RTF) as low as 0.2 on CPU machines under small-lookahead conditions.
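For readers unfamiliar with the metric, RTF is simply compute time divided by audio duration; the helper below is a generic definition, not code from the paper:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = wall-clock processing time / audio duration.
    RTF < 1 means the system keeps up with real-time audio."""
    return processing_seconds / audio_seconds

# E.g. decoding 10 s of audio in 2 s corresponds to the reported RTF of 0.2.
rtf = real_time_factor(2.0, 10.0)
```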

While the Transformer encoder is inherently more computationally demanding, careful truncation of the attention history keeps latency in check, and buffering strategies further refine the balance between latency and resource use. Experiments with different batch sizes during testing demonstrate the scalability and adaptability of the proposed architecture, supporting its deployment in practical scenarios with controlled latency constraints.

Implications and Future Directions

The research paves the way for further advancements in real-time ASR, reflecting both practical and theoretical implications. The presented T-T model offers a viable path forward in contexts where both latency and accuracy are critical, such as mobile applications and real-time communication platforms.

On a theoretical level, this work encourages further exploration of integrating Transformer mechanisms within real-time E2E speech recognition and other related sequential tasks. Future work could explore extending the method to cover diverse acoustic environments or languages, improving generalization capability.

The findings exemplify a significant progression in striking a balance between accurate speech recognition and computational efficiency, fostering a wider application of E2E models in real-world scenarios.

Authors (5)
  1. Xie Chen (165 papers)
  2. Yu Wu (196 papers)
  3. Zhenghao Wang (5 papers)
  4. Shujie Liu (101 papers)
  5. Jinyu Li (164 papers)
Citations (163)