Very Deep Self-Attention Networks for End-to-End Speech Recognition (1904.13377v2)

Published 30 Apr 2019 in cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: Recently, end-to-end sequence-to-sequence models for speech recognition have gained significant interest in the research community. While previous architecture choices revolve around time-delay neural networks (TDNN) and long short-term memory (LSTM) recurrent neural networks, we propose to use self-attention via the Transformer architecture as an alternative. Our analysis shows that deep Transformer networks with high learning capacity are able to exceed performance from previous end-to-end approaches and even match the conventional hybrid systems. Moreover, we trained very deep models with up to 48 Transformer layers for both encoder and decoders combined with stochastic residual connections, which greatly improve generalizability and training efficiency. The resulting models outperform all previous end-to-end ASR approaches on the Switchboard benchmark. An ensemble of these models achieve 9.9% and 17.7% WER on Switchboard and CallHome test sets respectively. This finding brings our end-to-end models to competitive levels with previous hybrid systems. Further, with model ensembling the Transformers can outperform certain hybrid systems, which are more complicated in terms of both structure and training procedure.

Very Deep Self-Attention Networks for End-to-End Speech Recognition

This paper explores the use of the Transformer architecture, a model built on self-attention mechanisms, for end-to-end automatic speech recognition (ASR). End-to-end approaches in ASR have traditionally relied on LSTM networks or TDNNs; here, the authors investigate whether the Transformer, a model originally developed for NLP tasks, can be applied to the domain of speech recognition.

The core innovation lies in using the Transformer without LSTM or convolutional components, which were historically considered indispensable for capturing temporal patterns in speech data. The authors show that Transformers of considerable depth, up to 48 layers, can surpass prior end-to-end ASR models and even rival traditional hybrid systems. Notably, these very deep models outperformed existing end-to-end models on the Switchboard benchmark, with an ensemble achieving 9.9% WER on the Switchboard test set and 17.7% on the CallHome test set. These results suggest that the Transformer approach, augmented with stochastic residual connections, improves generalization and training efficiency.

Key Architectural Choices

The methodology hinges on a sequence-to-sequence (S2S) architecture in which the Transformer serves as both encoder and decoder without any recurrence. The encoder transforms input audio features into high-level representations, while the decoder generates the corresponding text sequence. Both components rely on multi-head attention, which lets every position attend to the entire sequence in parallel, in contrast to the sequential recurrent dependencies of LSTMs.
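
To make the architecture concrete, below is a minimal PyTorch sketch of a single encoder layer operating on acoustic feature frames. The layer sizes (512-dimensional model, 8 heads, 2048-dimensional feed-forward) and the post-norm residual placement are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SpeechEncoderLayer(nn.Module):
    """One Transformer encoder layer for speech frames (illustrative sketch)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Dropout(dropout), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Every frame attends to every other frame in parallel; no recurrent state.
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise feed-forward network applied to each frame independently.
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x

# Usage: a batch of 2 utterances, 300 frames each, projected to 512 dimensions.
x = torch.randn(2, 300, 512)
out = SpeechEncoderLayer()(x)   # shape (2, 300, 512)
```

Stacking several dozen such layers, together with positional encodings and a matching decoder that adds cross-attention over the encoder output, yields the overall S2S model described above.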

Critically, the model benefits from considerable depth. Through extensive experimentation, the authors show that deeper models perform better: configurations with 24 and 48 layers significantly outperform their shallower counterparts, with accuracy and generalization further enhanced by stochastic depth, a form of dropout applied to entire residual layers. Alongside regular dropout and label smoothing, this technique mitigates the risk of overfitting.
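
The layer-dropping idea can be illustrated with a small sketch of a stochastic residual connection: during training a whole sub-layer is skipped with some probability, leaving only the identity path, while at inference every layer is kept and its output rescaled. The constant drop probability and the rescaling rule below are assumptions for illustration; the paper's exact schedule may differ.

```python
import torch
import torch.nn as nn

class StochasticResidual(nn.Module):
    """Residual wrapper that randomly skips its sub-layer during training
    (stochastic depth). Illustrative sketch, not the paper's exact recipe."""
    def __init__(self, sublayer: nn.Module, drop_prob: float = 0.2):
        super().__init__()
        self.sublayer = sublayer
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.training and torch.rand(1).item() < self.drop_prob:
            # Skip the whole sub-layer; gradients flow through the identity path.
            return x
        if self.training:
            return x + self.sublayer(x)
        # At inference all layers are active; rescale by the keep probability
        # so expected activations roughly match training.
        return x + (1.0 - self.drop_prob) * self.sublayer(x)
```

In practice the drop probability is often scheduled to grow with depth, so that lower layers are almost always kept while the deepest layers are dropped more aggressively; this is one way very deep stacks remain trainable.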

Numerical Findings and Results

The empirical results show consistent improvements across configurations as depth grows, underscoring the efficacy of deeper architectures. Notably, a 48-layer configuration achieves a WER of 10.4% on the Switchboard test set, a marked improvement over previous end-to-end ASR benchmarks.

Additionally, on the TED-LIUM 3 dataset, the stochastic 36-layer encoder and 12-layer decoder configuration yielded a test WER of 10.6%, showcasing the versatility and robustness of this approach across different datasets.

Implications and Future Directions

This research has significant implications for ASR, suggesting that deep self-attention networks can serve as strong contenders to traditional systems. The findings pave the way for exploring even deeper architectures, potentially combined with data augmentation techniques, to drive word error rates down further.

Looking forward, it will be pertinent to investigate the deployment of such architectures in real-world scenarios, including live or streaming contexts where latency becomes critical. Future work could also explore hybrid strategies that integrate self-attention mechanisms with elements of traditional acoustic models to further refine ASR performance.

In summary, this paper lays foundational work for employing very deep self-attention networks in ASR, marking a significant stride toward more efficient and effective speech recognition models.

Authors (6)
  1. Ngoc-Quan Pham (20 papers)
  2. Thai-Son Nguyen (13 papers)
  3. Jan Niehues (76 papers)
  4. Markus Müller (114 papers)
  5. Sebastian Stüker (11 papers)
  6. Alexander Waibel (45 papers)
Citations (159)