State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
In "State-of-the-Art Speech Recognition with Sequence-to-Sequence Models," the authors present a series of structural and optimization improvements to attention-based encoder-decoder architectures, focusing on the Listen, Attend and Spell (LAS) model. The primary objective is to improve automatic speech recognition (ASR), particularly on the challenging voice search task.
The authors introduce notable structural and optimization modifications to the LAS model, achieving marked improvements in word error rate (WER). A key contribution of this work is a detailed comparison of traditional grapheme output units with word piece models (WPM). They demonstrate that substituting graphemes with word pieces yields better performance, because the longer word-piece units carry a stronger implicit language model than individual graphemes.
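To make the grapheme-versus-word-piece distinction concrete, the sketch below applies a greedy longest-match segmentation of the kind commonly used with a trained word-piece vocabulary. The toy vocabulary and the `wordpiece_tokenize` helper are illustrative assumptions; in the paper the word pieces themselves are learned from training data.

```python
# Minimal sketch of greedy longest-match word-piece segmentation.
# TOY_VOCAB is an illustrative assumption, not the paper's inventory.
TOY_VOCAB = {"_play", "_some", "_mus", "ic", "_jazz"}

def wordpiece_tokenize(sentence, vocab=TOY_VOCAB):
    """Split each word into the longest matching vocabulary pieces."""
    pieces = []
    for word in sentence.lower().split():
        word = "_" + word              # "_" marks a word boundary
        start = 0
        while start < len(word):
            # Try the longest remaining substring first, backing off one
            # character at a time; fall back to a single grapheme if no
            # piece matches (graphemes act as the out-of-vocabulary floor).
            for end in range(len(word), start, -1):
                if word[start:end] in vocab:
                    pieces.append(word[start:end])
                    start = end
                    break
            else:
                pieces.append(word[start])
                start += 1
    return pieces

print(wordpiece_tokenize("play some music"))
# -> ['_play', '_some', '_mus', 'ic']
```

Because a single piece like `_mus` spans several graphemes, the decoder predicts fewer, more informative symbols per utterance, which is the source of the stronger language-model effect.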
Further structural enhancement comes from multi-head attention (MHA), which lets the decoder attend to multiple locations in the encoded speech simultaneously, with each head free to focus on different input features. The authors attribute a 13% relative improvement in WER to MHA.
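As a rough illustration of the mechanism, the following sketch computes multi-head attention for a single decoder step. Random matrices stand in for trained projection weights, so this is a minimal sketch of the general technique, not the paper's exact parameterization.

```python
import numpy as np

def multi_head_attention(query, keys, num_heads=4, seed=0):
    """Toy multi-head attention for one decoder step.

    query: (d,) decoder state; keys: (T, d) encoder outputs. Each head
    projects queries, keys, and values into its own subspace, computes a
    softmax over the T encoder frames, and the per-head context vectors
    are concatenated. Random matrices stand in for trained weights.
    """
    rng = np.random.default_rng(seed)
    T, d = keys.shape
    d_head = d // num_heads
    contexts = []
    for _ in range(num_heads):
        Wq, Wk, Wv = (rng.standard_normal((d, d_head)) / np.sqrt(d)
                      for _ in range(3))
        scores = (keys @ Wk) @ (query @ Wq) / np.sqrt(d_head)  # (T,)
        w = np.exp(scores - scores.max())
        w /= w.sum()                          # attention weights per head
        contexts.append(w @ (keys @ Wv))      # (d_head,) head context
    return np.concatenate(contexts)           # (num_heads * d_head,)

ctx = multi_head_attention(np.ones(64), np.random.randn(100, 64))
print(ctx.shape)  # (64,)
```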
Optimization techniques are rigorously investigated as well. The paper reports that synchronous training converges faster, and to a better optimum, with greater stability than asynchronous training. Scheduled sampling (SS) and label smoothing (LS) are also employed: SS reduces the mismatch between training and inference by occasionally feeding the decoder its own predictions, while LS regularizes the output distribution and curbs overconfidence. Perhaps most notable is the adoption of minimum word error rate (MWER) training, which directly optimizes a sequence-level loss aligned with the WER metric and yields an additional 13.4% relative improvement in WER.
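Two of these ideas are easy to sketch. Below, `label_smoothing_loss` computes cross-entropy against a smoothed target distribution, and `mwer_loss` computes the expected number of word errors over an n-best list, which is the quantity MWER training minimizes. The smoothing value and the baseline subtraction are standard choices assumed here for illustration, not settings taken from the paper.

```python
import numpy as np

def label_smoothing_loss(logits, target, epsilon=0.1):
    """Cross-entropy against a smoothed target: mass (1 - epsilon) on the
    correct token, epsilon spread uniformly over the vocabulary. This
    penalizes overconfident output distributions. epsilon=0.1 is an
    illustrative value."""
    m = logits.max()
    log_probs = logits - (m + np.log(np.exp(logits - m).sum()))  # log-softmax
    smooth = np.full(logits.shape[-1], epsilon / logits.shape[-1])
    smooth[target] += 1.0 - epsilon
    return -(smooth * log_probs).sum()

def mwer_loss(hyp_log_probs, hyp_word_errors):
    """Expected number of word errors over an n-best list, the quantity
    MWER training minimizes. Probabilities are renormalized over the list
    and the mean error is subtracted as a variance-reducing baseline."""
    p = np.exp(hyp_log_probs - hyp_log_probs.max())
    p /= p.sum()
    return (p * (hyp_word_errors - hyp_word_errors.mean())).sum()

print(label_smoothing_loss(np.array([2.0, 0.5, -1.0]), target=0))
print(mwer_loss(np.array([-1.0, -2.0, -4.0]), np.array([0.0, 2.0, 3.0])))
```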
The culmination of these techniques results in a substantial enhancement in performance. The proposed LAS model achieves a WER of 5.6% on a 12,500-hour voice search task, outperforming the best conventional HMM-based system which records a WER of 6.7%. Similarly, on a dictation task, the LAS model achieves a WER of 4.1%, compared to the 5% WER of the conventional system.
Beyond structural and optimization improvements, the paper also explores the integration of an external language model (LM) for second-pass rescoring of the n-best hypotheses, which yields a further 3.4% relative improvement in WER.
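A minimal sketch of such second-pass rescoring: each hypothesis is re-scored as a weighted combination of the seq2seq log-probability, the external LM log-probability, and a coverage term, then the list is re-ranked. The weights `lam` and `cov` are hypothetical placeholders; in practice such weights are tuned on held-out data.

```python
def rescore_nbest(hypotheses, lam=0.1, cov=0.05):
    """Re-rank an n-best list with an external LM and a coverage term.

    Each hypothesis is a dict with 'text', 'las_logp' (seq2seq log-prob),
    'lm_logp' (external LM log-prob), and 'coverage' (how much of the
    input the attention covered). lam and cov are placeholder weights.
    """
    return max(hypotheses,
               key=lambda h: h["las_logp"] + lam * h["lm_logp"]
                             + cov * h["coverage"])

best = rescore_nbest([
    {"text": "play some music", "las_logp": -3.2, "lm_logp": -12.0, "coverage": 0.9},
    {"text": "play sum music",  "las_logp": -3.0, "lm_logp": -20.5, "coverage": 0.9},
])
print(best["text"])  # "play some music": the LM prefers the fluent transcript
```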
The theoretical implications of this research highlight the diminishing necessity of separate acoustic, pronunciation, and language models in ASR systems. By folding these components into a single, unified neural network, the complexity of ASR training is substantially reduced, eliminating the need for the finite state transducers, lexicons, and text normalization pipelines traditionally required by conventional systems.
Practically, the introduction of word piece models and multi-head attention yields more robust recognition, especially in noisy and acoustically diverse environments. The synchronous training framework improves the stability and convergence rate of large-scale neural network training, making the approach more viable for real-world deployment.
In forecasting future developments, one anticipates further exploration of streaming capabilities for such attention-based models. Adaptations like the Neural Transducer, which offer low-latency streaming decoding, are likely to be pivotal in making sequence-to-sequence models more practical for real-time applications.
Overall, this paper not only demonstrates the practicality of sequence-to-sequence models in surpassing state-of-the-art conventional ASR systems but also lays the groundwork for future innovations in the domain of end-to-end speech recognition.