State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
In "State-of-the-Art Speech Recognition with Sequence-to-Sequence Models," the authors present a series of structural and optimization improvements to attention-based encoder-decoder architectures, focusing on the Listen, Attend and Spell (LAS) model. The primary objective is to improve automatic speech recognition (ASR), particularly on the challenging voice search task.
The authors introduce notable structural and optimization modifications to the LAS model, achieving marked improvements in word error rate (WER). A key contribution of this work is a detailed comparison of traditional grapheme output units with word piece models (WPM). They demonstrate that substituting graphemes with word pieces yields better performance, because the longer word-piece units carry a stronger implicit language model than individual graphemes.
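To make the grapheme-versus-word-piece distinction concrete, the sketch below applies a greedy longest-match segmentation of the kind commonly used with a trained word-piece vocabulary. The toy vocabulary and the `wordpiece_tokenize` helper are illustrative assumptions; in the paper the word pieces themselves are learned from training data.

```python
# Minimal sketch of greedy longest-match word-piece segmentation.
# TOY_VOCAB is an illustrative assumption, not the paper's inventory.
TOY_VOCAB = {"_play", "_some", "_mus", "ic", "_jazz"}

def wordpiece_tokenize(sentence, vocab=TOY_VOCAB):
    """Split each word into the longest matching vocabulary pieces."""
    pieces = []
    for word in sentence.lower().split():
        word = "_" + word              # "_" marks a word boundary
        start = 0
        while start < len(word):
            # Try the longest remaining substring first, backing off one
            # character at a time; fall back to a single grapheme if no
            # piece matches (graphemes act as the out-of-vocabulary floor).
            for end in range(len(word), start, -1):
                if word[start:end] in vocab:
                    pieces.append(word[start:end])
                    start = end
                    break
            else:
                pieces.append(word[start])
                start += 1
    return pieces

print(wordpiece_tokenize("play some music"))
# -> ['_play', '_some', '_mus', 'ic']
```

Because a single piece like `_mus` spans several graphemes, the decoder predicts fewer, more informative symbols per utterance, which is the source of the stronger language-model effect.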
Further structural enhancement comes from multi-head attention (MHA), which lets the decoder attend to multiple locations in the encoded speech simultaneously, with each head free to focus on different input features. The authors attribute a 13% relative improvement in WER to MHA.
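As a rough illustration of the mechanism, the following sketch computes multi-head attention for a single decoder step. Random matrices stand in for trained projection weights, so this is a minimal sketch of the general technique, not the paper's exact parameterization.

```python
import numpy as np

def multi_head_attention(query, keys, num_heads=4, seed=0):
    """Toy multi-head attention for one decoder step.

    query: (d,) decoder state; keys: (T, d) encoder outputs. Each head
    projects queries, keys, and values into its own subspace, computes a
    softmax over the T encoder frames, and the per-head context vectors
    are concatenated. Random matrices stand in for trained weights.
    """
    rng = np.random.default_rng(seed)
    T, d = keys.shape
    d_head = d // num_heads
    contexts = []
    for _ in range(num_heads):
        Wq, Wk, Wv = (rng.standard_normal((d, d_head)) / np.sqrt(d)
                      for _ in range(3))
        scores = (keys @ Wk) @ (query @ Wq) / np.sqrt(d_head)  # (T,)
        w = np.exp(scores - scores.max())
        w /= w.sum()                          # attention weights per head
        contexts.append(w @ (keys @ Wv))      # (d_head,) head context
    return np.concatenate(contexts)           # (num_heads * d_head,)

ctx = multi_head_attention(np.ones(64), np.random.randn(100, 64))
print(ctx.shape)  # (64,)
```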
Optimization techniques are rigorously investigated as well. The paper reports that synchronous training converges faster, and to a better optimum, with greater stability than asynchronous training. Scheduled sampling (SS) and label smoothing (LS) are also employed: SS reduces the mismatch between training and inference by occasionally feeding the decoder its own predictions, while LS regularizes the output distribution and curbs overconfidence. Perhaps most notable is the adoption of minimum word error rate (MWER) training, which directly optimizes a sequence-level loss aligned with the WER metric and yields an additional 13.4% relative improvement in WER.
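Two of these ideas are easy to sketch. Below, `label_smoothing_loss` computes cross-entropy against a smoothed target distribution, and `mwer_loss` computes the expected number of word errors over an n-best list, which is the quantity MWER training minimizes. The smoothing value and the baseline subtraction are standard choices assumed here for illustration, not settings taken from the paper.

```python
import numpy as np

def label_smoothing_loss(logits, target, epsilon=0.1):
    """Cross-entropy against a smoothed target: mass (1 - epsilon) on the
    correct token, epsilon spread uniformly over the vocabulary. This
    penalizes overconfident output distributions. epsilon=0.1 is an
    illustrative value."""
    m = logits.max()
    log_probs = logits - (m + np.log(np.exp(logits - m).sum()))  # log-softmax
    smooth = np.full(logits.shape[-1], epsilon / logits.shape[-1])
    smooth[target] += 1.0 - epsilon
    return -(smooth * log_probs).sum()

def mwer_loss(hyp_log_probs, hyp_word_errors):
    """Expected number of word errors over an n-best list, the quantity
    MWER training minimizes. Probabilities are renormalized over the list
    and the mean error is subtracted as a variance-reducing baseline."""
    p = np.exp(hyp_log_probs - hyp_log_probs.max())
    p /= p.sum()
    return (p * (hyp_word_errors - hyp_word_errors.mean())).sum()

print(label_smoothing_loss(np.array([2.0, 0.5, -1.0]), target=0))
print(mwer_loss(np.array([-1.0, -2.0, -4.0]), np.array([0.0, 2.0, 3.0])))
```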
The culmination of these techniques results in a substantial enhancement in performance. The proposed LAS model achieves a WER of 5.6% on a 12,500-hour voice search task, outperforming the best conventional HMM-based system which records a WER of 6.7%. Similarly, on a dictation task, the LAS model achieves a WER of 4.1%, compared to the 5% WER of the conventional system.
Beyond structural and optimization improvements, the paper also explores the integration of an external language model (LM) for second-pass rescoring of the n-best hypotheses, which yields a further 3.4% relative improvement in WER.
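A minimal sketch of such second-pass rescoring: each hypothesis is re-scored as a weighted combination of the seq2seq log-probability, the external LM log-probability, and a coverage term, then the list is re-ranked. The weights `lam` and `cov` are hypothetical placeholders; in practice such weights are tuned on held-out data.

```python
def rescore_nbest(hypotheses, lam=0.1, cov=0.05):
    """Re-rank an n-best list with an external LM and a coverage term.

    Each hypothesis is a dict with 'text', 'las_logp' (seq2seq log-prob),
    'lm_logp' (external LM log-prob), and 'coverage' (how much of the
    input the attention covered). lam and cov are placeholder weights.
    """
    return max(hypotheses,
               key=lambda h: h["las_logp"] + lam * h["lm_logp"]
                             + cov * h["coverage"])

best = rescore_nbest([
    {"text": "play some music", "las_logp": -3.2, "lm_logp": -12.0, "coverage": 0.9},
    {"text": "play sum music",  "las_logp": -3.0, "lm_logp": -20.5, "coverage": 0.9},
])
print(best["text"])  # "play some music": the LM prefers the fluent transcript
```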
The theoretical implications of this research highlight the diminishing necessity of separate acoustic, pronunciation, and language models in ASR systems. By folding these components into a single, unified neural network, the complexity of ASR training is substantially reduced, eliminating the need for the finite state transducers, lexicons, and text normalization pipelines traditionally required by conventional systems.
Practically, the introduction of word piece models and multi-head attention yields more robust recognition, especially in noisy and acoustically diverse environments. The synchronous training framework improves the stability and convergence rate of large-scale neural network training, making the approach more viable for real-world deployment.
In forecasting future developments, one anticipates further exploration of streaming capabilities for such attention-based models. Adaptations like the Neural Transducer, which offer low-latency streaming decoding, are likely to be pivotal in making sequence-to-sequence models more practical for real-time applications.
Overall, this paper not only demonstrates the practicality of sequence-to-sequence models in surpassing state-of-the-art conventional ASR systems but also lays the groundwork for future innovations in the domain of end-to-end speech recognition.