Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition (1610.09975v1)

Published 31 Oct 2016 in cs.CL, cs.LG, and cs.NE

Abstract: We present results that show it is possible to build a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units. We model the output vocabulary of about 100,000 words directly using deep bi-directional LSTM RNNs with CTC loss. The model is trained on 125,000 hours of semi-supervised acoustic training data, which enables us to alleviate the data sparsity problem for word models. We show that the CTC word models work very well as an end-to-end all-neural speech recognition model without the use of traditional context-dependent sub-word phone units that require a pronunciation lexicon, and without any language model, removing the need to decode. We demonstrate that the CTC word models perform better than a strong, more complex, state-of-the-art baseline with sub-word units.

Authors (3)
  1. Hagen Soltau (19 papers)
  2. Hank Liao (13 papers)
  3. Hasim Sak (15 papers)
Citations (308)

Summary

The paper "Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition" presents a significant advancement in the domain of end-to-end automatic speech recognition (ASR) systems. This research introduces a model that directly processes input acoustic data into word units, circumventing the need for intermediate sub-word or phonetic units typically required in conventional systems. The model utilizes deep bi-directional Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) optimized with Connectionist Temporal Classification (CTC) loss, which facilitates efficient mapping from acoustic signals to lexical items.

Key Contributions

  1. End-to-End Acoustic Modeling: The authors propose a streamlined approach by using whole-word units in contrast to traditional context-dependent phonetic models. This choice obviates the need for a pronunciation lexicon and complex decoding processes, simplifying the architecture and potentially increasing system robustness against domain variations.
  2. Large Vocabulary Integration: The system accommodates an expansive vocabulary of approximately 100,000 words. This is achieved by leveraging a vast corpus of 125,000 hours of semi-supervised acoustic training data, effectively addressing the issue of data sparsity which can plague word-level models.
  3. Elimination of Language Model Dependence: By predicting words directly, the system removes the dependence on an external language model. Experiments demonstrate that even without any language model, the neural speech recognizer surpasses traditional systems built on intricate hybrid models (a minimal decoding sketch follows this list).
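
Because the model emits whole-word posteriors, operating without a language model reduces to reading the best word off each frame. Below is a hedged sketch of greedy CTC decoding, which collapses repeated labels and drops blanks; it follows the same illustrative conventions as the model sketch above (blank index 0), and the paper's exact procedure may differ.

```python
import torch

def ctc_greedy_decode(log_probs: torch.Tensor, blank: int = 0) -> list:
    """Collapse a (time, vocab+1) posterior matrix into a word-id sequence."""
    best_path = log_probs.argmax(dim=-1).tolist()
    words, prev = [], blank
    for w in best_path:
        if w != blank and w != prev:  # drop blanks, merge repeated frames
            words.append(w)
        prev = w
    return words

# e.g. with the sketch model above:
# word_ids = ctc_greedy_decode(model(frames)[0])
```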

Experimental Results

The paper reports a detailed comparison between the proposed model and state-of-the-art systems using context-dependent phonetic units. Key performance metrics show the following:

  • The system achieves a word error rate (WER) of 13.9% without any language model, improving to 13.4% with language model integration. Both figures improve on the best-performing, more complex baseline, which recorded a WER of 14.2%. (A reference WER computation is sketched after this list.)
  • The training corpus is drawn from user-uploaded YouTube video captions, filtered to ensure alignment fidelity, which yields a large volume of usable acoustic data. Results from this large-scale semi-supervised approach support the model's applicability to diverse, real-world data sources.
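
The WER figures above follow the standard definition: the word-level edit distance between hypothesis and reference, divided by the reference length. A reference implementation of this standard metric (not code from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# wer("the cat sat", "the cat sat down")  ->  1/3 ≈ 0.333
```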

Methodology and Architecture

The neural speech recognizer employs a multi-layer LSTM RNN framework with bidirectional processing, giving every frame access to both past and future acoustic context, while the CTC loss handles the alignment between input frames and output words (the objective is written out after the list below). The paper highlights:

  • Training Infrastructure: Asynchronous stochastic gradient descent (ASGD) across distributed workers allows the very large training corpus to be processed efficiently.
  • Word Posterior Probabilities: The output layer directly predicts word probabilities, further processed by finite state transducers (FSTs) to manage alignment and produce coherent word sequences.
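
For completeness, the CTC objective referenced above can be stated compactly. With $y^t_k$ the network's posterior for output $k$ (a word or the blank) at frame $t$, and $\mathcal{B}$ the collapsing map that removes blanks and repeated labels, training maximizes the total probability of all frame-level paths $\pi$ that collapse to the reference word sequence $\mathbf{w}$:

$$P(\mathbf{w} \mid \mathbf{x}) \;=\; \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{w})} \prod_{t=1}^{T} y^{t}_{\pi_t}, \qquad \mathcal{L}_{\text{CTC}} = -\log P(\mathbf{w} \mid \mathbf{x})$$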

Implications and Future Directions

The implications of this paper are multi-faceted:

  • Practical Applications: By delivering a comprehensive neural network model with competitive accuracy levels, this research paves the way for deploying scalable ASR systems in real-time applications without necessitating complex backend processing.
  • Theoretical Advancements: The simplified architecture supports the hypothesis that very large volumes of semi-supervised data can compensate for the model complexity that conventional lexicon-dependent systems rely on.
  • Future Research Directions: The paper opens avenues for refining acoustic models to further improve processing efficiency, and for incorporating broader contextual understanding into end-to-end frameworks to improve performance on varied and unseen vocabularies.

In conclusion, the research marks a significant step forward in neural ASR technology, delivering competitive accuracy with a far simpler transcription pipeline and potentially setting a new standard for end-to-end speech recognition systems.