End-to-End Continuous Speech Recognition Using Attention-Based Recurrent Neural Networks
This paper investigates replacing the traditional Hidden Markov Model (HMM) architecture of continuous speech recognition systems with a bidirectional recurrent neural network (RNN) framework. The framework pairs an encoder-decoder architecture with an attention mechanism, enabling end-to-end training without an explicit alignment between input and output sequences. The results highlight how well attention mechanisms handle sequence-to-sequence tasks in which the input and output lengths differ.
Methodology
The proposed model comprises three core components:
- Encoder: A bidirectional RNN that reads the input acoustic frames and produces a sequence of feature annotations, one per frame.
- Attention Mechanism: Aligns elements of the output sequence with the relevant parts of the input sequence, much as in neural machine translation.
- Decoder: A recurrent neural network that emits the output phoneme sequence one symbol at a time, conditioned on context vectors computed by the attention mechanism.
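The interplay of the three components can be pictured with a single decoding step in NumPy. Everything below is illustrative: the dimensions, the tanh RNN cells, and the bilinear scoring function are simplified stand-ins for the paper's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 5 acoustic frames of 13 features (e.g. MFCCs),
# hidden size 8, phoneme vocabulary of 4 symbols.
T, n_in, n_h, n_out = 5, 13, 8, 4

def rnn_pass(X, W_x, W_h):
    """Simple tanh RNN over the time axis; returns all hidden states."""
    h = np.zeros(n_h)
    states = []
    for x in X:
        h = np.tanh(x @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Encoder: a bidirectional RNN concatenates a forward pass with a
# time-reversed backward pass, so each row of H summarizes the
# utterance around one frame.
X = rng.normal(size=(T, n_in))
W_xf, W_hf = rng.normal(size=(n_in, n_h)), rng.normal(size=(n_h, n_h))
W_xb, W_hb = rng.normal(size=(n_in, n_h)), rng.normal(size=(n_h, n_h))
H = np.concatenate([rnn_pass(X, W_xf, W_hf),
                    rnn_pass(X[::-1], W_xb, W_hb)[::-1]], axis=1)  # (T, 2*n_h)

# Attention: a trainable scoring function (here a bilinear form) compares
# the decoder state with every encoder state; the softmax of the scores
# gives alignment weights, and their weighted sum is the context vector.
s = rng.normal(size=n_h) * 0.1          # current decoder state
W_score = rng.normal(size=(n_h, 2 * n_h))
alpha = softmax(s @ W_score @ H.T)      # (T,) alignment weights
context = alpha @ H                     # (2*n_h,) context vector

# Decoder: one recurrent step conditioned on the context, followed by a
# softmax over the phoneme vocabulary.
W_c = rng.normal(size=(2 * n_h, n_h))
W_s = rng.normal(size=(n_h, n_h))
W_o = rng.normal(size=(n_h, n_out))
s = np.tanh(context @ W_c + s @ W_s)
p = softmax(s @ W_o)                    # distribution over the next phoneme
```

Training backpropagates through all of these weights jointly, which is what makes the system end-to-end.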
The attention mechanism plays a pivotal role: at each output step it dynamically focuses on the relevant portion of the input sequence, guided by a trainable scoring function. It is further enhanced with gating functions and penalties that encourage monotonic alignments, a crucial property for speech, where phonetic sequences follow the temporal order of the audio.
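One simple way to realize such a penalty can be sketched as follows. The expected-position formulation here is a hypothetical soft constraint chosen for clarity, not the paper's exact mechanism: it charges each unit of attention mass by how far it falls behind the previous step's expected position.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def monotonicity_penalty(alpha_prev, alpha, strength=1.0):
    """Soft penalty on attention mass that moves left of the previous
    step's expected position (an illustrative formulation)."""
    pos = np.arange(len(alpha))
    prev_center = alpha_prev @ pos                   # expected position at t-1
    backward = np.clip(prev_center - pos, 0, None)   # distance moved backward
    return strength * (alpha @ backward)

# Alignments over 6 input frames at two consecutive output steps.
alpha_prev = softmax(np.array([0., 2., 4., 1., 0., 0.]))  # centered near frame 2
forward    = softmax(np.array([0., 0., 1., 4., 2., 0.]))  # advances rightward
backward   = softmax(np.array([4., 2., 0., 0., 0., 0.]))  # jumps back to the start

# The backward jump is penalized far more heavily than the forward step.
print(monotonicity_penalty(alpha_prev, forward) <
      monotonicity_penalty(alpha_prev, backward))
```

Adding such a term to the training loss nudges the model toward left-to-right alignments without hard-coding them.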
Results
The model achieves a phoneme error rate of 18.57% on the TIMIT dataset, competitive with state-of-the-art HMM-DNN systems. Notably, accuracy degrades little even under a narrow beam or purely greedy decoding, in contrast to traditional systems, whose performance depends on searching a wide beam of hypotheses.
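The decoding claim is easy to picture with a toy beam search. The `toy_step` model below is entirely made up (it deterministically prefers the sequence 1, 2, 3, then end-of-sequence); the point is that `beam_width=1` degenerates to greedy search, and the paper's observation is that the attention model loses little accuracy in that regime.

```python
import numpy as np

def beam_search(step_fn, start, eos, beam_width=2, max_len=10):
    """Narrow beam search: step_fn maps a partial phoneme sequence to a
    vector of log-probabilities for the next phoneme."""
    beams = [(0.0, [start])]
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == eos:                # finished hypotheses pass through
                candidates.append((logp, seq))
                continue
            scores = step_fn(seq)
            for tok in np.argsort(scores)[-beam_width:]:
                candidates.append((logp + scores[tok], seq + [int(tok)]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == eos for _, seq in beams):
            break
    return beams[0][1]

def toy_step(seq):
    """Hypothetical model that strongly prefers 1 -> 2 -> 3 -> 0 (eos)."""
    prefs = {1: 2, 2: 3, 3: 0}
    logits = np.full(4, -5.0)
    logits[prefs.get(seq[-1], 0)] = 0.0
    return logits

greedy = beam_search(toy_step, start=1, eos=0, beam_width=1)  # greedy search
narrow = beam_search(toy_step, start=1, eos=0, beam_width=2)  # narrow beam
```

For a well-trained attention model, as here, the two settings yield the same hypothesis, which is why cheap decoding suffices.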
Implications and Future Directions
The implications of this research are multifold:
- Theoretical Advancement: Demonstrates that RNNs with attention can predict phoneme sequences directly, sidestepping the per-frame predictions and forced alignments of conventional pipelines.
- Practical Deployment: Simplifies application architecture by eliminating intricate components such as explicit HMM alignments, reducing the complexity and potentially lowering computational costs in real-time speech systems.
- Future Prospects: Suggests extending the method to large vocabulary continuous speech recognition via a hierarchical RNN that transitions from phoneme-level to word-level recognition. This could simplify the modeling of word sequences while maintaining state-of-the-art accuracy and potentially reduce latency in speech applications.
This paper lays a foundation for further exploration of RNN-based speech recognition and for a tighter integration of deep learning techniques into a traditionally statistical domain. As computational and data resources grow, the accuracy and applicability of such models can be expected to improve.