Essay: Attention-Based Models for Speech Recognition
The paper "Attention-Based Models for Speech Recognition" by Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio introduces novel extensions to attention-based recurrent networks, making them applicable to the speech recognition domain. This research addresses a significant challenge in processing long and noisy input sequences inherent in speech.
Motivation and Context
Recurrent sequence generators conditioned through an attention mechanism have already demonstrated their utility across various domains, including machine translation, handwriting synthesis, and image caption generation. However, extending these attention mechanisms to speech recognition poses unique challenges. Unlike text-based tasks such as machine translation, which deal with relatively short, clean sequences, speech recognition often involves much longer inputs and demands distinguishing between similar speech fragments within a single utterance.
Model and Approach
The authors begin with an adaptation of the attention model originally proposed for machine translation. This baseline achieves a competitive phoneme error rate (PER) of 18.7% on the TIMIT dataset. However, its performance degrades significantly on longer sequences, because the model learns to track absolute position within the input sequence rather than relying on intrinsic content features.
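To make this failure mode concrete, a standard Bahdanau-style content-based score for decoder step i and encoder frame j can be written as follows (the notation is illustrative and may differ slightly from the paper's exact parameterization):

$$e_{i,j} = w^\top \tanh\left(W s_{i-1} + V h_j + b\right), \qquad \alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{j'=1}^{L} \exp(e_{i,j'})}$$

where s_{i-1} is the previous decoder state, h_j the encoder annotation for frame j, and L the input length. Since the score depends only on s_{i-1} and h_j, acoustically similar frames at different positions are indistinguishable unless positional information leaks into h_j through the encoder, which is exactly what the baseline exploits and what breaks on utterances longer than those seen in training.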
To mitigate this limitation, the authors make the attention mechanism location-aware. Convolutional features are computed from the attention weights of the previous step, yielding a hybrid mechanism that combines content and positional information. Concretely, the previous attention weights are convolved with trainable filters, and the resulting auxiliary features are fed into the scoring function of the attention mechanism, as sketched below.
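The following NumPy sketch illustrates one scoring step of such a hybrid mechanism. The function name, weight shapes, and padding scheme are assumptions made for exposition, not the paper's implementation:

```python
import numpy as np

def location_aware_scores(s_prev, H, alpha_prev, W, V, U, F, w, b):
    """One decoding step of hybrid (content + location) attention.

    Shapes (illustrative assumptions, not the paper's hyper-parameters):
    s_prev     : (d_s,)   previous decoder state
    H          : (T, d_h) encoder annotations for T input frames
    alpha_prev : (T,)     attention weights from the previous step
    F          : (k, r)   k trainable 1-D convolution filters of width r
    W, V, U, w, b : projection weights of the scoring MLP
    """
    T = H.shape[0]
    r = F.shape[1]
    padded = np.pad(alpha_prev, r // 2)
    # Convolve the previous alignment with each filter
    # -> location features f of shape (T, k)
    f = np.stack(
        [np.convolve(padded, F[j], mode="valid")[:T] for j in range(F.shape[0])],
        axis=1,
    )
    # Hybrid score: content terms (W s_prev, V h_j) plus location term (U f_j)
    e = np.tanh(s_prev @ W + H @ V + f @ U + b) @ w  # (T,)
    alpha = np.exp(e - e.max())                       # stable softmax
    return alpha / alpha.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d_s, d_h, n, k, r = 50, 8, 16, 10, 4, 7
    alpha = location_aware_scores(
        rng.normal(size=d_s), rng.normal(size=(T, d_h)),
        np.full(T, 1.0 / T),
        rng.normal(size=(d_s, n)), rng.normal(size=(d_h, n)),
        rng.normal(size=(k, n)), rng.normal(size=(k, r)),
        rng.normal(size=n), np.zeros(n),
    )
    assert np.isclose(alpha.sum(), 1.0)
```

Because the filters are trained jointly with the rest of the network, the model can learn priors such as "focus slightly to the right of the previous step," encouraging the roughly monotonic alignments expected in speech.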
Experimental Setup and Results
The experiments are conducted on the TIMIT corpus, a standard dataset for phoneme recognition. The baseline model, adapted from machine translation, achieves a PER of 18.7%. By integrating convolutional features, the model demonstrates improved robustness with a PER of 18.0%. Further refinement by modifying the attention mechanism to prevent excessive concentration on single frames reduces the PER to 17.6%.
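If this modification corresponds to the smoothing described in the paper, it replaces the exponential in the softmax with a logistic sigmoid before normalizing, spreading weight across neighboring frames instead of letting it collapse onto one (again in illustrative notation):

$$\alpha_{i,j} = \frac{\sigma(e_{i,j})}{\sum_{j'=1}^{L} \sigma(e_{i,j'})}, \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$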
The researchers also explore the behavior of these models on artificially elongated utterances, formed by concatenating test utterances. The location-aware models, particularly those utilizing convolutional features, successfully handle sequences many times longer than those seen during training, maintaining PERs below 20% on extended utterances.
Implications
The introduction of a location-aware attention mechanism marks a substantial step towards end-to-end trainable speech recognition systems, moving away from traditional hybrid approaches that combine separate models for acoustic, phonetic, and language modeling. This advancement has potential implications for developing effective neural architectures capable of real-time speech transcription without reliance on pre-engineered components or complex preprocessing stages.
Furthermore, the proposed attention-based models with convolutional features exhibit a marked improvement in handling long sequences. This has broader implications beyond speech recognition, suggesting that similar strategies could be beneficial in other domains requiring the processing of long and noisy input sequences.
Future Directions
Future research could focus on integrating language models directly into the attention-based recurrent sequence generator (ARSG) architecture, enhancing contextual understanding and further reducing error rates. Additionally, extending location-aware attention with convolutional features to other sequence-to-sequence tasks, such as image captioning and text generation, could improve their ability to manage extended contexts.
Conclusion
The proposed attention-based models demonstrate significant progress in speech recognition, achieving competitive performance on the TIMIT dataset and robustly handling long sequences. The combination of content and location-based attention mechanisms offers a promising path for future developments in end-to-end trainable neural architectures across various domains.