- The paper proposes revising Gated Recurrent Units (GRUs) for speech recognition by removing the reset gate and incorporating ReLU activations with batch normalization.
- Experimental results on the TIMIT and DIRHA English WSJ datasets show that the modified GRUs achieve lower error rates than standard GRUs while cutting training time by over 30%.
- The simplified architecture facilitates faster training and deployment, positioning revised GRUs as a strong candidate for efficient speech processing applications and future integration into end-to-end models.
Improving Speech Recognition by Revising Gated Recurrent Units
This paper proposes modifications to the standard Gated Recurrent Unit (GRU) architecture to improve the efficiency and effectiveness of speech recognition systems. The work simplifies the GRU by removing the reset gate and replacing the hyperbolic tangent (tanh) activation with Rectified Linear Unit (ReLU) activations.
Architecture Modifications
In a standard GRU, two gates, the update gate and the reset gate, control the flow of information across time steps. The paper argues for eliminating the reset gate, suggesting that its role is largely redundant for speech signals, which evolve relatively slowly from frame to frame. The authors posit that the GRU can still capture the relevant temporal dependencies without the reset gate, simplifying the architecture and improving computational efficiency.
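For reference, a sketch of the two formulations in standard GRU notation (bias terms omitted; $\sigma$ is the logistic sigmoid and $\odot$ the element-wise product); the simplified update below reflects only the reset-gate removal described above, before the activation and normalization changes discussed next:

```latex
% Standard GRU: the reset gate r_t modulates the recurrent term in the candidate state
\begin{aligned}
z_t &= \sigma\!\left(W_z x_t + U_z h_{t-1}\right) \\
r_t &= \sigma\!\left(W_r x_t + U_r h_{t-1}\right) \\
\tilde{h}_t &= \tanh\!\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right)\right) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}

% Reset gate removed: the candidate state depends directly on h_{t-1}
\begin{aligned}
z_t &= \sigma\!\left(W_z x_t + U_z h_{t-1}\right) \\
\tilde{h}_t &= \tanh\!\left(W_h x_t + U_h h_{t-1}\right) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
```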
The paper further replaces the tanh activation of the candidate state with ReLU, arguing that ReLU neurons mitigate vanishing gradients and speed up training. ReLU activations had previously been considered unstable in RNNs because their unbounded outputs can grow over long sequences. Coupled with batch normalization, however, ReLU remains stable and helps the network converge faster.
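A minimal sketch of such a revised GRU cell is shown below in PyTorch, assuming batch normalization is applied to the feed-forward (input-to-hidden) contributions only; the layer sizes, naming, and exact normalization placement are illustrative rather than a faithful reproduction of the authors' implementation:

```python
import torch
import torch.nn as nn

class RevisedGRUCell(nn.Module):
    """GRU cell without a reset gate, using ReLU + batch norm for the candidate state."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # Feed-forward (input-to-hidden) transforms for the update gate and candidate
        self.wz = nn.Linear(input_size, hidden_size, bias=False)
        self.wh = nn.Linear(input_size, hidden_size, bias=False)
        # Recurrent (hidden-to-hidden) transforms
        self.uz = nn.Linear(hidden_size, hidden_size, bias=False)
        self.uh = nn.Linear(hidden_size, hidden_size, bias=False)
        # Batch normalization on the feed-forward terms (an assumption of this sketch)
        self.bn_z = nn.BatchNorm1d(hidden_size)
        self.bn_h = nn.BatchNorm1d(hidden_size)

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        # Update gate: sigmoid keeps the interpolation weight in [0, 1]
        z_t = torch.sigmoid(self.bn_z(self.wz(x_t)) + self.uz(h_prev))
        # Candidate state: ReLU instead of tanh, and no reset gate applied to h_prev
        h_cand = torch.relu(self.bn_h(self.wh(x_t)) + self.uh(h_prev))
        # Interpolate between the previous state and the candidate
        return z_t * h_prev + (1.0 - z_t) * h_cand

# Example: one step over a batch of 8 frames with hypothetical 40-dim acoustic features
cell = RevisedGRUCell(input_size=40, hidden_size=256)
x_t = torch.randn(8, 40)
h = torch.zeros(8, 256)
h = cell(x_t, h)
```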
Experimental Validation
Experiments on the TIMIT dataset and on distant speech recognition with the DIRHA English WSJ dataset show that the proposed architecture not only improves recognition accuracy but also reduces training time by over 30% compared to standard GRUs. Across input features such as MFCCs and fMLLR, the revised GRU consistently delivered better results.
The results show a notable decrease in the Word Error Rate (WER) and Phone Error Rate (PER), which establishes the proposed architecture as a promising candidate for efficient speech recognition modeling.
Implications and Future Directions
The simplification of the GRU and the adoption of ReLU activations address both theoretical and practical aspects of RNNs for speech recognition. The reduced complexity enables faster training, which is critical for deploying models in real-world applications. These improvements position the revised GRU as an attractive alternative for speech processing and invite exploration on larger datasets and in end-to-end architectures.
Future research might validate these findings on larger tasks such as the LibriSpeech or Switchboard datasets and optimize computational efficiency across different hardware platforms. Integration into modern end-to-end frameworks, including CTC and attention-based models, offers another avenue for broader applicability and continued improvements in speech recognition technology.