Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition (1410.4281v2)

Published 16 Oct 2014 in cs.CL and cs.NE

Abstract: Long short-term memory (LSTM) based acoustic modeling methods have recently been shown to give state-of-the-art performance on some speech recognition tasks. To achieve further performance improvements, this research investigates deep extensions of LSTM, considering that deep hierarchical models have proven more efficient than shallow ones. Motivated by previous research on constructing deep recurrent neural networks (RNNs), alternative deep LSTM architectures are proposed and empirically evaluated on a large vocabulary conversational telephone speech recognition task. In addition, the training process for LSTM networks on multi-GPU devices is introduced and discussed. Experimental results demonstrate that the deep LSTM networks benefit from the depth and yield state-of-the-art performance on this task.

Overview of LSTM-based Deep Recurrent Neural Networks for Speech Recognition

The paper by Xiangang Li and Xihong Wu presents an in-depth study of enhancing the performance of acoustic models in large vocabulary speech recognition (LVSR) through Long Short-Term Memory (LSTM) based deep recurrent neural networks (RNNs). The research targets the inherent challenges of traditional RNNs, specifically the vanishing and exploding gradient problems, and proposes deep LSTM architectures as a solution. The authors empirically evaluate these architectures on a Mandarin Chinese conversational telephone speech recognition task, demonstrating significant performance improvements.

Deep Extensions of LSTM Networks

The main contribution of the paper stems from the exploration of deep hierarchical architectures for LSTM networks, moving beyond their inherently temporal depth to enhance their spatial depth. The research identifies and implements various strategies for increasing the depth of LSTM networks, focusing on three primary dimensions: input-to-hidden transitions, hidden-to-hidden transitions, and hidden-to-output functions. Notably, the paper introduces architectures such as LSTM with input projection (LSTM-IP) and output projection (LSTM-OP) layers, as well as combinations of feed-forward networks with LSTM layers.
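As a concrete illustration of these spatial-depth dimensions, the sketch below stacks LSTM layers with a linear output projection (in the spirit of LSTM-OP) and places additional feed-forward layers between the recurrent block and the softmax output, deepening the hidden-to-output function. It uses PyTorch's built-in projected LSTM; all layer names and sizes are illustrative assumptions and do not reproduce the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DeepLSTMAcousticModel(nn.Module):
    """Sketch of a deep LSTM acoustic model: stacked LSTM layers with an
    output projection (LSTM-OP style) followed by feed-forward layers
    before the output layer. Sizes are placeholders, not the paper's."""

    def __init__(self, feat_dim=40, hidden=800, proj=512,
                 num_lstm_layers=2, num_ff_layers=2, num_states=10000):
        super().__init__()
        # Recurrent block: LSTM layers whose hidden output is linearly
        # projected down to `proj` dimensions (output projection).
        self.lstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden,
                            num_layers=num_lstm_layers, proj_size=proj,
                            batch_first=True)
        # Deep hidden-to-output function: extra feed-forward layers
        # between the last recurrent layer and the output.
        ff = []
        for _ in range(num_ff_layers):
            ff += [nn.Linear(proj, proj), nn.ReLU()]
        self.ff = nn.Sequential(*ff)
        self.out = nn.Linear(proj, num_states)  # frame-level state posteriors

    def forward(self, x):            # x: (batch, time, feat_dim)
        h, _ = self.lstm(x)          # (batch, time, proj)
        return self.out(self.ff(h))  # frame-level logits
```

An analogous variant with a projection or feed-forward layer placed before the first LSTM layer would correspond to deepening the input-to-hidden transition (LSTM-IP) instead.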

Experimental Validation

Using a large vocabulary Mandarin Chinese conversational telephone speech recognition dataset, the paper rigorously evaluates several deep LSTM architectures. The results demonstrate that deep architectures, notably LSTM-OP followed by additional feed-forward layers, yield a significant reduction in Character Error Rate (CER) compared to both shallow LSTM and DNN baselines. The LSTM-OP approach, particularly when combined with a deep hidden-to-output function, achieves a CER reduction of up to 8.87% relative to state-of-the-art DNN approaches. These findings underscore the efficacy of incorporating depth into LSTM networks for acoustic modeling in speech recognition tasks.
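For reference, the 8.87% figure is a relative reduction, i.e. the improvement expressed as a fraction of the baseline error rate. A minimal sketch of that computation, using made-up CER values rather than the paper's numbers, is:

```python
def relative_cer_reduction(cer_baseline: float, cer_model: float) -> float:
    """Relative CER reduction, as a percentage of the baseline CER."""
    return 100.0 * (cer_baseline - cer_model) / cer_baseline

# Hypothetical illustration: dropping from 30.0% to 27.3% CER
# is roughly a 9% relative reduction.
print(relative_cer_reduction(30.0, 27.3))  # ~9.0
```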

Technical Implementation

Recognizing the computational complexity inherent in training deep LSTM networks, the paper details a GPU-based implementation employing asynchronous stochastic gradient descent (ASGD) to expedite training. The authors utilize a multi-GPU architecture coupled with the truncated back-propagation through time (BPTT) algorithm, optimizing training efficiency and making large-scale data processing feasible.
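The snippet below is a minimal sketch of truncated BPTT for frame-level LSTM training, assuming a PyTorch implementation: each utterance is processed in fixed-length chunks, and the recurrent state is carried across chunk boundaries but detached so that gradients only propagate within a chunk. The asynchronous multi-GPU update scheme itself is not reproduced here, and all sizes and data are placeholders.

```python
import torch
import torch.nn as nn

# Toy model and dummy data for illustration only.
lstm = nn.LSTM(input_size=40, hidden_size=256, batch_first=True)
out = nn.Linear(256, 1000)
opt = torch.optim.SGD(list(lstm.parameters()) + list(out.parameters()), lr=0.01)

feats = torch.randn(4, 200, 40)             # (batch, frames, feature dim)
labels = torch.randint(0, 1000, (4, 200))   # frame-level state labels
chunk = 20                                   # truncation length (illustrative)
state = None

for t in range(0, feats.size(1), chunk):
    x = feats[:, t:t + chunk]
    y = labels[:, t:t + chunk]
    h, state = lstm(x, state)
    state = tuple(s.detach() for s in state)  # stop gradients at the chunk boundary
    loss = nn.functional.cross_entropy(out(h).transpose(1, 2), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```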

Implications and Future Directions

The implications of this research are both practical and theoretical. Practically, the proposed deep LSTM architectures have the potential to improve the efficiency and accuracy of speech recognition systems, making them more viable for real-world applications where large vocabulary and conversational nuances are critical. Theoretically, the exploration of network depth in spatial dimensions opens avenues for further research into complex hierarchical structures in neural network architectures.

Future research could focus on integrating additional sophisticated network layers, such as multiple projection layers or combining maxout units with LSTM cells, to assess their impact on model efficacy. It will be intriguing to see how these architectures perform in broader contexts, such as multilingual speech recognition tasks or in different domains of sequence modeling.

Overall, this work provides a comprehensive evaluation and novel methodologies to advance the field of LVSR through LSTM-based deep RNNs, demonstrating the benefits of network depth and laying the groundwork for future explorations in this area.

Authors (2)
  1. Xiangang Li (46 papers)
  2. Xihong Wu (22 papers)
Citations (300)