Feature Learning in Deep Neural Networks: A Focus on Speech Recognition
The paper "Feature Learning in Deep Neural Networks -- Studies on Speech Recognition Tasks" presents a comprehensive analysis of why deep neural networks (DNNs) outperform traditional models such as Gaussian mixture models (GMMs) on large vocabulary speech recognition tasks. The authors, Dong Yu, Michael L. Seltzer, Jinyu Li, Jui-Ting Huang, and Frank Seide, attribute the superior performance of DNNs to their ability to learn robust, discriminative features that are less sensitive to variations in the input data, such as those present in speech signals.
Key Findings
The paper's primary assertion is that deeper neural networks outperform their shallower counterparts due to their capacity to learn invariant features across multiple processing layers. This is empirically validated through a series of experiments demonstrating significant reductions in word error rates (WER) on standard datasets, notably the Switchboard (SWB) corpus. The findings suggest a 28% relative error reduction when employing DNNs compared to conventional GMM-HMM systems, underscoring the discriminative power of DNNs as they gain depth.
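As a quick illustration of how a relative error reduction figure of this kind is computed (the WER values below are made up for the arithmetic, not taken from the paper):

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative reduction in word error rate, expressed as a fraction."""
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical pair: a 25.0% GMM-HMM baseline improved to 18.0% by a DNN
reduction = relative_wer_reduction(25.0, 18.0)
print(f"{reduction:.0%}")  # prints 28%
```

Note that a relative reduction is measured against the baseline's own error rate, so the same absolute improvement looks larger when the baseline is already strong.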
The paper also highlights the limitations of DNNs in handling test samples that are significantly different from the training data, pointing to the need for representative sample coverage during training. This observation was supported by experiments on mixed-bandwidth ASR, where DNNs could not generalize to narrowband speech unless they were trained with mixed-bandwidth data.
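A minimal sketch of the kind of feature layout that makes mixed-bandwidth training possible (the filter counts below are hypothetical, not the paper's configuration): narrowband frames lack the upper filterbank channels, so they are zero-filled to match the wideband input dimensionality, letting both kinds of data be pooled into one training set.

```python
N_FILTERS_WIDEBAND = 29    # hypothetical mel filters covering 0-8 kHz
N_FILTERS_NARROWBAND = 22  # hypothetical subset covering only 0-4 kHz

def pad_narrowband(frame):
    """Zero-fill the missing upper-band channels of a narrowband frame
    so it matches the wideband input layout expected by the DNN."""
    assert len(frame) == N_FILTERS_NARROWBAND
    return frame + [0.0] * (N_FILTERS_WIDEBAND - N_FILTERS_NARROWBAND)

wide_frame = [0.5] * N_FILTERS_WIDEBAND
narrow_frame = pad_narrowband([0.5] * N_FILTERS_NARROWBAND)
assert len(narrow_frame) == len(wide_frame)  # both feed the same input layer
```

The design choice here is that one network sees both bandwidths during training, which is what lets it generalize to narrowband test speech.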
Environmental and Speaker Robustness
Beyond invariance to small perturbations, the paper explores the robustness of DNNs to speaker variability and environmental distortions. Remarkably, DNNs achieved, without adaptation, a level of speaker robustness that traditionally required additional techniques such as Vocal Tract Length Normalization (VTLN) and feature-space Discriminative Linear Regression (fDLR) in GMM-HMM systems. The reduced impact of these adaptation techniques on DNNs suggests an inherent ability of deep networks to abstract speaker-invariant features.
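To make the adaptation baseline concrete, the following is a minimal, hypothetical sketch of an fDLR-style transform: an affine map estimated per speaker and applied to input features before a fixed acoustic model. For simplicity the transform here is diagonal and is fit to synthetic "canonical" targets rather than to gradients from an actual frozen DNN; all names and numbers are illustrative, not the paper's setup.

```python
def apply_fdlr(frame, scale, bias):
    """Apply a diagonal fDLR-style affine transform to one feature frame."""
    return [w * x + b for w, x, b in zip(scale, frame, bias)]

# Synthetic per-speaker data: 50 frames of 3-dim features, plus
# "canonical" targets standing in for the adaptation signal that would
# normally come from the frozen DNN's training criterion.
dim = 3
frames = [[0.02 * i + 0.1 * d for d in range(dim)] for i in range(50)]
targets = [[1.5 * x + 0.3 for x in f] for f in frames]

scale = [1.0] * dim  # initialize at the identity transform
bias = [0.0] * dim
lr = 1.0

# Full-batch gradient descent on squared error; only the transform's
# parameters are updated, mirroring how fDLR leaves the network fixed.
for _ in range(300):
    grad_s, grad_b = [0.0] * dim, [0.0] * dim
    for f, t in zip(frames, targets):
        y = apply_fdlr(f, scale, bias)
        for d in range(dim):
            err = y[d] - t[d]
            grad_s[d] += err * f[d]
            grad_b[d] += err
    for d in range(dim):
        scale[d] -= lr * grad_s[d] / len(frames)
        bias[d] -= lr * grad_b[d] / len(frames)
```

After fitting, scale and bias recover the synthetic speaker mismatch (about 1.5 and 0.3 per dimension). The paper's point is that DNNs need this kind of per-speaker machinery far less than GMM-HMM systems do.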
Regarding noise robustness, DNNs trained on the Aurora 4 corpus achieved results on par with, or superior to, systems employing complex adaptation and noise compensation strategies. These results were attained without iterative recognition passes or explicit environment adaptation, emphasizing the practical advantages of DNNs in operational speech recognition systems.
Implications and Future Directions
This research illustrates the potential of DNNs to transform feature representation and classification in automatic speech recognition (ASR), paving the way for models that can operate effectively with reduced preprocessing and adaptation complexity. The implications extend to various applications where robustness to variability and adaptation efficiency are critical, including real-time voice search and interactive voice response systems.
Future work would benefit from theoretical exploration into why DNNs develop such robustness, alongside investigations into the scalability of these models to multilingual and cross-domain ASR tasks. Additionally, the interplay between network architecture, data diversity, and the generalization capabilities of DNNs warrants further study, which could lead to foundational improvements in model design and training paradigms.
In conclusion, this paper provides compelling evidence of the effectiveness of deep neural networks in learning robust feature representations for speech recognition tasks. It serves as a valuable resource for researchers aiming to leverage deep learning in developing advanced ASR systems that are resilient to environmental and speaker-induced variabilities.