- The paper presents a simplified ASR framework that trains a deep bidirectional RNN with the CTC objective, eliminating the need for pre-generated frame labels.
- WFST-based decoding integrates lexicons and language models effectively, reducing the real-time factor from 2.06 to 0.64 relative to a hybrid HMM/DNN baseline.
- Experiments on the WSJ corpus show both phoneme- and character-based systems achieving WERs below 8%, demonstrating the framework's competitiveness and practical viability.
EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding
The paper "EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding," authored by Yajie Miao, Mohammad Gowayyed, and Florian Metze, introduces the Eesen framework, which significantly streamlines the process of building state-of-the-art automatic speech recognition (ASR) systems. Eesen leverages Recurrent Neural Networks (RNNs), specifically Long Short-Term Memory (LSTM) units, and utilizes the connectionist temporal classification (CTC) objective function to train models without requiring pre-generated frame labels. This approach eliminates much of the complexity associated with traditional ASR system development.
Key Contributions
- Simplified Acoustic Modeling: The Eesen framework employs a single deep bidirectional RNN for acoustic modeling, targeting context-independent labels such as phonemes or characters. The training process uses the CTC objective function, allowing the model to learn the alignment between speech and label sequences autonomously.
- WFST-based Decoding: Eesen's distinctive feature is its generalized decoding approach using weighted finite-state transducers (WFSTs). This method integrates lexicons and language models (LMs) into the CTC decoding process efficiently. The WFST representation provides a clear framework to accommodate the CTC blank label and supports beam search during decoding, enhancing both effectiveness and efficiency.
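The core CTC mapping that the token WFST must encode is the "collapse" rule: merge consecutive repeats of a label, then drop blanks. The sketch below shows this rule procedurally, together with greedy best-path decoding, purely for intuition; the function names and the `<blk>` symbol are illustrative assumptions, not Eesen's actual API.

```python
# Sketch of the CTC collapse rule: merge runs of identical labels,
# then remove blank symbols. Eesen's token WFST encodes exactly this
# mapping as a transducer; names here are illustrative.

BLANK = "<blk>"

def collapse_ctc_path(path):
    """Map a per-frame CTC path to its output label sequence."""
    collapsed = []
    prev = None
    for label in path:
        if label != prev:          # merge consecutive repeated labels
            collapsed.append(label)
        prev = label
    return [l for l in collapsed if l != BLANK]  # drop blanks

def greedy_decode(posteriors, labels):
    """Best-path decoding: argmax label per frame, then collapse."""
    path = [labels[max(range(len(frame)), key=frame.__getitem__)]
            for frame in posteriors]
    return collapse_ctc_path(path)
```

Note that a repeated label separated by a blank survives the collapse (e.g. `t <blk> t` yields `t t`), which is how CTC distinguishes doubled characters from a single held one.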
Experimental Validation
The experiments conducted on the Wall Street Journal (WSJ) corpus demonstrate that the Eesen framework achieves word error rates (WERs) comparable to strong hybrid HMM/DNN baselines while significantly improving decoding speed. The results are benchmarked on the eval92 set, showing that the phoneme-based Eesen system reaches a WER of 7.87% when both the lexicon and language model are applied in decoding. This value contrasts starkly with the 26.92% WER observed when only the lexicon is used, underscoring the efficacy of the WFST-based decoding in integrating language models.
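For concreteness, WER is the word-level Levenshtein (edit) distance between hypothesis and reference, divided by the reference length. A minimal self-contained implementation (not Eesen's scoring tool) looks like this:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i               # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j               # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```

A reported WER of 7.87% thus means roughly 7.87 word errors (substitutions, insertions, or deletions) per 100 reference words.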
Character-Based Systems
The paper extends its investigation to character-based systems, showing that this approach also yields competitive performance. With an expanded vocabulary and re-trained language models, the character-based system achieves a WER of 7.34%. This outperforms previously reported results for purely end-to-end systems on the same dataset, highlighting the robustness and flexibility of the Eesen framework.
Practical and Theoretical Implications
The simplification introduced by Eesen addresses several challenges associated with traditional ASR systems, such as the dependency on multiple resources and the stepwise training procedures. This simplification has practical implications, particularly for low-resource languages and rapid prototyping scenarios, where building GMM models and generating frame-level alignments would be prohibitively costly.
From a theoretical perspective, Eesen's integration of WFST-based decoding with the CTC-trained models bridges a significant gap in end-to-end ASR research, providing a unified approach that benefits from the strengths of both paradigms. Moreover, the significant speedup in decoding—a reduction of the real-time factor from 2.06 to 0.64—demonstrates the practical viability of deploying such systems in real-world applications.
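The real-time factor (RTF) cited above is the ratio of processing time to audio duration, so an RTF below 1 means faster-than-real-time decoding. A small sketch making the reported speedup explicit:

```python
def real_time_factor(decode_seconds, audio_seconds):
    """RTF = processing time / audio duration; < 1 is faster than real time."""
    return decode_seconds / audio_seconds

# The reported drop from 2.06 to 0.64 corresponds to roughly a 3.2x speedup:
speedup = 2.06 / 0.64  # ~3.22
```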
Future Developments
The authors hint at several avenues for future research. One area of interest involves enhancing the WERs of Eesen systems using advanced learning techniques and alternative decoding methods. Moreover, expanding Eesen's applicability to various languages and speech types (e.g., noisy or far-field) could provide deeper insights into the efficacy of end-to-end ASR systems under diverse conditions. Another direction is investigating new speaker adaptation and adaptive training techniques tailored for CTC models, given the non-reliance on GMM-based front-ends in the current setup.
Conclusion
The Eesen framework represents a significant progression in the field of end-to-end ASR by integrating deep RNN models with WFST-based decoding. Its open-source nature allows for continuous improvement and benchmarking, providing a robust foundation for future research in acoustic modeling and ASR system development. The comprehensive experimental validation on the WSJ corpus, coupled with the framework's efficiency and competitive accuracy, positions Eesen as a valuable contribution to the contemporary discourse in speech recognition technology.