- The paper presents a simplified ASR framework that trains a deep bidirectional RNN with the CTC objective, eliminating the need for pre-generated frame labels.
- WFST-based decoding integrates lexicons and language models effectively, reducing the real-time factor from 2.06 to 0.64 relative to a hybrid HMM/DNN baseline.
- Experiments on the WSJ corpus show both phoneme- and character-based systems achieving WERs below 8%, demonstrating the framework's competitiveness and practical viability.
EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding
The paper "EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding," authored by Yajie Miao, Mohammad Gowayyed, and Florian Metze, introduces the Eesen framework, which significantly streamlines the process of building state-of-the-art automatic speech recognition (ASR) systems. Eesen leverages Recurrent Neural Networks (RNNs), specifically Long Short-Term Memory (LSTM) units, and utilizes the connectionist temporal classification (CTC) objective function to train models without requiring pre-generated frame labels. This approach eliminates much of the complexity associated with traditional ASR system development.
Key Contributions
- Simplified Acoustic Modeling: The Eesen framework employs a single deep bidirectional RNN for acoustic modeling, targeting context-independent labels such as phonemes or characters. The training process uses the CTC objective function, allowing the model to learn the alignment between speech and label sequences autonomously.
- WFST-based Decoding: Eesen's distinctive feature is its generalized decoding approach using weighted finite-state transducers (WFSTs). This method integrates lexicons and language models (LMs) into the CTC decoding process efficiently. The WFST representation provides a clear framework to accommodate the CTC blank label and supports beam search during decoding, enhancing both effectiveness and efficiency.
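The core CTC mapping that the token WFST must encode is the "collapse" rule: merge consecutive repeats of a label, then drop blanks. The sketch below shows this rule procedurally, together with greedy best-path decoding, purely for intuition; the function names and the `<blk>` symbol are illustrative assumptions, not Eesen's actual API.

```python
# Sketch of the CTC collapse rule: merge runs of identical labels,
# then remove blank symbols. Eesen's token WFST encodes exactly this
# mapping as a transducer; names here are illustrative.

BLANK = "<blk>"

def collapse_ctc_path(path):
    """Map a per-frame CTC path to its output label sequence."""
    collapsed = []
    prev = None
    for label in path:
        if label != prev:          # merge consecutive repeated labels
            collapsed.append(label)
        prev = label
    return [l for l in collapsed if l != BLANK]  # drop blanks

def greedy_decode(posteriors, labels):
    """Best-path decoding: argmax label per frame, then collapse."""
    path = [labels[max(range(len(frame)), key=frame.__getitem__)]
            for frame in posteriors]
    return collapse_ctc_path(path)
```

Note that a repeated label separated by a blank survives the collapse (e.g. `t <blk> t` yields `t t`), which is how CTC distinguishes doubled characters from a single held one.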
Experimental Validation
The experiments conducted on the Wall Street Journal (WSJ) corpus demonstrate that the Eesen framework achieves word error rates (WERs) comparable to strong hybrid HMM/DNN baselines while significantly improving decoding speed. The results are benchmarked on the eval92 set, showing that the phoneme-based Eesen system reaches a WER of 7.87% when both the lexicon and language model are applied in decoding. This value contrasts starkly with the 26.92% WER observed when only the lexicon is used, underscoring the efficacy of the WFST-based decoding in integrating language models.
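For concreteness, WER is the word-level Levenshtein (edit) distance between hypothesis and reference, divided by the reference length. A minimal self-contained implementation (not Eesen's scoring tool) looks like this:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i               # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j               # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```

A reported WER of 7.87% thus means roughly 7.87 word errors (substitutions, insertions, or deletions) per 100 reference words.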
Character-Based Systems
The paper extends its investigation to character-based systems, showing that this approach also yields competitive performance. With an expanded vocabulary and re-trained language models, the character-based system achieves a WER of 7.34%. This outperforms previously reported results for purely end-to-end systems on the same dataset, highlighting the robustness and flexibility of the Eesen framework.
Practical and Theoretical Implications
The simplification introduced by Eesen addresses several challenges associated with traditional ASR systems, such as the dependency on multiple resources and the stepwise training procedures. This simplification has practical implications, particularly for low-resource languages and rapid prototyping scenarios, where building GMM models and generating frame-level alignments would be prohibitively costly.
From a theoretical perspective, Eesen's integration of WFST-based decoding with the CTC-trained models bridges a significant gap in end-to-end ASR research, providing a unified approach that benefits from the strengths of both paradigms. Moreover, the significant speedup in decoding—a reduction of the real-time factor from 2.06 to 0.64—demonstrates the practical viability of deploying such systems in real-world applications.
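The real-time factor (RTF) cited above is the ratio of processing time to audio duration, so an RTF below 1 means faster-than-real-time decoding. A small sketch making the reported speedup explicit:

```python
def real_time_factor(decode_seconds, audio_seconds):
    """RTF = processing time / audio duration; < 1 is faster than real time."""
    return decode_seconds / audio_seconds

# The reported drop from 2.06 to 0.64 corresponds to roughly a 3.2x speedup:
speedup = 2.06 / 0.64  # ~3.22
```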
Future Developments
The authors hint at several avenues for future research. One area of interest involves enhancing the WERs of Eesen systems using advanced learning techniques and alternative decoding methods. Moreover, expanding Eesen's applicability to various languages and speech types (e.g., noisy or far-field) could provide deeper insights into the efficacy of end-to-end ASR systems under diverse conditions. Another direction is investigating new speaker adaptation and adaptive training techniques tailored for CTC models, given the non-reliance on GMM-based front-ends in the current setup.
Conclusion
The Eesen framework represents a significant progression in the field of end-to-end ASR by integrating deep RNN models with WFST-based decoding. Its open-source nature allows for continuous improvement and benchmarking, providing a robust foundation for future research in acoustic modeling and ASR system development. The comprehensive experimental validation on the WSJ corpus, coupled with the framework's efficiency and competitive accuracy, positions Eesen as a valuable contribution to the contemporary discourse in speech recognition technology.