- The paper introduces a method to train attention-based sequence-to-sequence ASR models to directly minimize the expected word error rate (WER), addressing the discrepancy between standard training criteria and the target evaluation metric.
- The authors demonstrate that approximating the expected WER using N-best lists of decoded hypotheses is more effective than sampling, yielding relative WER improvements of up to 8.2% over cross-entropy-trained baselines.
- This minimum WER training enables grapheme-based attention models to reach performance parity with conventional state-of-the-art context-dependent phone (CD-phone) based ASR systems, highlighting the approach's relevance for real-world applications.
Minimum Word Error Rate Training for Attention-based Sequence-to-Sequence Models
This paper by Prabhavalkar et al. introduces a method for optimizing attention-based sequence-to-sequence models to directly minimize the expected word error rate (WER). It addresses the long-standing discrepancy between the typical training criterion for ASR models, usually cross-entropy, which maximizes the log-likelihood of the training data, and the metric by which systems are actually evaluated, WER. The work explores how to align the two more closely by minimizing the expected WER during training.
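Concretely, the objective is the expected number of word errors under the model's output distribution; in notation close to the paper's, for an input utterance $\mathbf{x}$ and reference transcript $\mathbf{y}^*$,

$$\mathcal{L}_{\text{werr}}(\mathbf{x}, \mathbf{y}^*) = \mathbb{E}\big[\mathcal{W}(\mathbf{y}, \mathbf{y}^*)\big] = \sum_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x})\, \mathcal{W}(\mathbf{y}, \mathbf{y}^*),$$

where $\mathcal{W}(\mathbf{y}, \mathbf{y}^*)$ counts word errors between hypothesis $\mathbf{y}$ and the reference. The sum over all possible label sequences is intractable, which motivates the approximations discussed next.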
The authors investigate two approximations of this expectation: sampling from the model's output distribution and using N-best lists of decoded hypotheses. Their empirical results show that the N-best approximation is more effective, improving performance by up to 8.2% relative to cross-entropy-trained baselines. Restricting the loss to N-best hypotheses also mirrors conventional practice in ASR sequence training: it concentrates the objective on viable, high-probability competing hypotheses rather than on randomly sampled sequences, many of which have very low probability.
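A minimal PyTorch-style sketch of the N-best approximation is shown below. The tensor shapes, the interpolation weight `lam`, and the helper name `mwer_loss_nbest` are illustrative assumptions, not the authors' implementation; the structure, however, follows the paper: renormalize hypothesis probabilities over the N-best list, weight the mean-subtracted word-error counts by these probabilities, and interpolate with the cross-entropy loss for training stability.

```python
# Hypothetical sketch of the N-best minimum-WER loss described in the paper.
# `log_probs` holds per-hypothesis log P(y_i | x) summed over decoder steps;
# `word_errors` holds edit-distance word-error counts against the reference.
import torch

def mwer_loss_nbest(log_probs: torch.Tensor,    # [batch, N] log P(y_i | x)
                    word_errors: torch.Tensor,  # [batch, N] W(y_i, y*)
                    ce_loss: torch.Tensor,      # scalar cross-entropy loss
                    lam: float = 0.01) -> torch.Tensor:
    """Expected word errors over an N-best list, interpolated with CE."""
    # Renormalize probabilities over the N-best hypotheses so they sum to 1.
    probs = torch.softmax(log_probs, dim=-1)
    # Subtract the mean error count over the list (a variance-reducing baseline).
    relative_errors = word_errors - word_errors.mean(dim=-1, keepdim=True)
    expected_wer = (probs * relative_errors).sum(dim=-1).mean()
    # Interpolate with cross-entropy to stabilize training.
    return expected_wer + lam * ce_loss
```

Subtracting the mean error count over the N-best list acts as a baseline: it reduces the variance of the gradient estimate without changing which hypotheses are rewarded relative to one another.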
Attention-Based Sequence-to-Sequence Models
The attention-based models in question consist of an encoder network, a decoder network, and an attention mechanism that connects them. The models emit graphemes directly rather than phonemes, eliminating the need for curated pronunciation dictionaries and separate text-normalization modules. The paper contrasts two encoder structures, uni-directional and bi-directional LSTM networks, both enhanced with multi-headed attention, which improves performance by allowing the decoder to focus on multiple input locations simultaneously.
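A rough sketch of this structure is given below. It is not the paper's configuration (layer counts, hidden sizes, and the grapheme vocabulary size are placeholders), but it illustrates how an LSTM encoder, a multi-headed attention module, and an LSTM decoder fit together in a LAS-style grapheme model.

```python
# Illustrative LAS-style encoder/decoder with multi-headed attention.
# All dimensions are placeholders, not the authors' settings.
import torch
import torch.nn as nn

class LASSketch(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=75,
                 num_heads=4, bidirectional=False):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=bidirectional)
        enc_out = hidden * (2 if bidirectional else 1)
        # Multi-headed attention lets the decoder attend to several
        # encoder locations at once.
        self.attention = nn.MultiheadAttention(embed_dim=enc_out,
                                               num_heads=num_heads,
                                               batch_first=True)
        self.embed = nn.Embedding(vocab, enc_out)
        self.decoder = nn.LSTM(enc_out * 2, hidden, batch_first=True)
        self.output = nn.Linear(hidden, vocab)  # grapheme posteriors

    def forward(self, features, prev_tokens):
        enc, _ = self.encoder(features)               # "listen"
        query = self.embed(prev_tokens)               # previous graphemes
        context, _ = self.attention(query, enc, enc)  # "attend"
        dec, _ = self.decoder(torch.cat([query, context], dim=-1))
        return self.output(dec)                       # "spell"
```

Switching `bidirectional` between `False` and `True` corresponds to the two encoder variants compared in the paper, with the bi-directional encoder seeing the full utterance at the cost of streaming operation.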
Experimental Results
The experiments were conducted on a large corpus of Google voice-search traffic, augmented with noise and reverberation for robustness. Two attention models, with uni-directional and bi-directional encoders, were evaluated, and minimum WER training yielded significant gains. Notably, the uni-directional Listen, Attend and Spell (LAS) models trained with the proposed criterion reached parity with state-of-the-art context-dependent phoneme-based recognition systems, demonstrating the efficacy of grapheme-based models optimized directly for WER.
The practical implication of this research is that end-to-end models can be trained directly to minimize the evaluation metric that matters in deployment, improving user experience for real-world ASR applications while reducing reliance on external language models and pronunciation lexicons. Theoretically, the approach helps bridge the gap between likelihood-based training objectives and task-specific performance criteria.
Future Directions
This work demonstrates that sequence-to-sequence models can compete with traditional CD-phone-based models on specific tasks. The methodology may stimulate broader adoption of sequence-level training in ASR and could inform training practices in other sequence prediction domains such as machine translation. Further studies could combine this approach with reinforcement learning techniques, yielding adaptive systems that continually optimize task-specific criteria.
Overall, this paper presents a valuable contribution to the field of automatic speech recognition, offering a compelling approach to training neural models to directly minimize the evaluation metric of interest.