- The paper introduces a method to train attention-based sequence-to-sequence ASR models to directly minimize the expected word error rate (WER), addressing the discrepancy between standard training criteria and the target evaluation metric.
- The authors demonstrate that approximating the expected WER using N-best lists of decoded hypotheses is more effective than sampling, yielding relative WER improvements of up to 8.2% over cross-entropy-trained baselines.
- This minimum WER training enables grapheme-based attention models to reach performance parity with conventional state-of-the-art context-dependent phone (CD-phone) based ASR systems, highlighting the approach's relevance for real-world applications.
Minimum Word Error Rate Training for Attention-based Sequence-to-Sequence Models
This paper by Prabhavalkar et al. introduces a method for optimizing attention-based sequence-to-sequence models to directly minimize the expected word error rate (WER). It addresses the long-standing discrepancy between the typical training criterion for ASR models, usually cross-entropy, which maximizes the log-likelihood of the training data, and the metric by which systems are actually evaluated, WER. The work explores how to align the two more closely by minimizing the expected WER during training.
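Concretely, the objective is the expected number of word errors under the model's output distribution; in notation close to the paper's, for an input utterance $\mathbf{x}$ and reference transcript $\mathbf{y}^*$,

$$\mathcal{L}_{\text{werr}}(\mathbf{x}, \mathbf{y}^*) = \mathbb{E}\big[\mathcal{W}(\mathbf{y}, \mathbf{y}^*)\big] = \sum_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x})\, \mathcal{W}(\mathbf{y}, \mathbf{y}^*),$$

where $\mathcal{W}(\mathbf{y}, \mathbf{y}^*)$ counts word errors between hypothesis $\mathbf{y}$ and the reference. The sum over all possible label sequences is intractable, which motivates the approximations discussed next.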
The authors investigate two approximations of this expectation: sampling from the model's output distribution and using N-best lists of decoded hypotheses. Their empirical results show that the N-best approximation is more effective, improving performance by up to 8.2% relative to cross-entropy-trained baselines. Restricting the loss to N-best hypotheses also mirrors conventional practice in ASR sequence training: it concentrates the objective on viable, high-probability competing hypotheses rather than on randomly sampled sequences, many of which have very low probability.
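A minimal PyTorch-style sketch of the N-best approximation is shown below. The tensor shapes, the interpolation weight `lam`, and the helper name `mwer_loss_nbest` are illustrative assumptions, not the authors' implementation; the structure, however, follows the paper: renormalize hypothesis probabilities over the N-best list, weight the mean-subtracted word-error counts by these probabilities, and interpolate with the cross-entropy loss for training stability.

```python
# Hypothetical sketch of the N-best minimum-WER loss described in the paper.
# `log_probs` holds per-hypothesis log P(y_i | x) summed over decoder steps;
# `word_errors` holds edit-distance word-error counts against the reference.
import torch

def mwer_loss_nbest(log_probs: torch.Tensor,    # [batch, N] log P(y_i | x)
                    word_errors: torch.Tensor,  # [batch, N] W(y_i, y*)
                    ce_loss: torch.Tensor,      # scalar cross-entropy loss
                    lam: float = 0.01) -> torch.Tensor:
    """Expected word errors over an N-best list, interpolated with CE."""
    # Renormalize probabilities over the N-best hypotheses so they sum to 1.
    probs = torch.softmax(log_probs, dim=-1)
    # Subtract the mean error count over the list (a variance-reducing baseline).
    relative_errors = word_errors - word_errors.mean(dim=-1, keepdim=True)
    expected_wer = (probs * relative_errors).sum(dim=-1).mean()
    # Interpolate with cross-entropy to stabilize training.
    return expected_wer + lam * ce_loss
```

Subtracting the mean error count over the N-best list acts as a baseline: it reduces the variance of the gradient estimate without changing which hypotheses are rewarded relative to one another.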
Attention-Based Sequence-to-Sequence Models
The attention-based models in question consist of an encoder network, a decoder network, and an attention mechanism that connects them. The models emit graphemes directly rather than phonemes, eliminating the need for curated pronunciation dictionaries and separate text-normalization modules. The paper contrasts two encoder structures, uni-directional and bi-directional LSTM networks, both enhanced with multi-headed attention, which improves performance by allowing the decoder to focus on multiple input locations simultaneously.
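A rough sketch of this structure is given below. It is not the paper's configuration (layer counts, hidden sizes, and the grapheme vocabulary size are placeholders), but it illustrates how an LSTM encoder, a multi-headed attention module, and an LSTM decoder fit together in a LAS-style grapheme model.

```python
# Illustrative LAS-style encoder/decoder with multi-headed attention.
# All dimensions are placeholders, not the authors' settings.
import torch
import torch.nn as nn

class LASSketch(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=75,
                 num_heads=4, bidirectional=False):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=bidirectional)
        enc_out = hidden * (2 if bidirectional else 1)
        # Multi-headed attention lets the decoder attend to several
        # encoder locations at once.
        self.attention = nn.MultiheadAttention(embed_dim=enc_out,
                                               num_heads=num_heads,
                                               batch_first=True)
        self.embed = nn.Embedding(vocab, enc_out)
        self.decoder = nn.LSTM(enc_out * 2, hidden, batch_first=True)
        self.output = nn.Linear(hidden, vocab)  # grapheme posteriors

    def forward(self, features, prev_tokens):
        enc, _ = self.encoder(features)               # "listen"
        query = self.embed(prev_tokens)               # previous graphemes
        context, _ = self.attention(query, enc, enc)  # "attend"
        dec, _ = self.decoder(torch.cat([query, context], dim=-1))
        return self.output(dec)                       # "spell"
```

Switching `bidirectional` between `False` and `True` corresponds to the two encoder variants compared in the paper, with the bi-directional encoder seeing the full utterance at the cost of streaming operation.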
Experimental Results
The experiments were conducted on a large corpus of Google voice-search traffic, augmented with noise and reverberation for robustness. Two attention models, with uni-directional and bi-directional encoders, were evaluated, and minimum WER training yielded significant gains. Notably, the uni-directional Listen, Attend and Spell (LAS) models trained with the proposed criterion reached parity with state-of-the-art context-dependent phoneme-based recognition systems, demonstrating the efficacy of grapheme-based models optimized directly for WER.
The practical implication of this research is that end-to-end models can be trained directly to minimize the evaluation metric that matters in deployment, improving user experience for real-world ASR applications while reducing reliance on external language models and pronunciation lexicons. Theoretically, the approach helps bridge the gap between likelihood-based training objectives and task-specific performance criteria.
Future Directions
This work demonstrates that sequence-to-sequence models can compete with traditional CD-phone-based models on specific tasks. The methodology may stimulate broader adoption of sequence-level training in ASR and could inform training practices in other sequence prediction domains such as machine translation. Further studies could combine this approach with reinforcement learning techniques, yielding adaptive systems that continually optimize task-specific criteria.
Overall, this paper presents a valuable contribution to the field of automatic speech recognition, offering a compelling approach to training neural models to directly minimize the evaluation metric of interest.