Bridging the Gap between Training and Inference for Neural Machine Translation
The paper presents a novel approach to addressing exposure bias in Neural Machine Translation (NMT), where the mismatch between the training and inference stages often degrades performance. Conventional NMT models predict target words sequentially: during training, each prediction is conditioned on the ground-truth context words, whereas at inference the model must condition on its own previous predictions. This discrepancy, known as exposure bias, causes errors to accumulate along the generated sequence.
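To make the mismatch concrete, here is a minimal sketch contrasting teacher-forced training with free-running inference for a generic autoregressive decoder; `decoder_step`, `cross_entropy`, `BOS`, and `EOS` are hypothetical placeholders, not the paper's implementation.

```python
BOS, EOS = 1, 2  # hypothetical special-token ids

def teacher_forced_loss(decoder_step, cross_entropy, gold_words, state):
    """Training: step t is conditioned on the ground-truth word y*_{t-1}."""
    total, prev = 0.0, BOS
    for gold in gold_words:
        logits, state = decoder_step(prev, state)
        total += cross_entropy(logits, gold)
        prev = gold                      # context always comes from the reference
    return total

def free_running_decode(decoder_step, state, max_len=100):
    """Inference: step t is conditioned on the model's own prediction, so an
    early mistake is fed back as context and can snowball."""
    output, prev = [], BOS
    for _ in range(max_len):
        logits, state = decoder_step(prev, state)
        prev = int(logits.argmax())      # context comes from the model itself
        output.append(prev)
        if prev == EOS:
            break
    return output
```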
Proposed Method
The authors propose to sample the context words used during training not only from the ground-truth sequence but also from sequences predicted by the model itself, thereby narrowing the gap between training and inference. A distinctive aspect of the method is sentence-level oracle selection, which lets the model accommodate reasonable alternative translations and thus mitigates the overcorrection caused by strict word-by-word matching against the reference.
Key elements of the approach include:
- Oracle Word Selection:
  - Word-Level Oracle: At each decoding step, selects an oracle word from the model's predicted distribution, in the simplest case by a greedy (highest-probability) choice.
  - Sentence-Level Oracle: Runs beam search and picks as the oracle the candidate scoring highest under a sentence-level metric such as BLEU against the reference, which tolerates reasonable alternative translations and helps recover from overcorrection (sketched in code after this list).
- Sampling with Decay: The probability of drawing the context word from the ground truth starts high and decays as training progresses, so the model gradually learns under conditions closer to those at inference.
- Gumbel-Max Technique: Adds stochastic Gumbel noise to the predicted word scores before selecting the oracle word, making the choice a sample rather than a deterministic argmax (also sketched after this list).
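As a rough illustration of the items above, the sketch below combines the decayed ground-truth sampling with a Gumbel-Max word-level oracle; the schedule form and the hyperparameter `mu` are assumptions for illustration, not the paper's exact settings, and `logits_prev` stands for the decoder's unnormalized scores from the previous step.

```python
import numpy as np

rng = np.random.default_rng(0)

def ground_truth_prob(epoch, mu=12.0):
    """Probability of feeding the ground-truth word as context.

    An inverse-sigmoid-style decay in the epoch index, in the spirit of
    'sampling with decay'; the exact schedule and mu are assumed here.
    """
    return mu / (mu + np.exp(epoch / mu))

def gumbel_max_oracle(logits):
    """Word-level oracle via the Gumbel-Max trick: adding Gumbel(0, 1) noise
    to the logits and taking the argmax draws a sample from softmax(logits),
    a more stochastic choice than a plain argmax."""
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(u))
    return int(np.argmax(logits + gumbel_noise))

def choose_context_word(gold_prev, logits_prev, epoch):
    """Pick the context word fed to the next decoding step during training:
    the ground-truth word with probability p, otherwise an oracle word."""
    p = ground_truth_prob(epoch)
    if rng.uniform() < p:
        return gold_prev                   # ground-truth context
    return gumbel_max_oracle(logits_prev)  # model-derived oracle context
```

The same mixing rule can also feed the word at the corresponding position of a sentence-level oracle instead of the word-level one.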
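For the sentence-level oracle, a simplified selection step might rank beam candidates by smoothed sentence-level BLEU; the use of NLTK's `sentence_bleu` here, and the omission of the length-constrained (force) decoding the authors use to align the oracle with the reference, are simplifications.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def select_sentence_oracle(beam_candidates, reference_tokens):
    """Return the beam candidate with the highest smoothed sentence BLEU.

    beam_candidates: list of token lists produced by beam search for one
    source sentence; reference_tokens: the ground-truth target tokens.
    """
    smooth = SmoothingFunction().method1
    return max(
        beam_candidates,
        key=lambda cand: sentence_bleu([reference_tokens], cand,
                                       smoothing_function=smooth),
    )
```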
Experimental Evaluation
Experiments are conducted on the NIST Chinese→English and WMT'14 English→German translation tasks. The proposed method consistently outperforms existing approaches, including scheduled sampling and sentence-level optimization strategies such as MIXER, and delivers significant BLEU improvements across datasets and NMT architectures, covering both RNN-based and Transformer models.
The results indicate that the approach effectively mitigates exposure bias and improves sentence-level translation quality. Notably, the sentence-level oracle yields clearer gains than the word-level alternative, underscoring the value of sentence-wide evaluation in NMT training.
Implications and Future Perspective
The findings have both practical and theoretical implications. Practically, the method enhances translation quality without substantial changes to existing NMT architectures, making it feasible to integrate into standard deployments. Theoretically, it provides insights into addressing exposure bias through dynamic sampling and oracle-based learning, suggesting new directions for further exploration in sequence generation tasks.
Looking forward, the approach could carry over to other sequence generation and sequential decision-making tasks, such as dialogue systems and automatic summarization, as well as to reinforcement learning settings where a similar drift between training and deployment conditions arises.
The paper makes a tangible contribution to reducing the training-inference gap in NMT, providing a robust framework that could be extended or modified for application in a wider array of language translation tasks and other sequence prediction problems.