Bridging the Gap between Training and Inference for Neural Machine Translation (1906.02448v2)

Published 6 Jun 2019 in cs.CL, cs.LG, and stat.ML

Abstract: Neural Machine Translation (NMT) generates target words sequentially in the way of predicting the next word conditioned on the context words. At training time, it predicts with the ground truth words as context, while at inference it has to generate the entire sequence from scratch. This discrepancy of the fed context leads to error accumulation along the way. Furthermore, word-level training requires strict matching between the generated sequence and the ground truth sequence, which leads to overcorrection of different but reasonable translations. In this paper, we address these issues by sampling context words not only from the ground truth sequence but also from the sequence predicted by the model during training, where the predicted sequence is selected with a sentence-level optimum. Experimental results on Chinese→English and WMT'14 English→German translation tasks demonstrate that our approach can achieve significant improvements on multiple datasets.

Bridging the Gap between Training and Inference for Neural Machine Translation

The paper presents a novel approach to addressing exposure bias in Neural Machine Translation (NMT), an area where the mismatch between training and inference stages often leads to performance degradation. Traditional NMT models predict target words sequentially, relying on ground truth context during training, but at inference, they must generate sequences from scratch. This discrepancy results in exposure bias, leading to error accumulation in generated sequences.

Proposed Method

The authors propose sampling context words during training not only from the ground truth sequence but also from the sequence predicted by the model, which narrows the gap between training and inference. A notable aspect of the method is sentence-level oracle selection, which lets the model accommodate alternative yet reasonable translations and thereby addresses the overcorrection caused by strict word-level matching.
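
The core training-time change can be illustrated with a short sketch. This is a minimal illustration, not the authors' code: the sigmoid-like decay schedule and the names `ground_truth_prob`, `mixed_context`, and `mu` are assumptions made for clarity, and the paper's exact schedule and implementation details may differ.

```python
import math
import random

def ground_truth_prob(epoch, mu=12.0):
    # Assumed decay schedule: the chance of feeding the ground-truth word
    # starts near 1 and shrinks as training progresses, so later epochs
    # increasingly resemble inference, where no reference is available.
    return mu / (mu + math.exp(epoch / mu))

def mixed_context(reference_tokens, oracle_tokens, epoch):
    # At each target position, the previous context word is drawn either from
    # the reference sentence or from the oracle (model-predicted) sentence.
    p = ground_truth_prob(epoch)
    return [
        ref if random.random() < p else oracle
        for ref, oracle in zip(reference_tokens, oracle_tokens)
    ]
```

Under this assumed schedule with `mu=12.0`, the probability of using the reference word is about 0.92 at epoch 0 and falls below 0.5 around epoch 30, gradually exposing the model to its own predictions.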

Key elements of the approach include:

  • Oracle Word Selection:
    • Word-Level Oracle: Performs a greedy selection at each decoding step, choosing the oracle word from the model's predicted distribution (sketched after this list).
    • Sentence-Level Oracle: Uses beam search together with a sentence-level metric such as BLEU to pick an oracle sentence, giving the model the flexibility to match alternative translations and recover from overcorrection (also sketched after this list).
  • Sampling with Decay: A dynamic sampling schedule, as in the sketch above, decreases the probability of drawing context words from the ground truth as training progresses, so the model increasingly learns under conditions that resemble inference.
  • Gumbel-Max Technique: Adds stochastic Gumbel noise to the predicted word distribution so that oracle words are sampled more robustly than by a deterministic argmax.
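
The two oracle variants can be sketched as follows. First, word-level oracle selection with the Gumbel-Max trick; this is an illustrative sketch, and `gumbel_max_oracle` and its temperature default are assumptions rather than the paper's settings.

```python
import numpy as np

def gumbel_max_oracle(logits, temperature=1.0, rng=None):
    # Gumbel-Max trick: adding Gumbel(0, 1) noise to the temperature-scaled
    # logits and taking the argmax is equivalent to sampling from the
    # corresponding softmax distribution, which perturbs the oracle word
    # choice instead of always taking the single most probable word.
    rng = rng or np.random.default_rng()
    gumbel_noise = -np.log(-np.log(rng.uniform(size=logits.shape)))
    return int(np.argmax(logits / temperature + gumbel_noise))
```

Second, sentence-level oracle selection: among candidate translations produced by beam search, the one with the best sentence-level BLEU against the reference supplies the oracle context words. The sketch below is an assumption-laden illustration that uses NLTK's BLEU implementation and takes token lists as input; for per-position sampling, the candidates are typically constrained to the same length as the reference.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def sentence_level_oracle(beam_candidates, reference):
    # Pick the beam-search candidate with the highest sentence-level BLEU
    # against the reference; its words then serve as oracle context words.
    smooth = SmoothingFunction().method1
    return max(
        beam_candidates,
        key=lambda cand: sentence_bleu([reference], cand, smoothing_function=smooth),
    )
```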

Experimental Evaluation

Experiments are conducted on the NIST Chinese→English and WMT'14 English→German translation tasks, with strong numerical results showing that the proposed method outperforms existing approaches, including scheduled sampling and sentence-level optimization strategies such as MIXER. The approach yields significant BLEU improvements across datasets and NMT architectures, covering both RNN-based and Transformer models.

The authors' results indicate that the approach effectively mitigates exposure bias and improves sentence-level translation quality. Notably, the sentence-level oracle shows clearer benefits than the word-level alternative, underscoring the value of sentence-wide evaluation in NMT training.

Implications and Future Perspective

The findings have both practical and theoretical implications. Practically, the method enhances translation quality without substantial changes to existing NMT architectures, making it feasible to integrate into standard deployments. Theoretically, it provides insights into addressing exposure bias through dynamic sampling and oracle-based learning, suggesting new directions for further exploration in sequence generation tasks.

Looking forward, the implications of this research could permeate other areas of AI involving sequential decision-making tasks, potentially leading to advancements beyond machine translation, in areas such as dialogue systems, automatic summarization, and even reinforcement learning paradigms where similar drift between training and application phases may occur.

The paper makes a tangible contribution to reducing the training-inference gap in NMT, providing a robust framework that could be extended or modified for application in a wider array of language translation tasks and other sequence prediction problems.

Authors (5)
  1. Wen Zhang (170 papers)
  2. Yang Feng (230 papers)
  3. Fandong Meng (174 papers)
  4. Di You (13 papers)
  5. Qun Liu (230 papers)
Citations (238)