Non-Autoregressive Neural Machine Translation (1711.02281v2)

Published 7 Nov 2017 in cs.CL and cs.LG

Abstract: Existing approaches to neural machine translation condition each output word on previously generated outputs. We introduce a model that avoids this autoregressive property and produces its outputs in parallel, allowing an order of magnitude lower latency during inference. Through knowledge distillation, the use of input token fertilities as a latent variable, and policy gradient fine-tuning, we achieve this at a cost of as little as 2.0 BLEU points relative to the autoregressive Transformer network used as a teacher. We demonstrate substantial cumulative improvements associated with each of the three aspects of our training strategy, and validate our approach on IWSLT 2016 English-German and two WMT language pairs. By sampling fertilities in parallel at inference time, our non-autoregressive model achieves near-state-of-the-art performance of 29.8 BLEU on WMT 2016 English-Romanian.

Non-Autoregressive Neural Machine Translation: A Detailed Overview

The paper "Non-Autoregressive Neural Machine Translation," authored by Jiatao Gu et al., proposes a novel approach to neural machine translation (NMT) that addresses the latency issue inherent in autoregressive models. By introducing a non-autoregressive model based on the Transformer architecture, the authors exploit parallelism, achieving significantly lower inference latency while maintaining competitive accuracy.

Introduction and Motivation

Existing state-of-the-art neural machine translation models rely on autoregressive decoding, which generates each token conditioned on previously generated tokens. This sequential generation process is non-parallelizable, resulting in higher latency, especially when compared to traditional statistical machine translation methods. The autoregressive nature is a key bottleneck, preventing efficient utilization of modern parallel computation resources during inference.
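
To make the contrast concrete, the sketch below uses random logits as a stand-in for a trained model (the names `step_logits`, `all_logits`, and both decode functions are hypothetical, not the paper's code). It shows why autoregressive decoding must run a sequential loop, one token at a time, while a non-autoregressive model can fill every output position in a single parallel pass.

```python
# Minimal sketch contrasting autoregressive and non-autoregressive decoding.
# Random logits stand in for a trained model's forward passes.
import numpy as np

VOCAB, EOS = 8, 0
rng = np.random.default_rng(0)

def step_logits(prefix):
    """Autoregressive step: next-token logits given the prefix (mocked)."""
    return rng.standard_normal(VOCAB)

def all_logits(src_len, tgt_len):
    """Non-autoregressive pass: logits for every position at once (mocked)."""
    return rng.standard_normal((tgt_len, VOCAB))

def autoregressive_decode(max_len=10):
    # O(T) sequential steps: each token must wait for the previous one.
    out = []
    for _ in range(max_len):
        tok = int(np.argmax(step_logits(out)))
        if tok == EOS:
            break
        out.append(tok)
    return out

def non_autoregressive_decode(src_len=6, tgt_len=10):
    # One parallel pass: all positions are predicted simultaneously.
    return np.argmax(all_logits(src_len, tgt_len), axis=-1).tolist()

print(autoregressive_decode())
print(non_autoregressive_decode())
```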

Proposed Model: Non-Autoregressive Transformer (NAT)

The crux of the proposed approach lies in its fundamental departure from autoregressive decoding. The Non-Autoregressive Transformer (NAT) introduced in this paper eliminates the sequential dependency of token generation, allowing for parallel output production. This paradigm shift is achieved through three primary innovations:

  1. Knowledge Distillation: A teacher-student framework is employed, where an autoregressive model (the teacher) generates translations that are used to train the NAT (the student). This distillation process alleviates the multimodality problem by providing deterministic and less noisy targets.
  2. Fertility Prediction: The concept of "fertility," the number of output tokens aligned to each input token, is introduced as a latent variable. Fertilities are predicted by a dedicated network and determine how the source tokens are copied to form the NAT decoder inputs, allowing the model to handle multimodality more effectively (a minimal sketch of this copying step follows the list).
  3. Policy Gradient Fine-Tuning: To refine the model, policy gradient methods (REINFORCE) are used to fine-tune the NAT, leveraging the pre-trained autoregressive model as a reward predictor.
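
As a concrete illustration of the fertility mechanism in item 2, the following sketch (hypothetical array shapes and names, not the authors' implementation) shows how predicted fertilities shape the decoder inputs: each source embedding is copied as many times as its fertility, so the target length equals the sum of the fertilities.

```python
# A minimal sketch of fertility-based input copying, assuming we already have
# source embeddings and integer fertility predictions.
import numpy as np

def copy_by_fertility(src_embeddings, fertilities):
    """Repeat each source embedding `fertility` times along the time axis."""
    assert len(src_embeddings) == len(fertilities)
    return np.repeat(src_embeddings, fertilities, axis=0)

src = np.arange(4 * 3, dtype=float).reshape(4, 3)  # 4 source tokens, dim 3
fert = np.array([1, 0, 2, 1])                      # predicted fertilities
dec_inputs = copy_by_fertility(src, fert)
print(dec_inputs.shape)                            # (4, 3): fertilities sum to 4
```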

Experimental Results and Evaluation

The performance of the NAT model was rigorously evaluated on standard translation datasets—including IWSLT 2016 English-German and WMT 2016 English-Romanian—demonstrating the effectiveness of the approach. Key findings include:

  • Translation Quality: The NAT model achieves nearly state-of-the-art performance on WMT 2016 English-Romanian with a BLEU score of 29.8, only 2.0 BLEU points below the autoregressive baseline.
  • Inference Speed: Parallel decoding yields a substantial reduction in inference latency; the NAT model's latency is roughly one-tenth that of autoregressive decoding, making it well suited to real-time translation.
  • Decoding Strategies: The paper explores multiple decoding strategies, such as argmax decoding, average decoding, and Noisy Parallel Decoding (NPD). NPD samples multiple fertility sequences, decodes each in parallel, and uses the autoregressive teacher to select the best candidate, trading a modest amount of extra computation for higher quality (see the sketch after this list).
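
The sketch below (mock decoding and scoring functions, hypothetical names) illustrates the NPD idea referenced above: draw several fertility sequences, decode each one in a single parallel pass, and let the autoregressive teacher pick the highest-scoring candidate.

```python
# A minimal sketch of Noisy Parallel Decoding; the NAT pass, fertility sampler,
# and teacher scorer are mocked with random stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def sample_fertilities(src_len, n_samples):
    # Stand-in for sampling from the fertility predictor's distribution.
    return rng.integers(0, 3, size=(n_samples, src_len))

def nat_decode(fertility):
    # Stand-in for one parallel NAT pass conditioned on a fertility sequence.
    return rng.integers(1, 100, size=int(fertility.sum()))

def teacher_score(candidate):
    # Stand-in for the autoregressive teacher's log-probability of `candidate`.
    return -float(len(candidate)) + rng.standard_normal()

def noisy_parallel_decode(src_len=6, n_samples=8):
    candidates = [nat_decode(f) for f in sample_fertilities(src_len, n_samples)]
    return max(candidates, key=teacher_score)

print(noisy_parallel_decode())
```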

Implications and Future Directions

The practical implications of this research are clear: reducing translation latency without a considerable sacrifice in quality benefits numerous real-time applications, including interactive translation services and multilingual communication tools. Theoretically, the introduction of fertility prediction and its successful integration into a non-autoregressive paradigm opens new avenues for addressing sequence generation challenges in other domains.

Future research could explore optimizing the fertility prediction mechanism and further improving the balance between latency and translation quality. Additionally, the potential of extending this framework to other sequence generation tasks, such as speech synthesis and text summarization, is a promising direction for advancing natural language processing technologies.

In conclusion, this paper offers a comprehensive and innovative approach to non-autoregressive neural machine translation. The proposed methods not only push the boundaries of current translation models but also pave the way for future research in efficient and effective sequence generation.

References:

  • Vaswani, A., Shazeer, N., Parmar, N., et al. Attention is All You Need. 2017.
  • Gu, J., Bradbury, J., Xiong, C., Li, V. O. K., & Socher, R. Non-Autoregressive Neural Machine Translation. arXiv:1711.02281v2, 2018.
  • Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N. Convolutional Sequence to Sequence Learning. ICML 2017.
Authors (5)
  1. Jiatao Gu (84 papers)
  2. James Bradbury (20 papers)
  3. Caiming Xiong (337 papers)
  4. Victor O. K. Li (56 papers)
  5. Richard Socher (115 papers)
Citations (769)