- The paper introduces the DisCo Transformer architecture that uses attention masking to enable parallel token prediction, significantly reducing latency compared to autoregressive models.
- It presents a novel easy-first inference algorithm that predicts all tokens concurrently and refines output quality in fewer iterations than conventional iterative refinement.
- Empirical results across seven translation directions demonstrate competitive performance and scalability, particularly with large bitext datasets.
An Evaluation of the Disentangled Context Transformer Model for Non-Autoregressive Machine Translation
The paper presents the Disentangled Context (DisCo) Transformer, a novel architecture for non-autoregressive machine translation. Traditional neural machine translation models predict tokens sequentially, from left to right, based on the preceding tokens. This autoregressive approach suffers from latency issues due to its inherently sequential nature. Non-autoregressive translation (NAT) models offer parallelism, predicting multiple tokens simultaneously, but they often exhibit degraded performance.
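To make the latency contrast concrete, here is a minimal Python sketch of the two decoding regimes. The `model` object and its `predict_next` / `predict_all` methods are hypothetical stand-ins used only for illustration, not an interface from the paper.

```python
# Hypothetical model interface used only for illustration.

def autoregressive_decode(model, src, max_len, bos_id, eos_id):
    """Sequential decoding: one forward pass per generated token."""
    tgt = [bos_id]
    for _ in range(max_len):
        next_token = model.predict_next(src, tgt)  # conditions on all previous tokens
        tgt.append(next_token)
        if next_token == eos_id:
            break
    return tgt[1:]


def non_autoregressive_decode(model, src, tgt_len):
    """Parallel decoding: all target positions are predicted in a single pass."""
    return model.predict_all(src, tgt_len)  # positions predicted simultaneously
```

The sequential loop requires as many forward passes as output tokens, while the parallel variant needs only one (or a small, fixed number of refinement passes), which is the source of the latency savings.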
DisCo Transformer Architecture and Objective
The DisCo Transformer employs attention masking to predict each word in a sentence based on an arbitrary subset of the other words. The model seeks to overcome the inefficiency of conditional masked language models (CMLMs), which are restricted to predicting only the masked tokens. By training the DisCo Transformer to predict every word given varied contexts, the model achieves faster inference and improved performance, especially with larger datasets.
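As a rough illustration of per-token context masks (an interpretation of the idea, not the paper's implementation), the sketch below builds a boolean mask in which each position is allowed to observe an arbitrary subset of the other positions, so every token becomes a prediction target rather than only the masked ones.

```python
import torch

def random_context_mask(seq_len: int) -> torch.Tensor:
    """Boolean [seq_len, seq_len] mask; mask[i, j] = True means position i
    may observe token j when predicting token i."""
    mask = torch.rand(seq_len, seq_len) < 0.5  # arbitrary observed subset per position
    mask.fill_diagonal_(False)                 # a token never observes itself
    return mask

print(random_context_mask(6).int())
```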
The paper introduces the DisCo objective, which trains the model to predict every token given an arbitrary (permuted) subset of the other tokens as context, generalizing the masked-token objective of CMLMs and providing flexibility in training. The DisCo architecture modifies the attention mechanism by introducing contextless keys and values, allowing conditional probabilities under many different contexts to be computed efficiently in a single pass.
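A hedged sketch of how contextless keys and values might look is given below: keys and values are computed directly from the (position-aware) token embeddings, while the queries carry the context, so one attention pass can score each token under its own masked context. The names, shapes, and single-head formulation are illustrative assumptions, not the released DisCo code.

```python
import torch
import torch.nn.functional as F

def attention_with_contextless_kv(queries, token_embeddings, context_mask,
                                  w_q, w_k, w_v):
    """Single-head attention sketch (illustrative only).

    queries:          [seq_len, d]  context-dependent hidden states
    token_embeddings: [seq_len, d]  contextless per-token embeddings
    context_mask:     [seq_len, seq_len] boolean; True = attention allowed
                      (each row is assumed to allow at least one position)
    """
    q = queries @ w_q
    k = token_embeddings @ w_k   # contextless keys
    v = token_embeddings @ w_v   # contextless values
    scores = (q @ k.T) / k.shape[-1] ** 0.5
    scores = scores.masked_fill(~context_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

Because the keys and values do not depend on which context is chosen, changing the context only changes the mask, which is what permits many conditionals to share one forward computation.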
Innovative Inference Techniques
Alongside the DisCo architecture, a parallel easy-first inference algorithm is proposed. This method predicts all tokens concurrently and uses the model's own confidence to refine outputs in fewer iterations, in contrast to conventional methods such as mask-predict, which iteratively remask and re-predict low-confidence tokens.
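One way to picture the parallel easy-first loop is sketched below, under the assumption of a simple model interface with hypothetical `predict_all` and `repredict` methods (not the authors' code): every iteration re-predicts all positions at once, and each position conditions only on the tokens that the previous iteration predicted with higher confidence.

```python
import torch

def parallel_easy_first_decode(model, src, tgt_len, n_iters=4):
    """Sketch of easy-first refinement; `model` is a hypothetical interface."""
    tokens, conf = model.predict_all(src, tgt_len)   # initial fully parallel guess
    for _ in range(n_iters - 1):
        order = conf.argsort(descending=True)        # most confident = "easiest"
        rank = torch.empty_like(order)
        rank[order] = torch.arange(tgt_len)          # rank 0 = easiest token
        # Position i may observe position j only if j was predicted more confidently.
        context_mask = rank.unsqueeze(0) < rank.unsqueeze(1)
        tokens, conf = model.repredict(src, tokens, context_mask)
    return tokens
```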
Empirical Evaluation
The DisCo approach and its inference method demonstrate competitive performance against state-of-the-art autoregressive and NAT models across seven translation directions, with significant reductions in average decoding time. Experiments reveal that DisCo performs notably well with large bitext data, suggesting scalability benefits from its design.
Implications and Future Directions
The DisCo Transformer presents a promising direction for enhancing NAT models, balancing the trade-off between parallel decoding and translation quality. It also opens avenues for broader applications in efficient general-purpose representation learning, particularly leveraging its ability to condition predictions flexibly on permuted subsets of input data. Future research could explore integrating the DisCo architecture with other NAT improvements and extend its application to diverse language processing tasks.
The findings underline the potential of non-autoregressive approaches as viable alternatives to autoregressive models, though challenges remain in closing the remaining quality gap, particularly as data scale and complexity grow. Researchers are encouraged to build upon the scalable inference and training efficiency the DisCo Transformer demonstrates, pushing machine translation toward more computationally economical methods.
Overall, the DisCo Transformer introduces novel architectural and algorithmic strategies that advance the non-autoregressive machine translation domain, and the results provide key insights into effective parallel processing models.