Mask-Predict: Parallel Decoding of Conditional Masked Language Models (1904.09324v2)

Published 19 Apr 2019 in cs.CL, cs.AI, cs.LG, and stat.ML

Abstract: Most machine translation systems generate text autoregressively from left to right. We, instead, use a masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target translation. This approach allows for efficient iterative decoding, where we first predict all of the target words non-autoregressively, and then repeatedly mask out and regenerate the subset of words that the model is least confident about. By applying this strategy for a constant number of iterations, our model improves state-of-the-art performance levels for non-autoregressive and parallel decoding translation models by over 4 BLEU on average. It is also able to reach within about 1 BLEU point of a typical left-to-right transformer model, while decoding significantly faster.

Citations (35)

Summary

Mask-Predict: A Novel Approach for Parallel Decoding in Machine Translation

The paper "Mask-Predict: Parallel Decoding of Conditional Masked LLMs" presents an innovative approach to accelerate machine translation by leveraging masked LLMs for parallel decoding. This research primarily investigates the limitations of autoregressive text generation and proposes a conditional masked LLM (CMLM) to enhance translation efficiency without significantly compromising performance.

Key Contributions

The authors introduce a parallel decoding algorithm, termed "mask-predict," that capitalizes on the properties of CMLMs. This methodology contrasts with traditional autoregressive models, which rely on sequential left-to-right word prediction. Key innovations and contributions include:

  1. Conditional Masked Language Models (CMLMs): The authors adapt the sequence-to-sequence architecture with a masked language modeling objective. Unlike conventional models constrained to left-to-right generation, CMLMs predict arbitrary subsets of target words in parallel, leveraging bidirectional context for improved accuracy.
  2. Mask-Predict Algorithm: Central to the paper, this algorithm incrementally improves translation outputs by iteratively predicting masked segments of the target sequence (see the sketch after this list). Initially, the whole target sequence is masked, and the model predicts all words simultaneously. In subsequent iterations, only the least confidently predicted words are re-masked and regenerated. This iterative refinement with a fixed number of cycles achieves state-of-the-art performance for non-autoregressive models.
  3. Empirical Performance Enhancements: The mask-predict decoding outperforms existing non-autoregressive approaches by more than 4 BLEU points and nearly matches the quality of autoregressive models with merely 1 BLEU point discrepancy on average. Specifically, it achieves a 4-5 BLEU point improvement on the WMT'14 English-German task and up to 3 BLEU points on the WMT'16 English-Romanian task, with considerable gains in decoding speed.
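
To make the decoding procedure concrete, the following is a minimal, single-sentence sketch of the mask-predict loop. It is an illustration under simplifying assumptions rather than the authors' released implementation: `cmlm` is a hypothetical callable returning per-position log-probabilities, the target length is taken as given (the paper predicts it from the encoder), and the number of re-masked tokens decays linearly across iterations.

```python
import torch

# Hypothetical mask-token id; the real id depends on the vocabulary used.
MASK_ID = 4

def mask_predict(cmlm, src_tokens, tgt_len, iterations=10):
    """Minimal sketch of mask-predict decoding for a single sentence.

    `cmlm(src_tokens, tgt_tokens)` is assumed to return per-position
    log-probabilities of shape (tgt_len, vocab_size).
    """
    # Iteration 0: mask every target position and predict all words in parallel.
    tgt = torch.full((tgt_len,), MASK_ID, dtype=torch.long)
    log_probs = cmlm(src_tokens, tgt)
    scores, tgt = log_probs.max(dim=-1)  # per-token confidences and argmax tokens

    for t in range(1, iterations):
        # Linearly decaying schedule: re-mask the n least-confident tokens.
        n = int(tgt_len * (iterations - t) / iterations)
        if n == 0:
            break
        worst = scores.topk(n, largest=False).indices
        masked = tgt.clone()
        masked[worst] = MASK_ID

        # Re-predict only the masked positions, conditioning on the source
        # and on the tokens that were kept.
        log_probs = cmlm(src_tokens, masked)
        new_scores, new_tokens = log_probs.max(dim=-1)
        tgt[worst] = new_tokens[worst]
        scores[worst] = new_scores[worst]

    return tgt
```

Because the number of iterations is a constant, the amount of sequential computation does not grow with the target length, which is the source of the speedup over left-to-right decoding.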

Experimental Methodology and Insights

Evaluation across several machine translation benchmarks (WMT'14 EN-DE, WMT'16 EN-RO, and WMT'17 EN-ZH) demonstrated both the efficiency and the quality of mask-predict. Extensive experiments revealed:

  • Iteration Impact: Just four iterations of mask-predict were enough to surpass the scores of the best prior non-autoregressive models.
  • Translation Length and Performance: Although the number of iterations is held constant regardless of sequence length, translation quality remained robust across varying lengths; in practice, decoding was more than three times faster than the autoregressive baseline at only a slight cost in quality.
  • Role of Model Distillation: Consistent with prior work, model distillation—training on outputs from a pre-trained autoregressive model—proved crucial for optimizing CMLM performance.
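
As a rough illustration of this distillation setup (a sketch, not the authors' exact pipeline), the reference targets are replaced with translations produced by a pre-trained autoregressive teacher before training the CMLM; the `teacher.translate` interface below is hypothetical.

```python
def build_distillation_corpus(teacher, source_sentences, beam_size=5):
    """Sketch of sequence-level knowledge distillation for CMLM training.

    The teacher is a pre-trained autoregressive translation model whose
    beam-search outputs replace the reference translations as training
    targets. `teacher.translate` is a hypothetical interface, not a
    specific library call.
    """
    distilled_pairs = []
    for src in source_sentences:
        hyp = teacher.translate(src, beam_size=beam_size)  # teacher's best hypothesis
        distilled_pairs.append((src, hyp))
    return distilled_pairs
```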

Theoretical and Practical Implications

The research delineates a path forward for efficient and effective non-autoregressive machine translation, with significant implications:

  • Theoretical Advancement: This work posits that masked language models can serve not only in representation and understanding tasks but also as a formidable foundation for text generation.
  • Practical Application: The remarkable speed gains offered by CMLMs make them highly suitable for real-time translation applications, where latency is a critical constraint.

Future Directions

Further exploration could address remaining limitations, such as the dependency on predicting the target sequence length and the reliance on distillation, as well as improvements in decoding flexibility. Moreover, extending this approach to other conditional sequence generation domains, such as dialogue systems or summarization, could bring similar parallel decoding efficiencies to those tasks.
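
For context on the length dependency: the target length must be fixed before the first parallel pass, so the paper predicts it from the encoder and decodes several candidate lengths in parallel, keeping the hypothesis with the highest average token log-probability. The sketch below illustrates that re-ranking step; the `mask_predict_scored` helper is hypothetical and assumed to return the decoded tokens along with their per-token log-probabilities.

```python
def decode_over_length_candidates(cmlm, src_tokens, length_candidates, iterations=10):
    """Run mask-predict once per candidate target length and keep the
    hypothesis with the highest average token log-probability."""
    best_tokens, best_score = None, float("-inf")
    for tgt_len in length_candidates:
        # `mask_predict_scored` (hypothetical) returns tokens and per-token log-probs.
        tokens, token_log_probs = mask_predict_scored(cmlm, src_tokens, tgt_len, iterations)
        avg_log_prob = token_log_probs.mean().item()
        if avg_log_prob > best_score:
            best_tokens, best_score = tokens, avg_log_prob
    return best_tokens
```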

In conclusion, the "Mask-Predict" approach marks a substantial step forward in machine translation, striking a balance between speed and accuracy while charting a promising direction for parallel text generation methodologies.
