- The paper introduces the DisCo Transformer architecture that uses attention masking to enable parallel token prediction, significantly reducing latency compared to autoregressive models.
- It presents a novel easy-first inference algorithm that predicts all tokens concurrently and refines output quality in fewer iterations than conventional iterative refinement.
- Empirical results across seven translation directions demonstrate competitive performance and scalability, particularly with large bitext datasets.
An Evaluation of the Disentangled Context Transformer Model for Non-Autoregressive Machine Translation
The paper presents the Disentangled Context (DisCo) Transformer, a novel architecture for non-autoregressive machine translation. Traditional neural machine translation models predict tokens sequentially, from left to right, based on the preceding tokens. This autoregressive approach suffers from latency issues due to its inherently sequential nature. Non-autoregressive translation (NAT) models offer parallelism, predicting multiple tokens simultaneously, but they often exhibit degraded performance.
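To make the latency contrast concrete, here is a minimal Python sketch of the two decoding regimes. The `model` object and its `predict_next` / `predict_all` methods are hypothetical stand-ins used only for illustration, not an interface from the paper.

```python
# Hypothetical model interface used only for illustration.

def autoregressive_decode(model, src, max_len, bos_id, eos_id):
    """Sequential decoding: one forward pass per generated token."""
    tgt = [bos_id]
    for _ in range(max_len):
        next_token = model.predict_next(src, tgt)  # conditions on all previous tokens
        tgt.append(next_token)
        if next_token == eos_id:
            break
    return tgt[1:]


def non_autoregressive_decode(model, src, tgt_len):
    """Parallel decoding: all target positions are predicted in a single pass."""
    return model.predict_all(src, tgt_len)  # positions predicted simultaneously
```

The sequential loop requires as many forward passes as output tokens, while the parallel variant needs only one (or a small, fixed number of refinement passes), which is the source of the latency savings.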
DisCo Transformer Architecture and Objective
The DisCo Transformer employs attention masking to predict each word in a sentence based on an arbitrary subset of the other words. The model seeks to overcome the inefficiency of conditional masked language models (CMLMs), which are restricted to predicting only the masked tokens. By training the DisCo Transformer to predict every word given varied contexts, the model achieves faster inference and improved performance, especially with larger datasets.
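As a rough illustration of per-token context masks (an interpretation of the idea, not the paper's implementation), the sketch below builds a boolean mask in which each position is allowed to observe an arbitrary subset of the other positions, so every token becomes a prediction target rather than only the masked ones.

```python
import torch

def random_context_mask(seq_len: int) -> torch.Tensor:
    """Boolean [seq_len, seq_len] mask; mask[i, j] = True means position i
    may observe token j when predicting token i."""
    mask = torch.rand(seq_len, seq_len) < 0.5  # arbitrary observed subset per position
    mask.fill_diagonal_(False)                 # a token never observes itself
    return mask

print(random_context_mask(6).int())
```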
The paper introduces the DisCo objective, which trains the model to predict every token given an arbitrary (permuted) subset of the other tokens as context, generalizing the masked-token objective of CMLMs and providing flexibility in training. The DisCo architecture modifies the attention mechanism by introducing contextless keys and values, allowing conditional probabilities under many different contexts to be computed efficiently in a single pass.
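A hedged sketch of how contextless keys and values might look is given below: keys and values are computed directly from the (position-aware) token embeddings, while the queries carry the context, so one attention pass can score each token under its own masked context. The names, shapes, and single-head formulation are illustrative assumptions, not the released DisCo code.

```python
import torch
import torch.nn.functional as F

def attention_with_contextless_kv(queries, token_embeddings, context_mask,
                                  w_q, w_k, w_v):
    """Single-head attention sketch (illustrative only).

    queries:          [seq_len, d]  context-dependent hidden states
    token_embeddings: [seq_len, d]  contextless per-token embeddings
    context_mask:     [seq_len, seq_len] boolean; True = attention allowed
                      (each row is assumed to allow at least one position)
    """
    q = queries @ w_q
    k = token_embeddings @ w_k   # contextless keys
    v = token_embeddings @ w_v   # contextless values
    scores = (q @ k.T) / k.shape[-1] ** 0.5
    scores = scores.masked_fill(~context_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

Because the keys and values do not depend on which context is chosen, changing the context only changes the mask, which is what permits many conditionals to share one forward computation.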
Innovative Inference Techniques
Alongside the DisCo architecture, a parallel easy-first inference algorithm is proposed. This method predicts all tokens concurrently and uses the model's own confidence to refine outputs in fewer iterations, in contrast to conventional methods such as mask-predict, which iteratively remask and re-predict low-confidence tokens.
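One way to picture the parallel easy-first loop is sketched below, under the assumption of a simple model interface with hypothetical `predict_all` and `repredict` methods (not the authors' code): every iteration re-predicts all positions at once, and each position conditions only on the tokens that the previous iteration predicted with higher confidence.

```python
import torch

def parallel_easy_first_decode(model, src, tgt_len, n_iters=4):
    """Sketch of easy-first refinement; `model` is a hypothetical interface."""
    tokens, conf = model.predict_all(src, tgt_len)   # initial fully parallel guess
    for _ in range(n_iters - 1):
        order = conf.argsort(descending=True)        # most confident = "easiest"
        rank = torch.empty_like(order)
        rank[order] = torch.arange(tgt_len)          # rank 0 = easiest token
        # Position i may observe position j only if j was predicted more confidently.
        context_mask = rank.unsqueeze(0) < rank.unsqueeze(1)
        tokens, conf = model.repredict(src, tokens, context_mask)
    return tokens
```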
Empirical Evaluation
The DisCo approach and its inference method demonstrate competitive performance against state-of-the-art autoregressive and NAT models across seven translation directions, with significant reductions in average decoding time. Experiments reveal that DisCo performs notably well with large bitext data, suggesting scalability benefits from its design.
Implications and Future Directions
The DisCo Transformer presents a promising direction for enhancing NAT models, balancing the trade-off between parallel decoding and translation quality. It also opens avenues for broader applications in efficient general-purpose representation learning, particularly leveraging its ability to condition predictions flexibly on permuted subsets of input data. Future research could explore integrating the DisCo architecture with other NAT improvements and extend its application to diverse language processing tasks.
The findings underline the potential of non-autoregressive approaches as viable alternatives to autoregressive models, though challenges remain in closing the remaining quality gap, particularly as data scale and complexity grow. Researchers are encouraged to build upon the scalable inference and training efficiency the DisCo Transformer demonstrates, pushing machine translation toward more computationally economical methods.
Overall, the DisCo Transformer introduces novel architectural and algorithmic strategies that advance the non-autoregressive machine translation domain, and the results provide key insights into effective parallel processing models.