Non-Autoregressive & Masked Prediction
- Non-autoregressive and masked prediction is a modeling strategy that generates tokens in parallel by inferring missing elements, enabling faster generation.
- It employs iterative refinement by re-masking low-confidence tokens to enhance prediction coherence in applications like machine translation and ASR.
- Specialized loss functions such as AXE and partition-based methods improve alignment and quality, balancing efficiency with output accuracy.
Non-autoregressive and masked prediction refers to a family of modeling and inference strategies in which sequence (or structured) generation proceeds in parallel without strictly conditioning each prediction on the previous output. Instead, missing or masked elements are inferred simultaneously or in iterative rounds, supported by a model trained to fill in arbitrary subsets of missing content. This approach stands in contrast to the traditional autoregressive paradigm, which generates each output in a strict sequential (e.g., left-to-right) order, conditioning on all previously generated outputs. Recent research demonstrates that non-autoregressive and masked prediction can dramatically accelerate inference while attaining competitive or state-of-the-art accuracy across machine translation, speech recognition, image/video generation, and more, by leveraging iterative masked refinement, bi-directional context, and specialized loss functions or training schemes.
1. Fundamental Principles and Architectural Variants
Non-autoregressive models eschew strict sequential dependence in favor of predicting target tokens (or structured outputs) in parallel. The prototypical architecture is the conditional masked language model (CMLM), which, in its canonical instantiation (Ghazvininejad et al., 2019), employs a Transformer encoder to obtain source representations and a bi-directional (i.e., causally unmasked) Transformer decoder for target sequence prediction:
- The decoder predicts each masked token independently, attending to both past and future context within the sequence and to the source.
- This architecture facilitates parallel prediction and enables efficient utilization of hardware accelerators.
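As a toy illustration of this parallel, position-independent prediction, the sketch below fills every masked slot in a single step and records a per-position confidence. It is a minimal sketch in pure Python: the trained decoder is replaced by a hypothetical table of per-position logits, so only the decoding logic is shown.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def parallel_predict(logits_per_position, target, mask_token="<mask>"):
    """Fill every masked position in one parallel step.

    `logits_per_position[i]` holds (token, logit) pairs standing in for a
    bidirectional decoder's output at position i. Each masked slot is
    scored independently, so the argmax at one slot depends on the other
    slots only through the (fixed) logits -- the hallmark of NAR decoding.
    """
    filled, confidences = list(target), {}
    for i, tok in enumerate(target):
        if tok == mask_token:
            tokens, logits = zip(*logits_per_position[i])
            probs = softmax(list(logits))
            best = max(range(len(tokens)), key=lambda j: probs[j])
            filled[i] = tokens[best]
            confidences[i] = probs[best]
    return filled, confidences
```

The returned confidences are exactly what iterative refinement later uses to decide which positions to re-mask.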
Variants extend to multimodal domains:
- Image captioning models (Gao et al., 2019) adapt the encoder for visual features and inherit the non-causal masking of the decoder.
- Non-autoregressive ASR systems (Chen et al., 2019, Higuchi et al., 2020, Futami et al., 2022) combine acoustic encoders with masked language modeling for output token inference.
- Non-autoregressive predictive coding (Liu et al., 2020) employs Masked Convolution Blocks to restrict dependencies to local context.
- Video and audio generative transformers (Gupta et al., 2022, Ziv et al., 9 Jan 2024, Ma et al., 2023) process discrete tokens obtained via VQ-VAEs in parallel refinement steps.
Recent innovations, such as Partition Generative Models (Deschenaux et al., 24 May 2025), remove the need for explicit MASK tokens by distributing input tokens into groups and using sparse attention to prevent information exchange across partitions, achieving high efficiency.
2. Masked Prediction Objectives and Iterative Refinement
Training is typically based on masked language modeling, adapted for the target domain:
- A random subset of output positions is masked, and the model is trained via cross-entropy loss to predict only the masked entries, conditioning on observed (unmasked) tokens and the source (Ghazvininejad et al., 2019, Gao et al., 2019):

$$\mathcal{L}_{\text{mask}} = -\sum_{y_t \in Y_{\text{mask}}} \log P\!\left(y_t \mid Y_{\text{obs}}, X\right)$$
- For sequence-to-sequence tasks, a prediction for sequence length (e.g., a dedicated LENGTH token) may be included.
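A minimal sketch of this masked training objective, in pure Python: `predict_log_probs` is a hypothetical stand-in for the model's output distribution, and only the masked positions contribute to the loss.

```python
import math, random

def masked_ce_loss(predict_log_probs, target, mask_ratio=0.5, seed=0):
    """Mask a random subset of target positions and average the
    cross-entropy over the masked entries only; unmasked tokens are
    observed context and contribute no loss.

    `predict_log_probs(observed, i)` stands in for the model: it returns
    a dict mapping each vocabulary token to log P(token | observed, i).
    """
    rng = random.Random(seed)
    n_mask = max(1, int(mask_ratio * len(target)))
    masked = set(rng.sample(range(len(target)), n_mask))
    observed = [t if i not in masked else "<mask>" for i, t in enumerate(target)]
    losses = [-predict_log_probs(observed, i)[target[i]] for i in sorted(masked)]
    return sum(losses) / len(losses)
```

In a real CMLM the mask count is itself sampled per example (e.g., uniformly), so the model learns to fill in arbitrary subsets of missing content.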
Inference utilizes iterative refinement (Ghazvininejad et al., 2019):
- Initially, all target positions are masked and filled in parallel.
- At each iteration, the subset of tokens with the lowest confidence (lowest predicted probability $p_i$) is re-masked and updated, allowing uncertain predictions to be refined as context strengthens.
- The number of tokens to re-mask typically decays linearly with each iteration: $n_t = N \cdot \frac{T - t}{T}$, where $N$ is the target length and $T$ the total number of iterations.
- Some systems adopt “easy first” refinement, freezing confident predictions first (Chen et al., 2019).
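The refinement loop above can be sketched as follows. This is a minimal pure-Python version: `fill_fn` is a hypothetical stand-in for a trained CMLM (returning filled tokens and per-position confidences), and a real decoder would also predict the target length.

```python
def mask_predict(fill_fn, length, steps, mask="<mask>"):
    """Mask-Predict-style iterative refinement (a sketch).

    Start fully masked, fill all positions in parallel, then at step t
    re-mask the n_t least-confident tokens, with
    n_t = length * (T - t) / T decaying linearly to zero.
    """
    tokens = [mask] * length
    for t in range(1, steps + 1):
        tokens, probs = fill_fn(tokens)
        n_remask = int(length * (steps - t) / steps)
        if n_remask == 0:
            break
        worst = sorted(range(length), key=lambda i: probs[i])[:n_remask]
        for i in worst:
            tokens[i] = mask
    return tokens
```

With a constant number of steps, the total number of model calls is independent of sequence length, which is where the parallel speedup comes from.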
In computer vision, masked prediction models for video and images recursively "unmask" latent tokens, guided by token confidence or a scheduling function, until all tokens are filled (Gupta et al., 2022, Ma et al., 2023, Hu et al., 9 Dec 2024).
3. Loss Functions and Alignment Techniques
Standard cross-entropy loss in non-autoregressive models can be misaligned with human judgments of output quality (e.g., it penalizes minor position shifts heavily). This motivates specialized objectives:
- Aligned Cross Entropy (AXE) (Ghazvininejad et al., 2020) replaces vanilla cross-entropy with a differentiable dynamic program that minimizes the loss over monotonic alignments between predicted and target tokens, mitigating order sensitivity and reducing harsh penalties for small shifts. The AXE loss is:

$$\mathrm{AXE}(Y, P) = \min_{\alpha \in \mathcal{A}} \left[ -\sum_{i=1}^{|Y|} \log P_{\alpha(i)}(y_i) \;-\; \sum_{j \notin \alpha} \log P_j(\varepsilon) \right]$$

where $\mathcal{A}$ is the set of monotonic alignments from target positions to prediction positions. Here, $\varepsilon$ designates a "blank" token allowing for flexible alignment.
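One way to implement such a monotonic-alignment dynamic program is sketched below, in pure Python over small log-probability tables. This is a simplified reading of the AXE recurrence (align / skip-prediction via blank / skip-target), not the paper's exact formulation.

```python
import math

BLANK = "<eps>"  # blank token for unaligned prediction slots

def axe_loss(log_probs, target):
    """Aligned cross-entropy via dynamic programming (a sketch).

    `log_probs[j][tok]` is the model's log-probability of `tok` at
    prediction slot j. A[i][j] holds the best cost of aligning the
    first i target tokens with the first j prediction slots under a
    monotonic alignment; unaligned slots emit BLANK.
    """
    m, n = len(target), len(log_probs)
    INF = float("inf")
    A = [[INF] * (n + 1) for _ in range(m + 1)]
    A[0][0] = 0.0
    for j in range(1, n + 1):  # leading blanks
        A[0][j] = A[0][j - 1] - log_probs[j - 1][BLANK]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            align = A[i - 1][j - 1] - log_probs[j - 1][target[i - 1]]
            skip_pred = A[i][j - 1] - log_probs[j - 1][BLANK]
            skip_tgt = A[i - 1][j] - log_probs[j - 1][target[i - 1]]
            A[i][j] = min(align, skip_pred, skip_tgt)
    return A[m][n]
```

Because the DP can absorb an off-by-one shift with a single cheap blank, a prediction that is merely shifted pays far less than it would under position-wise cross-entropy.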
Other frameworks incorporate partial masking and partition-based losses (Chao et al., 24 May 2025, Deschenaux et al., 24 May 2025), which either smooth the transition between masked/unmasked states (via sub-tokenization and partial masking) or eliminate the MASK token by training models to predict tokens in one group from context in the other, leveraging custom attention patterns.
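The partition idea can be illustrated by the attention pattern it induces. The sketch below is schematic only (the actual Partition Generative Model involves more machinery): it builds a boolean mask under which no token can attend across the partition boundary.

```python
def partition_attention_mask(groups):
    """Boolean attention mask for a two-group partition (a sketch).

    `groups[i]` in {0, 1} assigns token i to a partition. Position i may
    attend to position j only within its own partition, so neither group
    can leak information to the other; each group's hidden states are
    then used to predict the *other* group's tokens, removing the need
    for an explicit MASK token.
    """
    n = len(groups)
    return [[groups[i] == groups[j] for j in range(n)] for i in range(n)]
```

In practice this sparse pattern means no compute is spent encoding MASK placeholders, which is the source of the efficiency gain.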
4. Efficiency, Trade-offs, and Performance Analysis
Non-autoregressive and masked prediction models yield significant acceleration at inference. Key empirical findings include:
- In machine translation, Mask-Predict (Ghazvininejad et al., 2019) delivers an average +4 BLEU improvement over previous NAR approaches and achieves within 0.5–1.2 BLEU points of a strong AR baseline, while decoding up to 3× faster by predicting entire sequences in a constant number of steps.
- In ASR, models such as Mask CTC (Higuchi et al., 2020) and A-FMLM (Chen et al., 2019) achieve up to 7–9× inference speedups with competitive word/character error rates relative to AR systems.
- In batch and high-resolution applications (video, image, audio), MaskViT (Gupta et al., 2022) enables up to 500× speedup compared to AR models, and MAGNeT (Ziv et al., 9 Jan 2024) achieves 7× lower latency with comparable or improved Fréchet distances and human quality evaluations.
- Masked non-autoregressive decoding in image captioning (Gao et al., 2019) and trajectory prediction (Xue et al., 2020, Chen et al., 2023) shows that semantic content preservation and prediction diversity exceed those of purely AR models.
Trade-offs arise with respect to multimodality and error propagation:
- NAR models predict tokens independently in each iteration, which can yield incoherent outputs (e.g., repeated phrases, inconsistencies). Iterative masked refinement helps collapse modes and produce coherent, globally consistent outputs.
- Training/inference mismatches (exposure bias) may be addressed by semi-autoregressive approaches such as SMART (Ghazvininejad et al., 2020), which incorporates model predictions into the training data to align procedure with test-time dynamics, yielding up to 1 BLEU point improvement.
Recent developments in loss scaling and adaptive masking (e.g., AMOM (Xiao et al., 2023)) demonstrate further accuracy improvements and enhanced convergence speed.
5. Extensions: Theory, Identifiability, and Unified Discrete Modeling
Recent theoretical studies (Liu et al., 2022) have interrogated the parameter identifiability of masked prediction objectives:
- For Hidden Markov Models (HMMs), predicting a single masked token is insufficient for identifiability, but predicting joint statistics (e.g., a tensor product of two tokens given another) ensures recovery of the generative parameters under mild uniqueness assumptions (Kruskal's theorem).
- The design of the masked prediction task, as well as the generative structure (discrete vs. continuous observations), crucially determines whether model parameters are identifiable from the masked-inference solution.
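In symbols, the contrast can be stated roughly as follows (a paraphrase of the HMM setting, with $x_1, x_2, x_3$ denoting consecutive observations):

```latex
\underbrace{P(x_3 \mid x_2)}_{\substack{\text{single masked token:}\\ \text{not identifiable}}}
\qquad \text{vs.} \qquad
\underbrace{P(x_1 \otimes x_3 \mid x_2)}_{\substack{\text{pairwise tensor statistic:}\\ \text{identifiable via Kruskal's theorem}}}
```

The left-hand conditional is a matrix that many distinct HMM parameterizations can match, whereas the right-hand statistic is a three-way tensor whose decomposition is unique under Kruskal's rank conditions, pinning down the transition and emission parameters.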
Unified discrete interpolation frameworks (Hu et al., 9 Dec 2024) treat both masked generative models and discrete-state diffusion models as points in a continuous design space:
- The progressive unmasking process is expressed via a probability path modulated by a monotone schedule $\kappa(t)$ that interpolates from fully masked at $t=0$ to fully unmasked at $t=1$.
- Conditional sampling and classifier-free guidance are incorporated in the same unified framework, supporting both generation and structured discriminative tasks (e.g., segmentation recast as unmasking).
- These approaches achieve strong results on ImageNet256, MS COCO, and FaceForensics, bridging generative and discriminative domains.
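A minimal sketch of how such a schedule drives the sampler's per-step budget, in pure Python: the cosine-style default mirrors MaskGIT-like schedules and is an illustrative assumption, not any particular paper's exact choice.

```python
import math

def unmask_schedule(n_tokens, steps,
                    kappa=lambda t: 1 - math.cos(0.5 * math.pi * t)):
    """Number of tokens revealed at each refinement step (a sketch).

    `kappa` maps t in [0, 1] to the fraction of tokens unmasked so far:
    a monotone schedule with kappa(0) = 0 and kappa(1) = 1. The
    cosine-style default reveals few tokens early (when context is
    weak) and many late, once most of the sequence is already fixed.
    """
    revealed = [round(n_tokens * kappa(t / steps)) for t in range(steps + 1)]
    return [revealed[t] - revealed[t - 1] for t in range(1, steps + 1)]
```

Swapping in a linear `kappa` recovers the uniform reveal rate; the whole design space of such schedules is what the unified framework parameterizes.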
6. Variants and Future Directions
Innovations continue to emerge:
- Masked prediction without explicit MASK tokens, via Partition Generative Models (Deschenaux et al., 24 May 2025), achieves greater computational efficiency and throughput (up to 5× improvement), with self-distillation through time (SDTT) providing further inference gains.
- Partial masking (Prime) (Chao et al., 24 May 2025) introduces fine-grained sub-token intermediate states, reducing idle sampling steps and leading to improved perplexity and FID scores over competitive AR and hybrid models, without the limitations of order and latency.
- Hybrid models (combining AR and NAR sections within a sequence) strike a balance between generation quality and throughput (Ziv et al., 9 Jan 2024, Xi et al., 30 May 2025).
- Adaptive and continual masking, structural guidance, and non-autoregressive token critics are being explored across video, audio, and language modeling.
Limitations involve conditional independence assumptions, which may restrict capture of complex dependencies, and the need for careful scheduling, masking strategies, and architectural adaptation to specific data modalities. Research into richer forms of context propagation, advanced loss formulations, and efficient discrete-state modeling is ongoing.
7. Applications and Impact Across Domains
Non-autoregressive and masked prediction frameworks have demonstrated efficacy in:
- Machine translation, offering near-parity in BLEU with state-of-the-art AR models and lower inference latency (Ghazvininejad et al., 2019, Ghazvininejad et al., 2020, Ghazvininejad et al., 2020, Xiao et al., 2023).
- Speech recognition (ASR), such as CTC-based and transducer-based architectures with accuracy and latency improvements (Chen et al., 2019, Higuchi et al., 2020, Futami et al., 2022, Xi et al., 30 May 2025).
- Computer vision, notably in masked image and video generation, where speed and diversity scale to high-resolution settings (256×256) and time-critical robotic applications (Gupta et al., 2022, Lezama et al., 2022, Ma et al., 2023, Hu et al., 9 Dec 2024).
- Audio and music generation, exemplified by MAGNeT’s span-masked token methodology for fast text-to-audio synthesis (Ziv et al., 9 Jan 2024).
- Trajectory prediction, employing masking for robust self-supervised representation learning and multi-granularity map/trajectory encoding (Xue et al., 2020, Chen et al., 2023).
These advances are poised to enable new classes of real-time, interactive, and highly parallelizable systems, while the evolution of masking, partitioning, and iterative refinement will continue to optimize the trade-off between efficiency and expressivity in structured generation.