Non-Autoregressive Models
- Non-autoregressive models are generative architectures that produce outputs in parallel, eliminating strict token dependencies for faster inference.
- They achieve dramatic speedups by processing outputs simultaneously but require specialized strategies like iterative refinement and knowledge distillation to mitigate information loss.
- Recent advances integrate multiresolution strategies and proxy objectives to narrow the performance gap with autoregressive models while maintaining efficiency.
Non-autoregressive (NAR) models are a family of generative or predictive models for sequences, grids, or general outputs, characterized by the absence of causal, sequential dependence at inference—output elements are generated in parallel or in a small number of rounds, in contrast to the strictly left-to-right, token-by-token process of autoregressive (AR) architectures. NAR factorization offers dramatic speedups and alleviates error accumulation but imposes strong conditional independence assumptions, driving a need for specific architectures, training regimes, and auxiliary objectives to overcome the loss of inter-output dependencies.
1. Fundamental Factorization and Conditional Independence
In classic AR models, the conditional probability of the output sequence $y = (y_1, \dots, y_T)$ given input $x$ is factorized as

$$p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x),$$

where $y_{<t} = (y_1, \dots, y_{t-1})$ are the previously generated outputs. This chain structure enforces left-to-right dependency, requiring $T$ serial steps at inference.
A non-autoregressive model instead assumes that, conditioned on the input, all output tokens are independent:

$$p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid x).$$

As a result, all $y_t$ can be generated simultaneously in a single or small number of forward passes, reducing inference complexity from $O(T)$ to $O(1)$ decoding steps with respect to output length. This parallelization is realized in practice via architectural choices such as decoders with mask-free attention, query tokens, or multiresolution strategies (Ren et al., 2020, Feng et al., 2023, Shi et al., 2024).
However, this independence comes at the cost of "dropped" cross-token dependencies, often leading to the so-called multimodality problem: the model is unable to capture fine-grained local or structural correlations in the output $y$ (Huang et al., 2022, Ren et al., 2020).
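The contrast between the two factorizations can be sketched with a toy decoder. The "model" here is a stand-in lookup, not a real network: `step_fn` and `pos_fn` are hypothetical scoring functions, and the copy task is chosen so both decoders agree (with genuinely multimodal targets they generally would not).

```python
def ar_decode(x, step_fn, length):
    """Autoregressive: T serial steps, each conditioned on the prefix y_{<t}."""
    y = []
    for t in range(length):
        y.append(step_fn(x, tuple(y)))  # y_t ~ p(y_t | y_<t, x)
    return y

def nar_decode(x, pos_fn, length):
    """Non-autoregressive: all positions predicted independently, in one pass."""
    return [pos_fn(x, t) for t in range(length)]  # y_t ~ p(y_t | x)

# Trivial copy task: each target is readable from x alone, so the
# conditional-independence assumption costs nothing here.
x = ["a", "b", "c"]
ar_out = ar_decode(x, lambda x, prefix: x[len(prefix)], len(x))
nar_out = nar_decode(x, lambda x, t: x[t], len(x))
assert ar_out == nar_out == ["a", "b", "c"]
```

The NAR list comprehension has no data dependence between positions, which is exactly what permits batched parallel execution on hardware; the AR loop cannot be parallelized across `t`.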
2. Core Design Patterns and Model Classes
Several canonical NAR approaches have been developed across domains:
- Parallel Decoder Transformers: Standard Transformer decoders with attention masks removed, so all positions are predicted in one pass (Gu et al., 2017, Liu et al., 2022, Feng et al., 2023). Outputs are either directly predicted or iteratively refined.
- Connectionist Temporal Classification (CTC): Outputs are over an augmented vocabulary including a blank token, marginalizing over all monotonic alignments that "collapse" to the observed output (Schmidt et al., 2022, Shi et al., 2024, Ma et al., 2023).
- Latent Variable Models: Injecting continuous or discrete latent variables (e.g., flow-based or VAE-based priors) to capture inter-token dependencies otherwise lost in the NAR factorization (Ma et al., 2019, Schmidt et al., 2018).
- Iterative Mask-Predict or "Fill-and-Revise": Output slots are repeatedly masked and refilled, so the model gradually increases the fidelity of its predictions while sidestepping full autoregressive chains (Feng et al., 2023, Jiang et al., 2021, Patel et al., 18 Dec 2025).
- Sequence-level Matching and Reranking: In domains like recommendation, a parallel matching layer generates the entire output permutation at once, leveraging contrastive and sequence-level regularization (Ren et al., 2024).
- Multiresolution Divide-and-Conquer: Hierarchical approaches fill in output sequences at progressively finer temporal or spatial scales, always using reliable anchor context (Liu et al., 2019).
Variants further include models that predict insertion positions (insertion-based LMs), infill masked spans in diffusion-style processes, or use a fixed set of output queries processed in parallel (as in NARVL) (Shi et al., 2024, Patel et al., 18 Dec 2025).
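The CTC collapse operation referenced above is simple to state concretely: an alignment over the blank-augmented vocabulary is reduced by merging adjacent repeats and then dropping blanks. A minimal sketch (blank symbol chosen arbitrarily as `"-"`):

```python
def ctc_collapse(path, blank="-"):
    """Collapse a CTC alignment: merge adjacent repeated symbols, drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out

# Many alignments collapse to the same output; the CTC loss marginalizes
# over all of them during training.
assert ctc_collapse(list("hh-e-ll-lo")) == list("hello")
assert ctc_collapse(list("-h-ee-l-lo-")) == list("hello")
```

Note that a blank between two identical symbols preserves both (so repeated output tokens remain expressible), while adjacent repeats without a blank are merged.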
3. Training Objectives and Loss Functions
Naive maximum likelihood on independently predicted targets under NAR factorization underfits real data, since natural outputs are rarely fully conditionally independent. The following methods address these difficulties:
- Knowledge Distillation (KD): NAR students are trained on "cleaned" or disambiguated outputs from a strong AR teacher. This reduces the multimodality of targets, decreases target-side dependency, and narrows the AR–NAR accuracy gap (Ren et al., 2020, Gu et al., 2017, Schmidt et al., 2022, Huang et al., 2022).
- Alignment and Structural Constraints: Techniques such as CTC, fertility modeling, or auxiliary alignment losses enforce structural correspondence between source and target, which is crucial for tasks with monotonic or quasi-monotonic mappings (e.g., speech, text-to-speech, vision–language) (Gu et al., 2017, Schmidt et al., 2022, Shi et al., 2024, Ma et al., 2023).
- Iterative Training or Proxy Objectives: Objectives that mix in pseudo-targets, mask subsets (GLAT, MIST), or leverage dynamically refined alignments to recover some or all of the lost cross-token dependency (Jiang et al., 2021, Patel et al., 18 Dec 2025, Feng et al., 2023). The Maximum Proxy-Likelihood Estimation (MPLE) framework unifies these as likelihood maximization on a proxy distribution with reduced conditional total correlation (Huang et al., 2022).
- Specialized Losses: Sequence-level unlikelihood, contrastive regularization, or context-aware penalties are used as discriminators or regularizers to avoid repetition, reinforce diversity, or prioritize high-utility outputs (Ren et al., 2024, Su et al., 2021).
- CTC Marginalization and DP Decoding: Training with CTC loss marginalizes over all valid alignments, while dynamic programming decoders can enforce desired output length or structure efficiently (Liu et al., 2022, Ma et al., 2023).
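The CTC marginalization in the last bullet is computed with the standard forward (alpha) recursion over the blank-augmented target. Below is a minimal pure-Python sketch over explicit probabilities; real implementations work in log space and on batched tensors, and `probs[t][s]` here is an assumed per-step, conditionally independent distribution $p(s \mid x)$.

```python
def ctc_marginal(probs, target, blank="-"):
    """P(target | x) summed over all alignments that collapse to target.

    probs: list over time steps; probs[t] maps symbol -> p(symbol at t | x).
    """
    ext = [blank]                      # interleave blanks: -, y1, -, y2, -, ...
    for y in target:
        ext += [y, blank]
    T, S = len(probs), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][blank]      # start with blank ...
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]  # ... or with the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                       # stay
            if s > 0:
                a += alpha[t - 1][s - 1]              # advance one slot
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]              # skip a blank
            alpha[t][s] = a * probs[t][ext[s]]
    # Valid paths end on the final label or the trailing blank.
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)

# T=2, target "a": alignments "a-", "-a", "aa" collapse to "a".
# 0.6*0.4 + 0.4*0.6 + 0.6*0.6 = 0.84
steps = [{"a": 0.6, "-": 0.4}, {"a": 0.6, "-": 0.4}]
assert abs(ctc_marginal(steps, ["a"]) - 0.84) < 1e-9
```

The negative log of this marginal is the CTC training loss; the same alpha table (plus a beta pass) also supports the dynamic-programming decoders mentioned above.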
4. Applications and Empirical Performance
Machine Translation: NAR models yield 6–7× GPU speedups and ≈1.5–2.5× CPU speedups at the cost of small accuracy drops; best-in-class CTC+GLAT+Deep Supervision approaches achieve BLEU within 0.3 of AR with substantial inference gains (Schmidt et al., 2022). KD and alignment constraints further shrink the gap, particularly for languages with weaker target-side dependencies (Ren et al., 2020, Gu et al., 2017).
Text-to-Image: Emage demonstrates that NAR text-to-image models can approach the fidelity of strong AR baselines with a 50× latency reduction (FID ≈20, 1s/image at 256×256 on a V100) using VQGAN tokenization and iterative fill-and-revise decoding (Feng et al., 2023).
Vision–Language: NARVL's query-CTC loss enables constant-time parallel generation with competitive accuracy in grounding, entailment, captioning, and VQA, offering 2.4–12.7× speedups (Shi et al., 2024).
Human Motion Prediction and Time Series: Multitask NAR decoders avoid error accumulation, yielding accuracy improvements over, or parity with, AR baselines at both short- and long-term horizons (Li et al., 2020, Maulik et al., 2020, Shen et al., 2023).
Recommendation and Routing: NAR4Rec and GNARKD demonstrate high accuracy at extreme speedups in recommender reranking and VRP, with loss in optimality capped at 2–3% but 4–9× speedup over AR counterparts (Ren et al., 2024, Xiao et al., 2023).
| Domain | State-of-the-Art NAR Scheme | Latency vs. AR | Quality Gap |
|---|---|---|---|
| NMT | CTC+GLAT+DS | 6–7× faster | ≤0.3 BLEU |
| Text-to-Image | Emage Iterative NAR | 50× faster | +2.7 FID |
| Vision–Language (VQA) | NARVL Query-CTC | 12.7× faster | –1.8% acc. (VQA v2) |
| Recommender Rerank | NAR4Rec | 5× faster | +1.2% user metrics |
| VRP | GNARKD | 4–9× faster | 2–3% longer tours |
Task-dependent factors such as the conditional total correlation (a measure of target-side dependency) directly influence when NAR approaches can achieve AR-level performance (Huang et al., 2022, Ren et al., 2020).
5. Model Limitations and Theoretical Analyses
Conditional independence severely limits NAR models in tasks with strong target-side autocorrelations, leading to repeated tokens, local incoherence, or missing global structure (Ren et al., 2020, Huang et al., 2022). For unconditional or high-entropy generation, NAR models trained solely by maximum likelihood match only the marginals, with information loss precisely governed by the conditional total correlation of the target given the input. The minimum achievable KL divergence between the true data distribution and any NAR model is lower-bounded by this total correlation (Huang et al., 2022).
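The bound can be checked numerically on a toy distribution. For a joint over two tokens, the best any independence-factorized model can do is match the marginals, and the residual KL equals the total correlation $\sum_i H(p_i) - H(p)$. Below, a perfectly coupled two-token "target" (a stand-in for a maximally multimodal output) pays exactly 1 bit:

```python
from math import log2

def kl(p, q):
    """KL divergence in bits between distributions given as dicts."""
    return sum(pv * log2(pv / q[k]) for k, pv in p.items() if pv > 0)

# Toy joint: the two tokens are perfectly coupled (two equally likely modes).
p = {(0, 0): 0.5, (1, 1): 0.5}

# Marginals over each position -- the optimal independent (NAR-style) fit.
m1, m2 = {0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5}
q = {(a, b): m1[a] * m2[b] for a in m1 for b in m2}

# KL(p || best product fit) = H(y1) + H(y2) - H(y1, y2) = 1 + 1 - 1 = 1 bit.
assert abs(kl(p, q) - 1.0) < 1e-12
```

Sampling independently from the product fit also produces the invalid outputs `(0, 1)` and `(1, 0)` half the time, which is the multimodality failure in miniature.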
NAR models thus require proxy objectives (e.g., KD, alignment, masked-predict) to construct less multimodal training targets, with theoretical and empirical work quantifying the trade-off between parallelism and information loss (Huang et al., 2022, Ren et al., 2020). Flow-based and latent-variable NAR models (e.g., FlowSeq) offer increased capacity but add architectural and training complexity (Ma et al., 2019, Schmidt et al., 2018).
In practice, iterative refinement methods reduce the information gap by partially recovering conditional dependencies while maintaining most NAR inference speed. Training and evaluation thus require careful balance:
- Weakening the independence assumption (e.g., via masked refinement, iterative infilling) improves fidelity at modest speed cost (Feng et al., 2023, Jiang et al., 2021).
- Over-distilled or excessively simplified proxy targets can degrade model robustness and generalization (Huang et al., 2022).
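The masked-refinement loop in the first bullet can be sketched as follows. Everything here is an illustrative stand-in: `predict_fn` plays the role of a parallel decoder returning tokens and confidences, the mask token and the linear remasking schedule are common but assumed choices.

```python
MASK = "<m>"

def mask_predict(predict_fn, length, iterations):
    """Iterative mask-predict: fill every masked slot in parallel, then
    re-mask the least-confident predictions and refine over a few rounds."""
    y = [MASK] * length
    for i in range(iterations, 0, -1):
        tokens, confs = predict_fn(y)              # parallel fill of all slots
        n_mask = (length * (i - 1)) // iterations  # linear mask-ratio schedule
        order = sorted(range(length), key=lambda t: confs[t])
        y = list(tokens)
        for t in order[:n_mask]:                   # re-mask lowest confidence
            y[t] = MASK
    return y

# Oracle stand-in: always proposes the target, with fixed confidences, so the
# low-confidence middle token gets one extra refinement round.
target = ["a", "b", "c"]
oracle = lambda y: (target, [0.9, 0.5, 0.8])
assert mask_predict(oracle, 3, iterations=2) == target
```

With `iterations=1` this degenerates to pure one-shot NAR decoding; increasing `iterations` trades a small constant factor of latency for reintroduced conditional dependence via the high-confidence anchors.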
6. Progressive Advances and Domain-Specific Adaptations
Modern NAR research connects transformer-style architectures (NAT), error-correction schemes (iterative infill), probabilistic latent variables (flow, VAE), and structural modeling (CTC, fertility, insertion) under unified frameworks (Gu et al., 2017, Ma et al., 2019, Patel et al., 18 Dec 2025, Jiang et al., 2021). Domain-specific innovations are prominent:
- Speech and TTS: NAR models can fully close the AR–NAR gap, as target-side dependencies are weak; alignment/duration constraints are straightforward (Ren et al., 2020).
- Time Series Forecasting: NAR diffusion models with tailored conditioning (future mixup, autoregressive initialization) outperform AR/diffusive baselines and yield two to three orders of magnitude speed improvements (Shen et al., 2023, Maulik et al., 2020).
- Simultaneous and Streaming Tasks: NAST demonstrates CTC-style parallel writing with chunked upsampling, offering low-latency, high-quality output for SiMT under strict read/write regimes (Ma et al., 2023).
- Reranking/Combinatorics: Non-autoregressive matching models with sequence-level unlikelihood effectively scale to large candidate sets with dynamic item pools, as in large-scale recommendation (Ren et al., 2024).
7. Challenges, Evaluation, and Future Directions
Several challenges persist:
- Training stability and convergence, especially in high-entropy domains, often necessitate strong initialization (e.g., pretrained encoders such as CLIP or BERT), robust loss schedules, and careful hyperparameter optimization (Feng et al., 2023, Su et al., 2021).
- Evaluation must be standardized (e.g., sacreBLEU for NMT) due to BLEU variations up to 1.7 points with different tokenization, and both CPU and GPU latencies should be reported to reflect real-world deployment effects (Schmidt et al., 2022).
- Extending NAR principles to broader classes of tasks—e.g., iterative CTC with non-monotonic alignments, controllable text/image synthesis via latent-variable NAR models—remains technically rich.
Emerging lines of work include:
- Learning or adapting proxy distributions and alignments end-to-end for minimal information loss (Huang et al., 2022).
- Hybrid models that interpolate between AR and NAR by selectively injecting autoregressive dependencies or chaining refinement steps (Feng et al., 2023, Jiang et al., 2021).
- Further theoretical exploration of information-theoretic performance bounds, the role of conditional total correlation, and parallelism-fidelity frontiers.
In sum, non-autoregressive models constitute a broad, highly active area in sequence and structured prediction, yielding principled acceleration across language, vision, time series, and combinatorial domains, but necessitate specialized modeling, training, and evaluation regimes to mitigate the inherent limits imposed by output independence (Feng et al., 2023, Ren et al., 2020, Gu et al., 2017, Shi et al., 2024, Patel et al., 18 Dec 2025, Li et al., 2020, Huang et al., 2022, Ma et al., 2023, Kurnosikov et al., 2022, Xiao et al., 2023).