Non-Autoregressive Generation (NAG) Models
- Non-Autoregressive Generation (NAG) is a class of sequence models that predict tokens in parallel, using conditional independence assumptions to reduce latency.
- Key methodologies include iterative refinement, latent-variable models, and hybrid AR/NAR strategies that improve quality while maintaining speed.
- NAG models offer significant efficiency gains but face challenges in capturing long-range dependencies and addressing multimodal outputs.
Non-Autoregressive Generation (NAG) refers to a class of sequence generation models that dispense with the stepwise, left-to-right dependency structure of classic autoregressive (AR) decoders. In NAG, output tokens—or batches of tokens—are predicted independently or with only limited dependency, often in parallel. This paradigm is motivated by the need to dramatically reduce inference latency and enhance parallelism, particularly important for industrial, interactive, or real-time applications across natural language, audio, vision, and recommendation domains. Core challenges of NAG include capturing long-range inter-token dependencies, resolving multi-modality, and preserving the overall output quality relative to AR models. Research on NAG encompasses a broad spectrum of modeling strategies, theoretical analyses, and practical system designs.
1. Autoregressive vs. Non-Autoregressive Factorizations
The AR framework factorizes the target sequence probability as

$$P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x),$$

meaning each output token directly conditions on all previous outputs. This sequential dependency enforces strict causal order but precludes parallel decoding.
By contrast, the canonical NAG factorization is

$$P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid x),$$

which assumes conditional independence of targets given the input, enabling full parallelism across positions $t = 1, \dots, T$. This factorization underlies early NAT models (Ren et al., 2020, Huang et al., 2022). For order-invariant outputs, such as bundles or unordered lists, NAG relaxes further to set-based formulations, modeling $P(\{y_1, \dots, y_K\} \mid x)$ and eliminating even positional dependence (Yang et al., 2024, Liu et al., 2023).
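The latency contrast between the two factorizations can be sketched in code. This is a minimal illustration, not any paper's implementation: `model`, `src`, and the greedy argmax selection are hypothetical stand-ins for a real encoder-decoder.

```python
import numpy as np

def ar_decode(model, src, T):
    """Autoregressive: T sequential model calls; token t conditions on y_{<t}."""
    y = []
    for _ in range(T):
        logits = model(src, y)          # (len(y)+1, vocab): scores incl. next slot
        y.append(int(np.argmax(logits[-1])))
    return y

def nar_decode(model, src, T):
    """Non-autoregressive: one model call; all T positions predicted in parallel
    under the factorization P(y|x) = prod_t P(y_t|x)."""
    logits = model(src, [0] * T)        # (T, vocab), e.g. from an all-placeholder input
    return [int(t) for t in np.argmax(logits, axis=-1)]
```

The AR loop makes `T` dependent forward passes, while the NAR path needs exactly one; this is the entire source of NAG's latency advantage, and also the point where inter-token dependency is discarded.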
Recent generalizations include:
- Iterative refinement (e.g., mask-predict, masked diffusion), where a sequence is constructed over a small, fixed number of refinement steps rather than one serial step per token (Ziv et al., 2024, Feng et al., 2023, Wu et al., 18 Feb 2026, Zeng et al., 18 Aug 2025).
- Latent-variable and hybrid models that reintroduce limited causal structure while maintaining substantial parallelism (Bao et al., 2019, Ziv et al., 2024, Han et al., 2020).
2. Architectures and Decoding Strategies
2.1 Fully Parallel Decoders
Classic NAG models rely on fully parallel decoders, often standard Transformer stacks with positional embeddings and cross-attention to the encoded source (Huang et al., 2022, Qi et al., 2022). Softmax classifiers predict all tokens simultaneously. Encoder-only (BERT-based) NAG is used for summarization and compression (Su et al., 2021, Jiang et al., 2021). In text-to-image, NAG uses VQVAE-based codebook decoders with CLIP-like text encoders (Feng et al., 2023).
2.2 Iterative Masked-Predict and Diffusion
To address multi-modality and compensate for the loss of dependency information, iterative NAG decoders repeatedly mask and predict (or refine) subsets of positions:
- MaskPredict/Refinement models: Predict tokens where the model is least certain, remask, and update across multiple rounds (Ziv et al., 2024, Feng et al., 2023, Wu et al., 18 Feb 2026).
- Masked Diffusion and Flow Matching: These approaches interpret NAG as a discrete diffusion or flow process over the token simplex, training denoisers across a continuum of corruption levels; samplers are designed to efficiently traverse this space via masking and denoising (Wu et al., 18 Feb 2026, Sevriugov et al., 2024).
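The mask-predict loop above can be sketched as follows; `predict_fn`, the linear unmasking schedule, and the confidence criterion are illustrative assumptions rather than any specific paper's recipe.

```python
import numpy as np

def mask_predict(predict_fn, T, steps=4, mask_id=-1):
    """Mask-predict decoding sketch: start fully masked, fill every masked slot
    with the model's argmax, then re-mask the least-confident positions and
    repeat. `predict_fn(tokens)` is a hypothetical model returning a (T, vocab)
    array of per-position probabilities."""
    tokens = np.full(T, mask_id)
    for step in range(steps, 0, -1):
        probs = predict_fn(tokens)                   # (T, vocab)
        preds, conf = probs.argmax(-1), probs.max(-1)
        tokens = np.where(tokens == mask_id, preds, tokens)  # fill masked slots
        n_mask = int(T * (step - 1) / steps)         # linear unmasking schedule
        if n_mask > 0:
            tokens[np.argsort(conf)[:n_mask]] = mask_id
    return tokens
```

Each round conditions on the tokens committed in earlier rounds, which is how iterative decoders recover some of the inter-token dependency lost in one-shot NAR prediction.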
2.3 Set & Bundle Decoders
In settings with order-invariant outputs, permutation-equivariant or "one-shot" decoders pool encoder representations and predict unordered sets using Hungarian-matched or order-agnostic cross-entropy losses (Yang et al., 2024, Liu et al., 2023).
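An order-agnostic loss of this kind can be sketched as below. For clarity this version brute-forces the matching over all permutations (only viable for small sets); practical systems solve the same assignment with the Hungarian algorithm in O(K^3). The function name and interface are assumptions for illustration.

```python
from itertools import permutations

import numpy as np

def set_nll(pred_logprobs, target_ids):
    """Order-agnostic loss sketch for set-valued outputs: take the minimum
    negative log-likelihood over all assignments of K unordered targets to
    K output slots.
    pred_logprobs: (K, vocab) log-probabilities for K output slots.
    target_ids: K unordered target token ids."""
    K = len(target_ids)
    best = float("inf")
    for perm in permutations(range(K)):
        nll = -sum(pred_logprobs[i, target_ids[p]] for i, p in enumerate(perm))
        best = min(best, nll / K)
    return best
```

Because the loss is minimized over assignments, the decoder is never penalized for emitting a correct set in a different slot order, which is exactly the invariance bundle and list-continuation tasks require.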
2.4 Hybrid and Multi-Stream Models
Some models combine AR and NAR elements:
- Hybrid AR/NAR switching: Generate early positions autoregressively, then switch to NAR for completion, trading off quality vs latency (Ziv et al., 2024).
- Multi-Stream masking: Train a unified model to support both AR and NAR decoding via appropriately structured attention masks (Qi et al., 2022).
- Latent-position models: Treat output positions or permutations as explicit latent variables, partially recovering dependency while maintaining parallelism (Bao et al., 2019).
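The multi-stream idea of serving AR and NAR decoding from one model reduces, at its core, to switching the decoder's attention mask. A minimal sketch, with the function name and 0/1 convention as assumptions:

```python
import numpy as np

def decoder_attention_mask(T, mode):
    """Attention-mask sketch for a unified AR/NAR decoder: "ar" gives the
    causal lower-triangular mask, "nar" lets every position attend everywhere,
    so a single parameter set can support both decoding styles.
    Entry [i, j] = 1 means position i may attend to position j."""
    if mode == "ar":
        return np.tril(np.ones((T, T), dtype=int))
    if mode == "nar":
        return np.ones((T, T), dtype=int)
    raise ValueError(f"unknown mode: {mode}")
```

Training with both masks exposes the same weights to both dependency structures, which is what allows a hybrid decoder to generate a causal prefix and then complete the rest in parallel.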
3. Training Objectives and Theoretical Analysis
NAG models are commonly trained with per-token cross-entropy objectives under the (possibly conditional) independence assumptions of their respective factorizations (Huang et al., 2022, Qi et al., 2022). However, this naively fits only the marginal label distributions $P(y_t \mid x)$ and provably discards the conditional total correlation (multi-information) present in the dataset (Huang et al., 2022, Ren et al., 2020). This loss of dependency information is the principal reason for the quality gap between NAG and AR models.
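A two-token toy example makes the discarded dependency concrete. Suppose the data contains two equally likely targets, "AB" and "BA". Per-token cross-entropy fits only the marginals, which are uniform at both positions, so the best factorized model spreads mass over all four combinations and puts half its probability on sequences that never occur:

```python
# Best factorized fit to a dataset of two equally likely targets, "AB" and "BA":
# the marginals P(y_1) and P(y_2) are uniform over {A, B}, so the product model
# assigns 0.25 to each of AA, AB, BA, BB -- including the unseen AA and BB.
data = ["AB", "BA"]
marg1 = {c: sum(s[0] == c for s in data) / len(data) for c in "AB"}
marg2 = {c: sum(s[1] == c for s in data) / len(data) for c in "AB"}
joint = {a + b: marg1[a] * marg2[b] for a in "AB" for b in "AB"}
print(joint)  # {'AA': 0.25, 'AB': 0.25, 'BA': 0.25, 'BB': 0.25}
```

This is the multi-modality failure in miniature: the model mixes two valid modes into invalid outputs, and no amount of per-token training can fix it without reintroducing dependency.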
Proxy Distribution and MPLE Perspective
Successful NAG variants can be reframed as maximizing the likelihood on proxy target distributions with lower total correlation—e.g., knowledge-distilled outputs, simplex-interpolated flows, or iteratively denoised states—plus a distortion regularizer to ensure proxy fidelity (Huang et al., 2022, Sevriugov et al., 2024). Techniques such as knowledge distillation (Ren et al., 2020, Qi et al., 2022), input/output noise schedules (Wu et al., 18 Feb 2026), or length curriculum learning (Liu et al., 2023) empirically lower the burden of learning inter-token dependency in NAG.
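Sequence-level knowledge distillation, the most common of these proxy constructions, can be sketched in a few lines; `teacher_decode` is a hypothetical function mapping a source to the AR teacher's single (e.g. beam or greedy) output.

```python
def distill_targets(teacher_decode, pairs):
    """Sequence-level knowledge distillation sketch: replace each reference
    target with the AR teacher's deterministic output for the same source.
    The distilled corpus has lower total correlation (one target per source),
    so a conditionally independent NAR student fits it more easily."""
    return [(src, teacher_decode(src)) for src, _ in pairs]
```

The student never sees the original multimodal references, trading some fidelity to the true data distribution for a proxy the factorized model can actually represent.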
Other objective augmentations include:
- Maximum mutual information (MMI) losses: Promote backward/forward consistency and response diversity (Han et al., 2020).
- Order-agnostic/combination losses: For unordered targets, minimum-weight matchings or permutation-marginalized CE are used (Yang et al., 2024).
- RL and multi-agent counterfactuals: Sentence-level rewards optimize global sequence properties by viewing output positions as "agents" that cooperate to maximize, e.g., BLEU or CIDEr (Guo et al., 2021).
4. Experimental Outcomes and Task-Specific Adaptations
NAG has attained competitive or superior performance to AR on several domains:
- Speech and TTS: When target-token dependency is low (e.g., in TTS), NAG matches AR models both in MOS and in end-to-end distortion metrics, leveraging strong source–target alignment (Ren et al., 2020, Jiang et al., 2023).
- Machine Translation: Fully NAR models lag AR by several BLEU on unconstrained data, but hybrid (position-latent, distillation, or curriculum learning) approaches can close this gap (Bao et al., 2019, Qi et al., 2022, Huang et al., 2022, Ren et al., 2020).
- Summarization, Compression, and Recommendation: BERT-based NAG and curriculum-learned models match or exceed AR speed–accuracy frontiers, yielding 6–14x latency improvements without quality loss (Su et al., 2021, Liu et al., 2023, Yang et al., 2024).
- Dialogue generation: Non-AR MMI yields significant BLEU and diversity gains over AR MMI, as backward dependency can be exploited during token selection (Han et al., 2020).
- Audio and vision: Single-stage, mask-predict NAG models reach comparable FAD and CLAP scores to AR baselines on text-to-audio, producing 7x speedups (Ziv et al., 2024). In image synthesis, NAG (with iterative refinement) achieves modest FID gaps while being 50x faster (Feng et al., 2023).
- Multi-turn dialogues: ToolACE-MT's turn-level non-AR mask-and-fill framework yields substantial data generation efficiency and higher-quality agentic dialogues vs. AR simulators (Zeng et al., 18 Aug 2025).
5. Limitations and Failure Modes
Despite computational advantages, NAG models face several intrinsic limitations:
- Lack of explicit token dependency: Pure NAR may miss long-range or structural consistency (e.g., agreement, anaphora, fine-grained details in vision or language) (Huang et al., 2022, Ren et al., 2020, Feng et al., 2023).
- Multi-modality: The independent prediction leads to "mode collapse," mixing disparate valid completions (Qi et al., 2022, Jiang et al., 2021). This is partially mitigated by knowledge distillation, latent variables, or iterative refinement.
- Fixed-length inflexibility: NAG often requires a predicted/fixed length, which can be relaxed by dynamic EoS schemes, ratio-first decoders, or length predictors (Su et al., 2021, Jiang et al., 2021, Bao et al., 2019).
- Efficiency–quality tradeoffs: Finer-grained iterations or hybrid AR/NAR increase quality but reduce parallelism, requiring careful tuning per domain (Ziv et al., 2024, Feng et al., 2023, Bao et al., 2019).
- Order-invariance for sets: Standard AR or NAR models embed unwanted bias in unordered set tasks; permutation-equivariant networks and order-agnostic loss functions are required (Yang et al., 2024, Liu et al., 2023).
6. Design Patterns and Emerging Directions
6.1 Architectural Patterns
Empirical studies recommend:
- Span-based masking and schedulers: Matching mask size to the tokenizer receptive field optimizes both quality and speed in mask-predict decoders (Ziv et al., 2024).
- Order-agnostic/latent position encodings: Explicitly modeling positions as latents or leveraging permutation-invariant architectures addresses set-valued outputs (Bao et al., 2019, Yang et al., 2024).
- Efficient attention substitutes: Attentive MLP (AMLP) layers yield linear complexity and outperform other low-memory attention approximations in long-input/long-output tasks (Jiang et al., 2023).
- Rescorer fusion and hybrid decoding: Utilizing external AR models or cascaded AR/NAR blocks improves FAD and KL metrics, with marginal latency costs (Ziv et al., 2024, Sevriugov et al., 2024).
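The masking schedulers recommended above determine how many positions remain masked at each refinement round. A minimal sketch, where the cosine variant is an assumption in the style of confidence-based mask-predict decoders rather than a scheme from the cited works:

```python
import math

def masks_remaining(step, total_steps, T, kind="cosine"):
    """Masking-schedule sketch for iterative decoders: how many of T positions
    stay masked after `step` of `total_steps` refinement rounds. The cosine
    variant unmasks few tokens early, when predictions are least reliable,
    and many tokens late."""
    frac = step / total_steps
    if kind == "cosine":
        return int(T * math.cos(math.pi / 2 * frac))
    return int(T * (1 - frac))  # linear alternative
```

The schedule shape directly trades quality against parallelism: front-loading unmasking saves rounds but commits more tokens before context is available.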
6.2 Training and Optimization Practices
- Knowledge distillation: AR teacher outputs (hard or soft) reduce target correlation and smooth the target modes, mitigating NAG's multi-modality (Qi et al., 2022, Ren et al., 2020, Huang et al., 2022).
- Mixup and dynamic augmentation: Dynamic pseudo-target mixup (MIST) and curriculum learning (step-wise masking schedules) aid convergence, particularly for transformer-based NAG (Jiang et al., 2021, Liu et al., 2023).
- Multi-agent RL and global rewards: Framing output positions as cooperating agents in a multi-agent RL (MARL) setting, with counterfactual credit assignment, improves global coherence without trading off speed (Guo et al., 2021).
7. Empirical Benchmarks and Quantitative Trade-offs
Table: Representative NAG performance and speedup (selected results)
| Domain | Model (NAG ref) | Metric/Quality | AR vs NAG Speedup | Comments |
|---|---|---|---|---|
| MT (WMT'14) | NAT w/ KD (Ren et al., 2020) | BLEU ≈27.1 vs 33.9 | 13.5x | Gap remains post-KD |
| Summarization | BNAG-CRF (Su et al., 2021) | R-L 33.28 vs 33.43 | 6.7–14x | BERT backbone, CRF top |
| Dialogue | NonAR+MMI (Han et al., 2020) | BLEU 2.68 vs 2.10 | ≫1x | Highest human “agree” score |
| TTS | FastSpeech2-NAR (Jiang et al., 2023) | MOS 3.79 vs 3.82 | ~10x | NAR matches AR MOS |
| Audio | MAGNeT (Ziv et al., 2024) | FAD ~3.3 vs 3.7 | 7x | With external rescoring |
| Images | Emage (Feng et al., 2023) | FID 19.74 NAR vs 17 AR | 50x | Iterative, not fully one-pass |
| Bundle Rec. | BundleNAT (Yang et al., 2024) | Prec@5 .809 vs .550 | 16–126x | Fully permutation-equivariant |
A plausible implication is that the domain-specific strength of target-token dependency is the primary determinant of the AR–NAR performance gap, with TTS and item recommendation being most amenable to NAG, and machine translation remaining challenging without advanced proxying or hybridization (Ren et al., 2020, Huang et al., 2022, Jiang et al., 2023, Liu et al., 2023).
References:
- (Ziv et al., 2024) Masked Audio Generation using a Single Non-Autoregressive Transformer
- (Ren et al., 2020) A Study of Non-autoregressive Model for Sequence Generation
- (Huang et al., 2022) On the Learning of Non-Autoregressive Transformers
- (Yang et al., 2024) Non-autoregressive Personalized Bundle Generation
- (Bao et al., 2019) Non-autoregressive Transformer by Position Learning
- (Feng et al., 2023) Emage: Non-Autoregressive Text-to-Image Generation
- (Su et al., 2021) Non-Autoregressive Text Generation with Pre-trained Language Models
- (Jiang et al., 2021) Improving Non-autoregressive Generation with Mixup Training
- (Han et al., 2020) Non-Autoregressive Neural Dialogue Generation
- (Liu et al., 2023) FANS: Fast Non-Autoregressive Sequence Generation for Item List Continuation
- (Jiang et al., 2023) Attentive Multi-Layer Perceptron for Non-autoregressive Generation
- (Qi et al., 2022) A Self-Paced Mixed Distillation Method for Non-Autoregressive Generation
- (Wu et al., 18 Feb 2026) Discrete Stochastic Localization for Non-autoregressive Generation
- (Sevriugov et al., 2024) KL-geodesics flow matching with a novel sampling scheme
- (Zeng et al., 18 Aug 2025) ToolACE-MT: Non-Autoregressive Generation for Agentic Multi-Turn Interaction
- (Guo et al., 2021) Fast Sequence Generation with Multi-Agent Reinforcement Learning
- (Schmidt et al., 2018) Deep State Space Models for Unconditional Word Generation
- (Ren et al., 2023) Unlocking the Power of GANs in Non-Autoregressive Text Generation