Distilled Decoding: Efficient AR Sampling
- Distilled Decoding is a technique that compresses the sequential sampling of autoregressive models into one or a few forward passes using flow matching and conditional score distillation.
- It comprises two core frameworks—DD1, based on deterministic flow-matching mappings, and DD2, based on conditional score distillation—that drastically reduce inference time while closely approximating the teacher model's output quality.
- Experimental results on benchmarks like ImageNet and LAION-COCO demonstrate significant speed-ups (up to 238×) with only modest trade-offs in FID and minor visual artifacts.
Distilled Decoding (DD) refers to a family of methods enabling few-step—often one-step—sampling from image autoregressive (AR) models, compressing the traditional sequential generation pipeline into one or a few forward passes. These techniques leverage flow matching or conditional score distillation in latent embedding spaces to achieve drastic acceleration in sample generation from state-of-the-art AR models, with rigorously quantified trade-offs in sample quality. The main frameworks are Distilled Decoding 1 (DD1), based on flow matching, and Distilled Decoding 2 (DD2), grounded in conditional score distillation, each delineating progressively improved capabilities for one-step AR generation (Liu et al., 2024, Liu et al., 23 Oct 2025).
1. Motivation and Background
Autoregressive models factor the joint probability of a token sequence $x = (x_1, \dots, x_n)$ as

$$p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}),$$

imposing an inherently sequential sampling procedure. For high-dimensional data (e.g., images represented as hundreds to thousands of discrete tokens), this yields substantial inference latencies: state-of-the-art image AR models such as VAR and LlamaGen require between 10 and 256 iterative steps, translating to latencies of 0.1–5 seconds per 256×256 image on A100 GPUs (Liu et al., 2024). Historical fast-sampling approaches, including mask-based AR (e.g., MaskGIT, MAGE, MAR), blockwise/skip-token prediction, and speculative decoding, cannot collapse the generation process to one or two steps without substantial quality degradation, a failure attributed to their inability to capture the sequential conditional dependencies intrinsic to the AR factorization.
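As a toy illustration of why this factorization is slow, the sketch below (ours, not the VAR/LlamaGen code; `toy_ar_forward` is a hypothetical stand-in for a real Transformer) samples a sequence token by token, requiring one forward pass per token:

```python
import numpy as np

def toy_ar_forward(tokens, vocab_size, rng):
    # Stand-in for one Transformer forward pass conditioned on the prefix:
    # returns next-token logits (random here, purely for illustration).
    return rng.standard_normal(vocab_size)

def ar_sample(n_tokens, vocab_size=16, seed=0):
    rng = np.random.default_rng(seed)
    tokens, n_passes = [], 0
    for _ in range(n_tokens):              # inherently sequential loop
        logits = toy_ar_forward(tokens, vocab_size, rng)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(vocab_size, p=probs)))
        n_passes += 1                      # one network call per token
    return tokens, n_passes

tokens, n_passes = ar_sample(256)
print(n_passes)  # → 256: the latency DD aims to collapse to one pass
```

The `n_passes` counter makes the bottleneck explicit: cost scales linearly with sequence length, regardless of hardware parallelism within each pass.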
DD challenges the prevailing assumption that AR models are inextricably slow, demonstrating that it is possible to match or closely approximate the teacher model's output distribution in a dramatically reduced number of steps. The central problem addressed is how to compress a pre-trained AR model into a fast generator, with minimal sample quality degradation and without requiring the original training data.
2. Core Methodologies: Flow Matching and Conditional Score Distillation
2.1 Distilled Decoding 1 (Flow Matching)
DD1 constructs a deterministic mapping from an isotropic Gaussian prior to the original AR model's token-wise output space. This is achieved via a continuous-time flow-matching ODE that maps each Gaussian sample to an output codebook embedding:

$$\frac{dx_t}{dt} = v(x_t, t), \qquad x_0 = \epsilon \sim \mathcal{N}(0, I), \qquad x_1 = e,$$

where $v$ is the velocity field induced by the linear (rectified-flow) interpolation $x_t = (1-t)\,\epsilon + t\,e$, and $e \in \{e_1, \dots, e_V\}$ are codebook embeddings. Integrating from $t=0$ to $t=1$ maps noise samples to token embeddings according to the AR model's local conditionals (Liu et al., 2024).
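The per-token ODE can be integrated numerically. A minimal sketch (ours, under the simplifying assumption that the target embedding is known, so the conditional velocity reduces to $v = (e - x_t)/(1 - t)$) shows a Euler solver transporting a Gaussian sample onto a codebook embedding along a straight line:

```python
import numpy as np

def flow_to_embedding(eps, e, n_steps=100):
    # Euler integration of dx/dt = (e - x) / (1 - t), the conditional
    # rectified-flow velocity for a single known target embedding e.
    x, dt = eps.copy(), 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = (e - x) / (1.0 - t)   # straight-line velocity toward e
        x = x + dt * v            # Euler step
    return x

rng = np.random.default_rng(0)
codebook = rng.standard_normal((8, 4))   # toy codebook: 8 embeddings, dim 4
eps = rng.standard_normal(4)             # Gaussian prior sample
x1 = flow_to_embedding(eps, codebook[3])
print(np.allclose(x1, codebook[3], atol=1e-6))  # → True
```

In the actual method the velocity is given by the teacher's conditionals rather than a known target; DD1's point is precisely to avoid this iterative integration at inference time by distilling the whole map.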
Instead of solving this ODE during inference, DD1 distills the entire flow for the sequence into a single Transformer $F_\theta$, trained to regress from noise sequences to data sequences using a composite loss:
- Cross-entropy (over codebook indices)
- LPIPS distance (on embedding regression)
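The composite objective can be sketched as follows (a hedged toy version, not the paper's implementation: plain MSE stands in for the perceptual LPIPS term so the example is self-contained, and `student_logits`/`student_emb` are hypothetical outputs of the distilled model's two heads):

```python
import numpy as np

def dd1_loss(student_logits, student_emb, teacher_idx, codebook, alpha=1.0):
    # Cross-entropy term over codebook indices chosen by the teacher.
    logits = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    ce = -log_probs[np.arange(len(teacher_idx)), teacher_idx].mean()
    # Regression term toward the teacher-selected codebook embeddings
    # (MSE here; DD1 uses an LPIPS distance instead).
    reg = ((student_emb - codebook[teacher_idx]) ** 2).mean()
    return ce + alpha * reg

rng = np.random.default_rng(0)
V, d, n = 16, 8, 4                        # codebook size, embed dim, seq len
codebook = rng.standard_normal((V, d))
teacher_idx = rng.integers(0, V, size=n)  # teacher's token choices
loss = dd1_loss(rng.standard_normal((n, V)),
                rng.standard_normal((n, d)), teacher_idx, codebook)
print(loss > 0.0)
```

The weighting `alpha` between the two terms is our illustrative knob; the paper's actual balancing may differ.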
2.2 Distilled Decoding 2 (Conditional Score Distillation)
DD2 improves on DD1 by eliminating the reliance on an explicit deterministic mapping. Instead, it defines a conditional score in latent embedding space for each AR token position $i$:

$$s(x_t, t \mid x_{<i}) = \nabla_{x_t} \log p_t(x_t \mid x_{<i}),$$

where $p_t(\cdot \mid x_{<i})$ is the noised version of the teacher's conditional distribution over the $i$-th token embedding at diffusion time $t$. The key innovation is formulating a conditional score distillation (CSD) loss at every AR token position, whose gradient takes the score-difference form characteristic of score-distillation objectives:

$$\nabla_\theta \mathcal{L}_{\mathrm{CSD}} = \mathbb{E}_{i,\,t,\,\epsilon}\!\left[\, w(t)\,\big(s_\psi(x_t^{(i)}, t \mid x_{<i}) - s(x_t^{(i)}, t \mid x_{<i})\big)\, \frac{\partial x^{(i)}}{\partial \theta} \right],$$

where $x^{(i)} = G_\theta(\epsilon)_i$ and $w(t)$ is a time-dependent weighting; $G_\theta$ denotes the generator and $s_\psi$ the guidance network. This objective directly distills the teacher's AR conditional scores at all positions, matched even under imperfect context, circumventing the brittleness of memorizing a fixed mapping as in DD1 (Liu et al., 23 Oct 2025).
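The mechanics of a score-difference update can be seen in a deliberately tiny example (our schematic simplification, not the paper's loss: the "generator" is a learnable mean shift, and `teacher_score`/`guidance_score` are hypothetical stand-ins for the two score networks). The generator parameter is driven by the difference of the two scores evaluated at its own samples:

```python
import numpy as np

def csd_step(theta, noise, teacher_score, guidance_score, lr=0.1):
    x = theta + noise                                 # toy generator sample
    grad = -(teacher_score(x) - guidance_score(x))    # score-difference grad
    return theta - lr * grad.mean(axis=0)

# Toy target: the score of N(mu, I) at x is (mu - x); the guidance network
# tracks the generator's own distribution N(theta, I).
mu = np.array([2.0, -1.0])
theta = np.zeros(2)
rng = np.random.default_rng(0)
for _ in range(200):
    noise = rng.standard_normal((64, 2))
    theta = csd_step(theta, noise,
                     teacher_score=lambda x: mu - x,
                     guidance_score=lambda x: theta - x)
print(np.allclose(theta, mu, atol=0.1))  # → True: generator matched teacher
```

When the two scores agree everywhere the gradient vanishes, i.e., the generator's distribution has matched the teacher's conditional, which is the fixed point the CSD objective aims for.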
3. Network Architectures and Training Paradigms
Both DD1 and DD2 employ causal (decoder-only) Transformer backbones, initialized from the teacher AR model and adapted for continuous input modalities. Distinct input embedding schemes differentiate noise from data tokens, and a learnable class token is prepended. In DD1, the output heads comprise a $V$-way classifier over codebook indices and a $d$-dimensional embedding regressor (with $V$ the codebook size and $d$ the embedding dimension), with a head switch at an empirically selected intermediate step.
The DD2 workflow includes a guidance network $s_\psi$ trained to mimic the teacher's conditional scores, enabling on-the-fly score estimation for the generator $G_\theta$. AR-diffusion tuning—training an MLP head to predict RectFlow conditional scores on top of the original teacher weights—is critical for convergence.
Training is conducted over millions of synthetic noise–output pairs generated from the teacher model; DD does not require access to the teacher's original training data. DD1 is optimized with AdamW (EMA decay $0.9999$) for up to 120 epochs (VAR) or 70 epochs (LlamaGen) (Liu et al., 2024). DD2 further achieves a substantial reduction in training cost versus DD1 (Liu et al., 23 Oct 2025).
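The data-free aspect can be illustrated with a toy pair generator (our sketch: `toy_teacher` is a hypothetical stand-in that assigns each noise vector its nearest codebook entry, in place of the teacher's full sequential decoding):

```python
import numpy as np

def toy_teacher(eps, codebook):
    # Nearest-codebook assignment per position: a stand-in for running the
    # teacher AR model to produce target token indices from a noise seed.
    d2 = ((eps[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def make_pairs(n_pairs, seq_len, codebook, seed=0):
    # Training pairs come from the teacher itself, so the original training
    # data is never touched.
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(n_pairs):
        eps = rng.standard_normal((seq_len, codebook.shape[1]))
        pairs.append((eps, toy_teacher(eps, codebook)))  # (noise, targets)
    return pairs

rng = np.random.default_rng(1)
codebook = rng.standard_normal((16, 4))
pairs = make_pairs(8, seq_len=10, codebook=codebook)
print(len(pairs))  # → 8
```

In practice the expensive part is running the slow teacher once per pair, which is a one-time offline cost amortized over all subsequent fast inference.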
4. Sampling Procedures and Pseudocode
DD enables three key sampling regimes:
- One-step generation: Draw $X = (\epsilon_1, \dots, \epsilon_n) \sim \mathcal{N}(0, I)$ and output $F_\theta(X, t{=}1)$ (DD1) or $G_\theta(X)$ (DD2).
- Two-step/hybrid generation: Interpolate between initial and intermediate steps before projection to output, optionally mixing transformer inference and teacher AR sampling to trade off quality and speed.
- Guided generation: DD2's guidance network provides conditional score estimates at every position, supporting both direct output and hybrid refinement strategies (Liu et al., 2024, Liu et al., 23 Oct 2025).
Pseudocode for main sampling procedures is given directly in (Liu et al., 2024, Liu et al., 23 Oct 2025), and can be summarized as:
```
# DD1 one-step sampling
Input: distilled model F_θ, sequence length n
1. Draw X = (ε_1, …, ε_n) ~ N(0, I)
2. Return F_θ(X, t = 1)

# DD2 one-step sampling
Input: generator G_θ, sequence length n
1. Draw ε = (ε_1, …, ε_n) ~ N(0, I)
2. Return G_θ(ε)
```
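A runnable toy translation of the one-step procedure (ours: `f_theta` is a hypothetical stand-in for the distilled network, here just a fixed linear map) makes the structural point concrete: the entire noise sequence is consumed in a single call.

```python
import numpy as np

def one_step_sample(model, seq_len, dim, seed=0):
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((seq_len, dim))  # X = (eps_1, ..., eps_n)
    return model(eps)                          # one forward pass, all tokens

# Toy stand-in for the distilled model: a fixed linear map over embeddings.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
f_theta = lambda x: x @ W
out = one_step_sample(f_theta, seq_len=256, dim=4)
print(out.shape)  # → (256, 4)
```

Contrast with the sequential sampler in Section 1: here the per-token loop is gone entirely, which is the source of the reported speed-ups.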
5. Experimental Results
Experimental evaluation covers ImageNet-256 and LAION-COCO, using teacher AR models (VAR, LlamaGen) as baselines. Main outcomes:
| Model | Steps | FID | Params | Speed-up vs. teacher |
|---|---|---|---|---|
| VAR-d16 (teacher) | 10 | 4.19 | 310 M | 1× |
| VAR-d16-DD1 (1-step) | 1 | 9.94 | 327 M | 6.3× |
| VAR-d16-DD2 (1-step) | 1 | 5.43 | 600 M | 8.0× |
| LlamaGen-L (teacher) | 256 | 4.11 | 343 M | 1× |
| LlamaGen-L-DD1 (1-step) | 1 | 11.35 | 326 M | 217.8× |
| LlamaGen-L-DD2 (1-step) | 1 | 8.59 | 343 M | 238× |
For text-to-image generation (LAION-COCO), LlamaGen-T2I-DD1 achieves a similarly large speed-up, with only a modest FID increase from 25.70 (teacher) to 28.95 (2-step DD1). Baseline fast-sampling methods such as skip-token and set-prediction yield drastically higher FID, signifying qualitative collapse (Liu et al., 2024, Liu et al., 23 Oct 2025).
Ablation studies indicate that FID improves rapidly during initial training epochs, that the selection of intermediate step has only a minor effect on convergence, and that DD remains effective even with moderately reduced dataset sizes.
6. Quality-Performance Tradeoffs, Limitations, and Open Directions
The performance of DD approaches is fundamentally upper-bounded by the teacher AR model; a quality gap remains, with typical one-step DD1 exhibiting an FID gap of roughly $5$–$7$ points versus the teacher, and DD2 reducing this gap by up to 67% (e.g., 1-step FID drops from 9.55 to 5.43 on ImageNet-256) (Liu et al., 23 Oct 2025). Residual artifacts such as minor blurriness may appear in DD samples, especially in fine-scale image detail.
Extending DD to very large vocabulary or extremely long sequences (e.g., in LLMs) remains an open challenge, due to scaling bottlenecks in codebook size and contextual representation dimensionality. A fully teacher-free distillation regime—possibly via consistency models or direct score-based modeling—has not yet been demonstrated for AR trajectories. The precise compute versus sample quality trade-off curve in AR models is unsettled. DD findings suggest that current AR sampling may be computationally sub-optimal, with significant redundancy in step-wise generation uncovered by collapse to one-step or two-step decoding.
7. Implications and Research Directions
- DD methodologies refute the notion that AR models are inherently bound to slow, stepwise generation, demonstrating that near-teacher quality is achievable in a single forward pass.
- Conditional score distillation, as realized in DD2, offers a versatile mechanism for aligning generative models without explicit mapping memorization, and enables robust training by supplying gold-standard conditional scores even in off-nominal generation contexts.
- A plausible implication is that further advances in guidance network capacity, hybrid multi-step refinement, or the extension of these frameworks to continuous-space AR settings (e.g., diffusive token models) may narrow the quality gap still further.
- These results suggest new paradigms for hybrid generative models, leveraging the synergy of AR conditionals and score-based modeling.
Distilled Decoding thus forms a foundational technology for efficient, scalable autoregressive generation in high-dimensional domains, with broad applicability to vision and beyond (Liu et al., 2024, Liu et al., 23 Oct 2025).