
Distilled Decoding: Efficient AR Sampling

Updated 15 February 2026
  • Distilled Decoding is a novel technique that compresses the sequential sampling of autoregressive models into one or a few forward passes using flow matching and conditional score distillation.
  • It leverages two core frameworks—DD1 for deterministic mapping and DD2 for conditional score refinement—to drastically reduce inference time while closely approximating teacher model output quality.
  • Experimental results on benchmarks like ImageNet and LAION-COCO demonstrate significant speed-ups (up to 238×) with only modest trade-offs in FID and minor visual artifacts.

Distilled Decoding (DD) refers to a new family of methodologies enabling few-step, often one-step, sampling from image autoregressive (AR) models, compressing the traditional sequential generation pipeline into a single forward pass or a minimal set of them. These techniques leverage flow matching or conditional score distillation in latent embedding spaces to achieve drastic acceleration of sampling from state-of-the-art AR models, with rigorously quantified trade-offs in sample quality. The main frameworks are Distilled Decoding 1 (DD1), based on flow matching, and Distilled Decoding 2 (DD2), grounded in conditional score distillation, each delineating progressively improved capabilities for one-step AR generation (Liu et al., 2024, Liu et al., 23 Oct 2025).

1. Motivation and Background

Autoregressive models factor the joint probability of a sequence $z = (q_1, \ldots, q_n)$ as

$$p(z) = \prod_{i=1}^{n} p(q_i \mid q_{<i}),$$

imposing an inherently sequential sampling procedure. For high-dimensional data (e.g., images with $n \sim 10^2$–$10^3$ discrete tokens), this yields substantial inference latency: state-of-the-art image AR models such as VAR and LlamaGen require between 10 and 256 iterative steps, translating to latencies of 0.1–5 seconds per 256×256 image on A100 GPUs (Liu et al., 2024). Historical fast-sampling approaches, including mask-based AR (e.g., MaskGIT, MAGE, MAR), blockwise/skip-token prediction, and speculative decoding, cannot collapse generation to one or two steps without substantial quality loss, attributable to their failure to capture the sequential conditional dependencies intrinsic to the AR factorization.
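The sequential bottleneck can be made concrete with a minimal sketch: generating $n$ tokens costs $n$ forward passes over a growing context. The `logits_fn` stand-in below is hypothetical and not an interface from the cited papers; it only illustrates the call pattern.

```python
import numpy as np

def ar_sample(logits_fn, n_tokens, vocab_size, rng):
    """Sequential AR sampling: token i is drawn from p(q_i | q_<i),
    so generating n tokens costs n forward passes over the context."""
    tokens = []
    for _ in range(n_tokens):
        logits = logits_fn(tokens)              # one forward pass per token
        probs = np.exp(logits - logits.max())   # numerically stable softmax
        probs /= probs.sum()
        tokens.append(int(rng.choice(vocab_size, p=probs)))
    return tokens

rng = np.random.default_rng(0)
toy_model = lambda ctx: np.zeros(16)            # toy: uniform conditionals
sample = ar_sample(toy_model, n_tokens=8, vocab_size=16, rng=rng)
```

The loop is the latency source DD targets: it cannot be parallelized across positions because each conditional depends on all previously drawn tokens.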

DD challenges the prevailing assumption that AR models are inextricably slow, demonstrating that it is possible to match or closely approximate the teacher model's output distribution in a dramatically reduced number of steps. The central problem addressed is how to compress a pre-trained AR model into a fast generator, with minimal sample quality degradation and without requiring the original training data.

2. Core Methodologies: Flow Matching and Conditional Score Distillation

2.1 Distilled Decoding 1 (Flow Matching)

DD1 constructs a deterministic mapping from an isotropic Gaussian prior $\pi_0(x) = \mathcal{N}(0, I)$ to the original AR model's token-wise output space. This is achieved via a continuous-time flow-matching ODE that maps each Gaussian sample to the output codebook embedding:

$$v^*(x, t) = \frac{\sum_{j=1}^{V} p_j \, (c_j - x) \exp\left(-\frac{\|x - t c_j\|^2}{2(1-t)^2}\right)}{\sum_{j=1}^{V} p_j \exp\left(-\frac{\|x - t c_j\|^2}{2(1-t)^2}\right)},$$

where $p_j = p_\Phi(q_i = c_j \mid q_{<i})$ and $\{c_j\}$ are the codebook embeddings. Integrating $\dot{x} = v^*(x, t)$ maps noise samples $\epsilon_i \sim \mathcal{N}(0, I)$ to token embeddings according to the AR model's local conditionals (Liu et al., 2024).
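A small numerical sketch of this velocity field and its Euler integration, assuming the standard Gaussian-mixture posterior weighting with per-component variance $(1-t)^2$; the toy codebook and uniform conditionals are illustrative, not taken from the papers:

```python
import numpy as np

def flow_velocity(x, t, probs, codebook):
    """Posterior-weighted flow velocity: softmax weights over
    log p_j - ||x - t*c_j||^2 / (2*(1-t)^2), applied to (c_j - x)."""
    d2 = ((x[None, :] - t * codebook) ** 2).sum(axis=1)
    logw = np.log(probs) - d2 / (2.0 * (1.0 - t) ** 2)
    w = np.exp(logw - logw.max())    # shift by max for numerical stability
    w /= w.sum()
    return (w[:, None] * (codebook - x[None, :])).sum(axis=0)

# Euler integration of dx/dt = v*(x, t) starting from Gaussian noise (toy)
rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 4))   # V=32 embeddings of dimension C=4
probs = np.full(32, 1.0 / 32.0)       # uniform token conditionals (toy)
x = rng.normal(size=4)                # epsilon ~ N(0, I)
for t in np.linspace(0.0, 0.99, 100):
    x = x + 0.01 * flow_velocity(x, t, probs, codebook)
```

With a single codebook entry the posterior collapses and the velocity reduces exactly to $c - x$, which is a quick sanity check on the weighting.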

Instead of solving this ODE during inference, DD1 distills the entire flow for the sequence into a Transformer $F_\theta$, trained to regress from noise sequences $(\epsilon_1, \ldots, \epsilon_n)$ to data sequences $(q_1, \ldots, q_n)$ using a composite loss:

  • Cross-entropy (over codebook indices)
  • LPIPS distance (on embedding regression)
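A hedged sketch of such a composite objective, substituting a plain L2 embedding term for LPIPS (which requires a pretrained perceptual network); the array shapes and weighting are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def dd1_loss(pred_logits, pred_embed, target_idx, codebook, l2_weight=1.0):
    """Composite distillation loss: cross-entropy over codebook indices
    plus an embedding regression term (L2 here, standing in for LPIPS)."""
    # cross-entropy over the V-way classifier head (log-softmax per row)
    logp = pred_logits - np.log(np.exp(pred_logits).sum(axis=1, keepdims=True))
    ce = -logp[np.arange(len(target_idx)), target_idx].mean()
    # regression of the C-dim head onto the target codebook embeddings
    l2 = ((pred_embed - codebook[target_idx]) ** 2).sum(axis=1).mean()
    return ce + l2_weight * l2

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))    # V=16 entries, C=4 dims (toy)
target = np.array([3, 7, 7, 1])        # teacher-decoded token indices
logits = rng.normal(size=(4, 16))
embed = codebook[target]               # perfect regression: L2 term is zero
loss = dd1_loss(logits, embed, target, codebook)
```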

2.2 Distilled Decoding 2 (Conditional Score Distillation)

DD2 improves on DD1 by eliminating the reliance on an explicit deterministic mapping. Instead, it defines a conditional score in latent embedding space for each AR token, expressed as:

$$s_\Phi(x_t, t \mid q_{<i}) = -\frac{\sum_{j=1}^{V} p_j \, (x_t - (1-t) c_j) \exp\left(-\frac{\|x_t - (1-t) c_j\|^2}{2 t^2}\right)}{t^2 \sum_{j=1}^{V} p_j \exp\left(-\frac{\|x_t - (1-t) c_j\|^2}{2 t^2}\right)}.$$

The key innovation is formulating a conditional score distillation (CSD) loss at every AR token position:

$$\mathcal{L}_{\mathrm{CSD}}(\theta) = \mathbb{E}_{\varepsilon_i, t_i} \sum_{i=1}^{n} d\left( s_\Phi(q_i^{t_i}, t_i \mid \mathrm{sg}(q_{<i})), \; s_\psi(q_i^{t_i}, t_i \mid \mathrm{sg}(q_{<i})) \right),$$

where $q_i^{t_i} = (1-t_i) q_i + t_i \varepsilon_i$, $G_\theta$ denotes the generator, and $s_\psi$ the guidance network. This objective directly distills the teacher's AR conditional scores at all positions, matching them even under imperfect context, and thereby avoids the brittleness of memorizing a fixed mapping as in DD1 (Liu et al., 23 Oct 2025).
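The conditional score and one term of the CSD loss admit a compact numerical sketch, assuming an L2 distance for $d$; the toy probability vectors stand in for the teacher's and guidance network's conditionals at one token position:

```python
import numpy as np

def cond_score(x_t, t, probs, codebook):
    """Gaussian-smoothed mixture score: posterior-weighted
    -(x_t - (1-t)*c_j) / t^2, mirroring the DD2 expression."""
    diff = x_t[None, :] - (1.0 - t) * codebook
    logw = np.log(probs) - (diff ** 2).sum(axis=1) / (2.0 * t ** 2)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return -(w[:, None] * diff).sum(axis=0) / t ** 2

def csd_term(q_i, t, teacher_probs, guide_probs, codebook, rng):
    """One token position of the CSD loss: L2 distance between teacher
    and guidance scores at the noised embedding q_i^t."""
    eps = rng.normal(size=q_i.shape)
    x_t = (1.0 - t) * q_i + t * eps             # q_i^{t_i}
    s_teacher = cond_score(x_t, t, teacher_probs, codebook)
    s_guide = cond_score(x_t, t, guide_probs, codebook)
    return float(((s_teacher - s_guide) ** 2).sum())

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))
p_teacher = np.full(16, 1.0 / 16.0)
q_i = codebook[3]
zero_gap = csd_term(q_i, 0.5, p_teacher, p_teacher, codebook, rng)
```

When the guidance conditionals equal the teacher's, the two scores coincide and the term vanishes; any mismatch in the conditionals produces a positive penalty, which is what drives the guidance network toward the teacher.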

3. Network Architectures and Training Paradigms

Both DD1 and DD2 employ causal (decoder-only) Transformer backbones, initialized from the teacher AR model and adapted for continuous input modalities. Distinct input embedding schemes differentiate noise from data tokens, and a learnable class token is prepended. In DD1, the output heads include both a $V$-way classifier and a $C$-dimensional regressor, with a head switch at an empirically selected intermediate step.

The DD2 workflow includes a guidance network $s_\psi$ trained to mimic the teacher's conditional scores, enabling on-the-fly score estimation for the generator $G_\theta$. AR-diffusion tuning, i.e., training an MLP head to predict RectFlow conditional scores on top of the original teacher weights, is critical for convergence.

Training is conducted over millions of synthetic pairs $(\text{noise sequence}, \text{decoded AR output})$ generated from the teacher model. DD does not require access to the teacher's original training data. Optimization for DD1 uses AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$, learning rate $1\times 10^{-4}$, EMA 0.9999), with up to 120 epochs (VAR) or 70 epochs (LlamaGen) (Liu et al., 2024). DD2 further achieves up to a $12.3\times$ reduction in training cost versus DD1 (Liu et al., 23 Oct 2025).
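This data-free setup can be sketched as an offline pair-generation loop; the `teacher_decode` stand-in below is hypothetical, representing the slow, full AR sampling of the teacher that is run once per pair:

```python
import numpy as np

def make_distillation_pairs(teacher_decode, n_pairs, seq_len, dim, rng):
    """Build (noise sequence, decoded AR output) training pairs from the
    teacher alone -- no access to its original training data is needed."""
    pairs = []
    for _ in range(n_pairs):
        noise = rng.normal(size=(seq_len, dim))   # (eps_1, ..., eps_n)
        target = teacher_decode(noise)            # slow AR decode, done offline
        pairs.append((noise, target))
    return pairs

rng = np.random.default_rng(0)
toy_teacher = lambda eps: np.tanh(eps)            # toy stand-in decoder
pairs = make_distillation_pairs(toy_teacher, n_pairs=4, seq_len=8, dim=4, rng=rng)
```

The expensive teacher decoding is amortized: it happens once per training pair at dataset-construction time, while the distilled student pays only a single forward pass at inference.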

4. Sampling Procedures and Pseudocode

DD enables three key sampling regimes:

  • One-step generation: Draw $X = (\epsilon_1, \ldots, \epsilon_n) \sim \mathcal{N}(0, I)$ and output $F_\theta(X, t=1)$ (DD1) or $G_\theta(\epsilon)$ (DD2).
  • Two-step/hybrid generation: Interpolate between initial and intermediate steps before projection to output, optionally mixing transformer inference and teacher AR sampling to trade off quality and speed.
  • Guided generation: DD2's guidance network $s_\psi$ provides conditional score estimates at every position, supporting both direct output and hybrid refinement strategies (Liu et al., 2024, Liu et al., 23 Oct 2025).

Pseudocode for main sampling procedures is given directly in (Liu et al., 2024, Liu et al., 23 Oct 2025), and can be summarized as:

Input: distilled model F_θ, sequence length n
1. Draw X = (ε_1, …, ε_n) ~ N(0, I)
2. Return F_θ(X, t=1)

Input: generator G_θ
1. Draw ε = (ε_1, …, ε_n) ~ N(0, I)
2. Return G_θ(ε)

5. Experimental Results

Experimental evaluation covers ImageNet-256 and LAION-COCO, using teacher AR models (VAR, LlamaGen) as baselines. Main outcomes:

| Model | Steps | FID | Params | Speed-up vs. teacher |
|---|---|---|---|---|
| VAR-d16 (teacher) | 10 | 4.19 | 310 M | – |
| VAR-d16-DD1 (1-step) | 1 | 9.94 | 327 M | 6.3× |
| VAR-d16-DD2 (1-step) | 1 | 5.43 | 600 M | 8.0× |
| LlamaGen-L (teacher) | 256 | 4.11 | 343 M | – |
| LlamaGen-L-DD1 (1-step) | 1 | 11.35 | 326 M | 217.8× |
| LlamaGen-L-DD2 (1-step) | 1 | 8.59 | 343 M | 238× |

For text-to-image generation (LAION-COCO), LlamaGen-T2I-DD1 achieves a $\sim 93\times$ speed-up, with only a modest FID increase from 25.70 (teacher) to 28.95 (2-step DD1). Baseline methods such as skip-token and set-prediction yield FID $\gg 100$, signifying qualitative collapse (Liu et al., 2024, Liu et al., 23 Oct 2025).

Ablation studies indicate that FID improves rapidly during initial training epochs, that the selection of intermediate step tt has only a minor effect on convergence, and that DD remains effective even with moderately reduced dataset sizes.

6. Quality-Performance Tradeoffs, Limitations, and Open Directions

The performance of DD approaches is fundamentally upper-bounded by the teacher AR model; a quality gap remains, with typical one-step DD1 exhibiting +5 to +7 FID versus the teacher, and DD2 reducing this gap by up to 67% (e.g., 1-step FID drops from 9.55 to 5.43 on ImageNet-256) (Liu et al., 23 Oct 2025). Residual artifacts such as minor blurriness may appear in DD samples, especially in fine-scale image detail.

Extending DD to very large vocabulary or extremely long sequences (e.g., in LLMs) remains an open challenge, due to scaling bottlenecks in codebook size and contextual representation dimensionality. A fully teacher-free distillation regime—possibly via consistency models or direct score-based modeling—has not yet been demonstrated for AR trajectories. The precise compute versus sample quality trade-off curve in AR models is unsettled. DD findings suggest that current AR sampling may be computationally sub-optimal, with significant redundancy in step-wise generation uncovered by collapse to one-step or two-step decoding.

7. Implications and Research Directions

  • DD methodologies refute the notion that AR models are bound to slow, stepwise generation, demonstrating that near-teacher quality is achievable with a single forward pass.
  • Conditional score distillation, as realized in DD2, offers a versatile mechanism for aligning generative models without explicit mapping memorization, and enables robust training by supplying gold-standard conditional scores even in off-nominal generation contexts.
  • A plausible implication is that further advances in guidance network capacity, hybrid multi-step refinement, or the extension of these frameworks to continuous-space AR settings (e.g., diffusive token models) may narrow the quality gap still further.
  • These results suggest new paradigms for hybrid generative models, leveraging the synergy of AR conditionals and score-based modeling.

Distilled Decoding thus forms a foundational technology for efficient, scalable autoregressive generation in high-dimensional domains, with broad applicability to vision and beyond (Liu et al., 2024, Liu et al., 23 Oct 2025).
