Speculative Jacobi Decoding (SJD)

Updated 23 March 2026

Speculative Jacobi Decoding is a training-free, parallel probabilistic algorithm that transforms sequential autoregressive text-to-image generation into a multi-token iterative process.
It employs methods such as probabilistic drafting, masked forward passes, and rejection sampling to ensure that each token's output distribution exactly matches that of standard AR sampling.
Enhancements like SJD++, MC-SJD, and SJD-PAC boost step compression up to 7× while maintaining quality metrics such as FID and CLIP, making it effective for high-resolution generation.

Speculative Jacobi Decoding (SJD) is a class of training-free, parallel, probabilistic decoding algorithms for accelerating sampling from high-dimensional discrete autoregressive (AR) models, with particular emphasis on text-to-image generation. SJD algorithms transform sequential next-token AR inference into a multi-token iterative process, allowing multiple candidate tokens to be sampled and verified in parallel within fixed windows, while maintaining the exact distribution induced by stochastic decoding strategies (e.g., top-K sampling). The SJD framework, its theoretical underpinnings, algorithmic details, enhancements, and empirical behavior are outlined below.

1. Background and Motivation

AR text-to-image models tokenize both text prompts and target images, then generate image token sequences sequentially by sampling from the conditional distributions $p_\theta(x_i|x_{1:i-1})$, where $\theta$ parameterizes a large transformer. High-resolution images require thousands of tokens, making conventional next-token generation prohibitively slow. Early acceleration methods, such as classical Jacobi decoding, which updates all tokens in parallel via fixed-point iteration, are only compatible with greedy decoding and thus cannot preserve the stochasticity required for sample diversity in visual domains (Teng et al., 2024). Training-based approaches (e.g., model distillation, multi-token fine-tuning) add significant complexity and are not always practical for large models.

Speculative Jacobi Decoding addresses these limitations by introducing a probabilistic acceptance mechanism that supports sampling-based decoding within the Jacobi framework, enabling large inference speedups without additional training (Teng et al., 2024, Teng et al., 8 Dec 2025).

2. Core Principles and Algorithm

SJD operates by iteratively drafting and verifying multiple tokens in parallel across a Jacobi window. Given a partial sequence prefix (accepted tokens), each iteration proceeds as follows:

Drafting: Draft $W$ candidate tokens in parallel (windowed positions), using either random initialization, model-based sampling, or spatially-informed priors.
Parallel Decoding: For the current draft, perform a masked forward pass to compute the conditional distribution $p_\theta(\cdot \mid x_{1:i-1}^{(j)})$ for each drafted position $i$, where $x_{1:i-1}^{(j)}$ denotes the accepted prefix together with the preceding draft tokens at iteration $j$.
Probabilistic Verification: A draft token $x_i^{(j)}$, drawn under the context of iteration $j-1$, is accepted with probability

$A_i = \min\left(1, \frac{p_\theta(x_i^{(j)} \mid x_{1:i-1}^{(j)})}{p_\theta(x_i^{(j)} \mid x_{1:i-1}^{(j-1)})}\right)$

via rejection sampling (Teng et al., 2024). If rejected, the token is resampled from the calibrated residual distribution

$q_i(x) \propto \max\left(0,\, p_\theta(x \mid x_{1:i-1}^{(j)}) - p_\theta(x \mid x_{1:i-1}^{(j-1)})\right)$

Token Acceptance and Window Advancement: Accepted tokens are appended to the output prefix. Remaining tokens are retained or re-initialized in the window for subsequent iterations. The window then slides forward to cover the next unaccepted positions.

This process ensures that, at each step, the marginal output for each token is identical to that of standard AR sampling from the model, thereby preserving distributional fidelity (Teng et al., 2024, Teng et al., 8 Dec 2025). The approach is completely training-free and model-agnostic.
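As a minimal sketch of the draft-and-verify loop described above, the following pure-Python toy runs the procedure on a small Markov chain standing in for the AR model. The names `verify_token`, `sjd_generate`, and `toy_model`, the uniform random initialization, and the rule of re-drafting positions after a rejection from the freshly computed conditionals are illustrative assumptions, not the authors' implementation. A real system would obtain all window conditionals from a single masked transformer forward pass rather than a Python loop.

```python
import random

def sample(probs, rng):
    """Draw one index from a categorical distribution (weights need not sum to 1)."""
    return rng.choices(range(len(probs)), weights=probs)[0]

def verify_token(token, p_new, p_old, rng):
    """Accept `token` (drawn from p_old) with probability min(1, p_new/p_old);
    otherwise resample from the residual max(0, p_new - p_old).
    Returns (emitted_token, accepted_flag)."""
    if rng.random() < min(1.0, p_new[token] / p_old[token]):
        return token, True
    residual = [max(0.0, a - b) for a, b in zip(p_new, p_old)]
    return sample(residual, rng), False

def sjd_generate(cond_dist, n_tokens, window, vocab, rng):
    """Toy SJD loop; `cond_dist(context)` is a hypothetical stand-in for the model."""
    uniform = [1.0 / vocab] * vocab
    prefix, steps = [], 0
    # Each draft is stored together with the distribution it was sampled from.
    drafts = [(rng.randrange(vocab), uniform) for _ in range(window)]
    while len(prefix) < n_tokens:
        steps += 1
        tokens = [t for t, _ in drafts]
        # Stand-in for one masked forward pass over prefix + current drafts.
        p_new = [cond_dist(prefix + tokens[:i]) for i in range(window)]
        next_drafts, accepting = [], True
        for i, (tok, p_old) in enumerate(drafts):
            if accepting:
                # Verify left to right; the first rejection emits a resampled token.
                out, accepting = verify_token(tok, p_new[i], p_old, rng)
                prefix.append(out)
            else:
                # Positions after the rejection become drafts for the next iteration.
                next_drafts.append((sample(p_new[i], rng), p_new[i]))
        while len(next_drafts) < window:  # refill the window with fresh drafts
            next_drafts.append((rng.randrange(vocab), uniform))
        drafts = next_drafts
    return prefix[:n_tokens], steps

# Hypothetical 3-symbol Markov chain used as the "model".
START = [0.6, 0.3, 0.1]
TRANS = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]

def toy_model(context):
    return TRANS[context[-1]] if context else START

if __name__ == "__main__":
    tokens, steps = sjd_generate(toy_model, 32, 8, 3, random.Random(0))
    print(len(tokens), "tokens in", steps, "iterations")
```

Every iteration emits at least one token, so the loop never needs more iterations than plain AR decoding; any saving comes from runs of consecutive acceptances.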

3. Practical Enhancements and Variants

Several variants and enhancements of SJD have been introduced:

SJD++: Introduces token reuse for high-confidence tokens after verification, increasing window throughput. The confidence ratio $C_i^{(j)} = p_\theta(x_i^{(j)}|j) / p_\theta(x_i^{(j)}|j-1)$, where $|j$ is shorthand for conditioning on the context at iteration $j$, governs whether draft tokens are retained in subsequent iterations (Teng et al., 8 Dec 2025). Selective reuse aims to maximize the growth of the accepted prefix per iteration.
Spatially Guided Initialization: Token initialization strategies that draw from adjacent (left, above) neighbor tokens or their conditional distributions leverage spatial locality and accelerate convergence, especially for structured or repetitive imagery (Teng et al., 2024, Teng et al., 8 Dec 2025).
SJD-PAC (Proactive Drafting & Adaptive Continuation): Addresses acceptance bottlenecks in high-entropy regions by:
- Proactive Drafting: Expands local candidate tree-search near rejections to increase the chance of multi-token acceptances in the next iteration.
- Adaptive Continuation: After the first rejection in the window, continues to verify subsequent positions against stale context, preserving valid tokens and reducing unnecessary resampling (Kang et al., 19 Mar 2026).
- Together, these methods raise the average acceptance length per step while strictly preserving distributional correctness via rejection sampling.
MC-SJD (Maximal Coupling SJD): Replaces independent draft-token sampling with maximally coupled draws to maximize the probability that consecutive drafts are identical, stabilizing context across iterations. Implemented via modified rejection sampling (MRS) or shared Gumbel noise, this requires only a single-line change to vanilla SJD and increases convergence speed for both images and videos without altering the marginal output law (So et al., 28 Oct 2025).
SJD² (Speculative Jacobi-Denoising Decoding): Introduces diffusion-guided denoising into Jacobi iterations. Models are fine-tuned to predict the next clean token from noise-perturbed embeddings; inference alternates explicit denoising steps with speculative token verification. This hybrid yields short, stable convergence trajectories akin to those of diffusion models while maintaining the AR sampling distribution, delivering about $4\times$ step compression in the benchmarks of Section 5 (Teng et al., 10 Oct 2025).
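To illustrate why coupling consecutive drafts stabilizes the context, the sketch below contrasts independent sampling with shared Gumbel noise, one of the two implementation routes mentioned for MC-SJD. The function names and the example distributions are assumptions made for illustration. Shared Gumbel noise preserves each marginal exactly, but this sketch does not claim the resulting coupling is exactly maximal.

```python
import math
import random

def gumbel_noise(vocab, rng):
    """One Gumbel(0, 1) sample per vocabulary entry."""
    noise = []
    for _ in range(vocab):
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        noise.append(-math.log(-math.log(u)))
    return noise

def gumbel_argmax(probs, noise):
    """Gumbel-max trick: argmax(log p + g) is an exact sample from `probs`."""
    scores = [math.log(p) + g if p > 0 else float("-inf")
              for p, g in zip(probs, noise)]
    return max(range(len(probs)), key=scores.__getitem__)

def agreement_rates(p_prev, p_curr, trials, rng):
    """Fraction of trials in which two consecutive drafts pick the same token,
    with shared noise versus independent noise."""
    shared = independent = 0
    for _ in range(trials):
        g = gumbel_noise(len(p_prev), rng)
        g_other = gumbel_noise(len(p_prev), rng)
        first = gumbel_argmax(p_prev, g)
        shared += first == gumbel_argmax(p_curr, g)            # same noise reused
        independent += first == gumbel_argmax(p_curr, g_other)  # fresh noise
    return shared / trials, independent / trials

if __name__ == "__main__":
    rates = agreement_rates([0.5, 0.3, 0.2], [0.4, 0.35, 0.25], 5000, random.Random(0))
    print("shared vs independent agreement:", rates)
```

When the two conditionals are close, reusing the noise makes the second draft repeat the first far more often than independent draws would, which is the property that lets more drafted tokens survive verification.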

4. Theoretical Properties and Guarantees

The central theoretical property of all SJD-class algorithms is losslessness: every iteration maintains the exact output distribution of the original autoregressive model under sampling. Formally, the speculative acceptance and calibrated resampling probabilities ensure that at every token position,

$\Pr\left(x_i = x \mid x_{1:i-1}\right) = p_\theta(x \mid x_{1:i-1})$

Proofs are provided via marginalization arguments and rely on properties of rejection sampling and, for MC-SJD, maximal coupling theory (So et al., 28 Oct 2025). When proactive drafting or adaptive continuation is employed, proofs by induction establish that the full acceptance-resampling process preserves the output law for every sub-sequence (Kang et al., 19 Mar 2026).
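The marginalization argument for a single position can be written out in a few lines. As shorthand (an assumption of this sketch, not the source's notation), let $p(x)$ be the target conditional under the current context and $q(x)$ the distribution the draft token was drawn from. The probability of emitting $x$ is the probability of drafting and accepting it, plus the probability of rejecting some draft and then resampling $x$ from the residual:

$\Pr(\text{output} = x) = q(x)\min\left(1, \frac{p(x)}{q(x)}\right) + \left(1 - \sum_y \min(p(y), q(y))\right)\frac{\max(0,\, p(x) - q(x))}{\sum_z \max(0,\, p(z) - q(z))}$

Because $\sum_z \max(0,\, p(z) - q(z)) = \sum_z \left(p(z) - \min(p(z), q(z))\right) = 1 - \sum_y \min(p(y), q(y))$, the second term reduces to $\max(0,\, p(x) - q(x))$ whenever rejection has nonzero probability, so

$\Pr(\text{output} = x) = \min(p(x), q(x)) + \max(0,\, p(x) - q(x)) = p(x)$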

A key practical consequence is that acceleration comes only from more aggressive prefix growth per step, never from approximate or lossy modifications, unless a variant such as lossy GSD or SJD² is explicitly invoked. SJD is thus suitable for production settings where exact distributional preservation is required.

5. Computational Complexity and Empirical Performance

Standard AR decoding requires $O(N)$ forward passes for a sequence of $N$ tokens (one per token). Jacobi-style parallelism, when naively applied, can exceed AR performance only for greedy decoding. SJD modifies this paradigm:

With window size $W$ and an average of $\bar{a}$ accepted tokens per iteration, a sequence of $N$ tokens needs roughly $N/\bar{a}$ iterations and corresponding forward passes. Vanilla SJD typically realizes $\bar{a} \approx 2$, yielding about $2\times$ speedup. With SJD++, MC-SJD, or SJD-PAC, $\bar{a}$ increases further, affording roughly $4\times$–$6\times$ step compression and $3\times$–$4\times$ wall-clock latency reduction in the benchmarks below (Teng et al., 2024, Teng et al., 8 Dec 2025, Kang et al., 19 Mar 2026, So et al., 28 Oct 2025).
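The link between acceptance and iteration count can be made concrete with a back-of-the-envelope model. The sketch below assumes, purely for illustration, that each window position is accepted independently with a fixed probability `alpha` and that the first rejected position still emits one resampled token. Under that assumption the expected number of tokens per iteration is $(1-\alpha^W)/(1-\alpha)$. Real acceptance rates vary with position and image content, so this is a rough estimate and not a result reported in the cited papers.

```python
import math

def expected_tokens_per_iteration(alpha, window):
    """Expected tokens emitted per iteration under i.i.d. acceptance (an assumption)."""
    if alpha >= 1.0:
        return float(window)  # every draft accepted: the whole window is emitted
    return (1.0 - alpha ** window) / (1.0 - alpha)

def expected_iterations(n_tokens, tokens_per_iteration):
    """Approximate number of forward passes: N divided by tokens per iteration."""
    return math.ceil(n_tokens / tokens_per_iteration)

if __name__ == "__main__":
    # Hypothetical acceptance probabilities for a 2048-token image, window of 16.
    for alpha in (0.0, 0.5, 0.8):
        per_iter = expected_tokens_per_iteration(alpha, window=16)
        print(alpha, round(per_iter, 2), expected_iterations(2048, per_iter))
```

With `alpha = 0` the model degenerates to one token per iteration, matching plain AR decoding, which is a useful sanity check on the formula.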

Benchmark results consistently show:

Configuration	Step Compression	Latency Speedup	FID (↓)	CLIP (↑)
Baseline AR	1×	1×	30.76	31.29
SJD	~2.2×	~2.0×	30.85	31.35
SJD++	6.44×	3.12×	31.48	31.52
SJD-PAC	4.51×	3.80×	30.69	31.21
MC-SJD	4.2×	3.8×	30.83	32.81
SJD²	4.02×	2.81×	31.40	31.80

These results are robust across model architectures (Lumina-mGPT, Emu3, LlamaGen, Janus-Pro), resolutions, and prompt types, and do not degrade output visual quality, diversity (top-K), or alignment metrics (FID, CLIP, GenEval, HPSv2) (Teng et al., 2024, Teng et al., 8 Dec 2025, Kang et al., 19 Mar 2026, Teng et al., 10 Oct 2025, So et al., 28 Oct 2025). MC-SJD and techniques like proactive drafting are particularly effective for long sequences and high-entropy token regions.
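Step compression and latency speedup differ in the table because one parallel iteration costs more than a single AR step. Assuming latency is simply the number of iterations times a per-iteration cost (a simplification, not a claim from the cited papers), the ratio of the two columns gives the implied relative cost of one iteration. The dictionary and function names below are hypothetical.

```python
# (step compression, latency speedup) copied from the table above;
# the vanilla SJD entries are approximate in the source.
RESULTS = {
    "SJD": (2.2, 2.0),
    "SJD++": (6.44, 3.12),
    "SJD-PAC": (4.51, 3.80),
    "MC-SJD": (4.2, 3.8),
    "SJD^2": (4.02, 2.81),
}

def implied_iteration_cost(step_compression, latency_speedup):
    """Cost of one iteration relative to one AR step,
    assuming latency = iterations * per-iteration cost."""
    return step_compression / latency_speedup

if __name__ == "__main__":
    for name, (steps, latency) in RESULTS.items():
        print(f"{name}: {implied_iteration_cost(steps, latency):.2f}x per iteration")
```

Read this way, a method with the highest step compression is not necessarily the fastest in wall-clock terms if its iterations are more expensive.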

6. Algorithmic Extensions and Ablations

The SJD research program includes several ablations and protocol choices:

Initialization Strategies: Random, neighbor copy, or neighbor-based distributional initialization; spatially-guided methods accelerate convergence, especially for structured visual data (Teng et al., 2024, Teng et al., 8 Dec 2025).
Window Size: In practice, sufficiently large window sizes $W$ are needed for significant speedup. Larger windows with MC-SJD or SJD-PAC further improve throughput (So et al., 28 Oct 2025, Kang et al., 19 Mar 2026).
Step Compression and Acceptance: Ablations on acceptance-length (tokens per step), rejection causes, and context stability show that maximal coupling and proactive drafting reshape the prefix-growth distribution, ensuring bursty acceptance events and substantially reducing iterations in practice (So et al., 28 Oct 2025, Kang et al., 19 Mar 2026).
Lossy Variants: Optional lossy (approximate) variants (e.g., GSD, SJD²) can further increase step compression but may slightly increase FID or reduce CLIP score. These are typically not used when strict losslessness is mandated (Kang et al., 19 Mar 2026, Teng et al., 10 Oct 2025).
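The neighbor-copy initialization listed above can be sketched for image tokens laid out in raster-scan order. The function name, the `mode` values, and the fallback to a uniformly random token when no neighbor exists are assumptions for illustration; the cited papers may define the strategy differently, including distributional variants not shown here.

```python
import random

def init_draft_tokens(accepted, window, width, vocab_size, mode="above", rng=None):
    """Initialize `window` draft tokens after the accepted prefix by copying a
    spatial neighbor on a grid with `width` tokens per row (raster order).
    Falls back to a random token when the requested neighbor does not exist."""
    rng = rng or random.Random()
    seq = list(accepted)  # working copy; earlier drafts can serve as neighbors
    drafts = []
    for _ in range(window):
        pos = len(seq)
        if mode == "left" and pos % width != 0:
            tok = seq[pos - 1]       # copy the token immediately to the left
        elif mode == "above" and pos >= width:
            tok = seq[pos - width]   # copy the token one row up
        else:
            tok = rng.randrange(vocab_size)
        seq.append(tok)
        drafts.append(tok)
    return drafts

if __name__ == "__main__":
    prefix = [10, 11, 12, 13, 14, 15]  # six accepted tokens on a 4-wide grid
    print(init_draft_tokens(prefix, window=3, width=4, vocab_size=100, mode="above"))
```

Copying from a neighbor gives the verifier a draft that is often plausible in smooth or repetitive regions, which is where the source reports the largest convergence gains.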

7. Significance, Limitations, and Outlook

Speculative Jacobi Decoding and its variants constitute the first class of training-free parallel decoding algorithms provably compatible with sampling-based AR generation—enabling inference acceleration even under highly non-greedy, diverse sampling regimes (Teng et al., 2024, Teng et al., 8 Dec 2025).

Key limitations include:

For highly entropic generative models, local acceptance rates may bottleneck, requiring further algorithmic improvements (addressed by SJD-PAC and MC-SJD) (Kang et al., 19 Mar 2026, So et al., 28 Oct 2025).
Denoising-based variants (SJD²) require lightweight training/fine-tuning for next-clean-token prediction, but substantially improve Jacobi trajectory stability for long or complicated sequences (Teng et al., 10 Oct 2025).
Fine-tuning for multi-token prediction, or integrating learned spatial/temporal priors, are plausible research directions for further efficiency gains (Teng et al., 2024, Teng et al., 8 Dec 2025).

SJD has been applied to both image and video AR generation, achieving several-fold acceleration for multi-billion-parameter models on industry-standard visual generation benchmarks, with potential applicability to other long-sequence generative domains (Teng et al., 8 Dec 2025, So et al., 28 Oct 2025).