Sequential Monte Carlo Speculative Decoding (SMC-SD)

Updated 4 July 2026
  • SMC-SD is a particle-based, approximate decoding method that replaces per-token rejection sampling with importance weighting and resampling for efficient inference.
  • It maintains a population of particles to generate multiple draft continuations, reweighting them to control approximation error while preserving theoretical guarantees.
  • By vectorizing verification and exploiting idle GPU compute, SMC-SD achieves substantial speedups over traditional autoregressive and speculative decoding methods.

Sequential Monte Carlo speculative decoding (SMC-SD) is an approximate, particle-based variant of speculative decoding designed to exploit idle compute during LLM inference. It replaces speculative decoding’s per-token rejection sampling with importance-weighted resampling over many draft continuations, turning verification into a fixed-size, vectorized operation that runs efficiently on GPUs. In the formulation introduced in "Faster LLM Inference via Sequential Monte Carlo" (Emara et al., 17 Apr 2026), SMC-SD trades exactness for additional speed while preserving theoretical bounds on its per-step approximation error. Closely related work also places multi-trajectory decoding in an explicit Sequential Monte Carlo framework for masked diffusion LLMs, where model-internal confidence acts as a trajectory-level reward; that formulation is not speculative decoding per se, but it maps naturally onto speculative or multi-proposal decoding for autoregressive LLMs (Luo et al., 2 Feb 2026).

1. Problem setting and conceptual motivation

An autoregressive LLM defines

$$p_\theta(\mathbf{x}) = p_\theta(\text{EOS} \mid \mathbf{x}_{<T+1}) \prod_{t=1}^T p_\theta(x_t \mid \mathbf{x}_{<t}),$$

so sampling is inherently sequential because each next token requires a forward pass conditioned on all previous tokens. In the setting emphasized for SMC-SD, these forward passes are predominantly memory-bandwidth-bound: the arithmetic intensity of a typical single-token forward pass is approximately

$$\text{AI}_{\mathrm{AR}} \approx \frac{2}{b},$$

for $b$ bytes per parameter, which is far below the GPU’s roofline ridge point. The principal bottleneck is therefore repeated weight loading together with token-by-token dependence, rather than raw FLOPs (Emara et al., 17 Apr 2026).
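As a rough numerical illustration of this gap, the sketch below evaluates $2/b$ for a few weight precisions and compares it with a ridge point computed as peak FLOP/s divided by memory bandwidth. The hardware figures are hypothetical placeholders chosen for round numbers, not measurements from the paper, and the function names are illustrative.

```python
def ar_arithmetic_intensity(bytes_per_param):
    """FLOPs per byte for one decoded token: ~2 FLOPs per parameter, b bytes loaded."""
    return 2.0 / bytes_per_param


def ridge_point(peak_flops, mem_bandwidth):
    """Arithmetic intensity (FLOP/byte) at which a kernel stops being memory-bound."""
    return peak_flops / mem_bandwidth


# Hypothetical accelerator: 1e15 FLOP/s peak, 2e12 bytes/s memory bandwidth.
ridge = ridge_point(1e15, 2e12)
for name, b in (("fp32", 4), ("bf16", 2), ("int8", 1)):
    ai = ar_arithmetic_intensity(b)
    print(f"{name}: AI = {ai:.2f} FLOP/byte, ridge / AI = {ridge / ai:.0f}")
```

Under these assumed numbers, single-token decoding sits two to three orders of magnitude below the ridge, which is the headroom that batched particle scoring is meant to use.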

Classical speculative decoding addresses this by pairing a large target model $p$ with a smaller proposal model $q$. The draft model proposes a block of $K$ tokens, and the target model verifies them in one pass using token-level rejection sampling, with acceptance probability

$$\alpha_j = \min\!\left(1,\; \frac{p(y_j \mid \mathbf{x}\,y_{<j})}{q(y_j \mid \mathbf{x}\,y_{<j})}\right).$$

This procedure is exact, but its speed-up is stochastic. When the draft diverges from the target, the draft block is truncated at the first rejection, later draft tokens are discarded, rollback occurs, and throughput collapses toward standard autoregressive decoding. SMC-SD is motivated by the observation that the same verification computation can instead support a population of speculative continuations, with disagreement handled by reweighting rather than outright rejection (Emara et al., 17 Apr 2026).
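The truncation behavior can be made concrete with a minimal sketch of token-level verification. It assumes per-position distributions are available as plain dictionaries, and it uses the residual-resampling step that is standard in the speculative sampling literature but is not spelled out in the text above. The function name and data layout are illustrative.

```python
import random


def verify_block(draft_tokens, p_dists, q_dists, rng=random):
    """Token-level rejection sampling over one drafted block.

    draft_tokens: list of K token ids proposed by the draft model.
    p_dists / q_dists: per-position dicts {token: prob} from target / draft.
    Returns the accepted prefix, plus one replacement token if a draft is rejected.
    """
    accepted = []
    for y, p, q in zip(draft_tokens, p_dists, q_dists):
        alpha = min(1.0, p.get(y, 0.0) / q[y])  # acceptance probability alpha_j
        if rng.random() < alpha:
            accepted.append(y)
            continue
        # Rejection: sample from the residual max(p - q, 0), renormalized,
        # then discard every later draft token (the truncation described above).
        residual = {t: max(p.get(t, 0.0) - q.get(t, 0.0), 0.0) for t in p}
        tokens, weights = zip(*residual.items())
        accepted.append(rng.choices(tokens, weights=weights)[0])
        break
    return accepted
```

When the draft matches the target, every token is accepted; a single early rejection discards the rest of the block, which is why the speed-up is stochastic.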

A recurrent misconception is that any replacement of rejection by weighting remains exact. The SMC-SD formulation explicitly does not make that claim. Standard speculative decoding remains an exact rejection sampler, whereas SMC-SD is a biased but controlled importance sampler whose accuracy improves with particle count (Emara et al., 17 Apr 2026).

2. Particle formulation and decoding procedure

SMC-SD maintains a population of particles

$$\{ (\mathbf{x}^{(n)}, w^{(n)}) \}_{n=1}^N,$$

where $\mathbf{x}^{(n)}$ is a partial sequence and $w^{(n)}$ is its importance weight. The target model is $p$, the proposal or draft model is $q$, and the working assumption is absolute continuity, $p \ll q$, so that if $q(y \mid \mathbf{x}) = 0$ then $p(y \mid \mathbf{x}) = 0$. From each particle prefix $\mathbf{x}^{(n)}$, the draft model generates a block of $K$ tokens,

$$y^{(n)}_{1:K} \sim q(\cdot \mid \mathbf{x}^{(n)}),$$

and the target model evaluates the corresponding block likelihood ratio

$$\frac{p(y^{(n)}_{1:K} \mid \mathbf{x}^{(n)})}{q(y^{(n)}_{1:K} \mid \mathbf{x}^{(n)})} = \prod_{j=1}^{K} \frac{p(y^{(n)}_j \mid \mathbf{x}^{(n)}\,y^{(n)}_{<j})}{q(y^{(n)}_j \mid \mathbf{x}^{(n)}\,y^{(n)}_{<j})}.$$

The particle weight update is therefore

$$w^{(n)} \leftarrow w^{(n)} \cdot \frac{p(y^{(n)}_{1:K} \mid \mathbf{x}^{(n)})}{q(y^{(n)}_{1:K} \mid \mathbf{x}^{(n)})},$$

followed by normalization

$$\bar{w}^{(n)} = \frac{w^{(n)}}{\sum_{m=1}^{N} w^{(m)}}.$$

These normalized weights define an empirical importance distribution over prefixes (with $\mathbf{x}^{(n)}$ now denoting the extended prefix),

$$\hat{p}_N(\mathbf{x}) = \sum_{n=1}^{N} \bar{w}^{(n)}\, \delta_{\mathbf{x}^{(n)}}(\mathbf{x}).$$

As $N \to \infty$, $\hat{p}_N$ converges to the true target distribution $p$ (Emara et al., 17 Apr 2026).
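In practice, products of per-token ratios underflow quickly, so the weight update is usually carried out in log space. The following is a minimal sketch of that bookkeeping, assuming per-token log-probabilities from both models are already available as nested lists; the function names are illustrative and not taken from the paper.

```python
import math


def update_log_weights(log_w, logp_blocks, logq_blocks):
    """log w <- log w + sum_j [log p(y_j | ...) - log q(y_j | ...)] per particle."""
    return [lw + sum(lp) - sum(lq)
            for lw, lp, lq in zip(log_w, logp_blocks, logq_blocks)]


def normalize(log_w):
    """Normalized weights via a softmax over particles (max-shift for stability)."""
    m = max(log_w)
    unnorm = [math.exp(lw - m) for lw in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

For example, two particles with one drafted token each, where the target agrees with the draft on the first and assigns half the draft's probability to the second, end up with normalized weights 2/3 and 1/3.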

Resampling replaces rejection as the central corrective mechanism. When weights become imbalanced, SMC-SD computes the effective sample size

$$\mathrm{ESS} = \frac{\left(\sum_{n=1}^{N} w^{(n)}\right)^2}{\sum_{n=1}^{N} \left(w^{(n)}\right)^2}$$

and, if the ESS falls below a chosen threshold, resamples ancestor indices from the categorical distribution induced by the normalized weights, resets the weights to uniform, and propagates the selected ancestors. A single batched target pass also produces one bonus token

$$y^{(n)}_{K+1} \sim p(\cdot \mid \mathbf{x}^{(n)}\,y^{(n)}_{1:K}),$$

so that every particle is extended by exactly $K+1$ tokens per round. There is no per-token accept/reject chain and no rollback; disagreement between $p$ and $q$ lowers a particle’s weight and therefore its chance of surviving resampling, rather than truncating the block (Emara et al., 17 Apr 2026).

In this sense, SMC-SD is a specialization of Sequential Importance Resampling or particle filtering in which particles are sequence prefixes, the proposal is the draft model $q$, the target is the LLM $p$, and each round performs propose, weight, and optionally resample (Emara et al., 17 Apr 2026).
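The propose, weight, and resample cycle can be sketched end to end on toy models. This is an illustration under stated assumptions, not the paper's implementation: `p` and `q` are hypothetical callables returning a `{token: prob}` dictionary for a prefix, the target is queried token by token rather than in one batched pass, the threshold parameter `ess_frac` is an invented name, and the bonus token is drawn after resampling (an ordering the text does not specify).

```python
import math
import random


def smc_sd_round(particles, log_w, p, q, K, ess_frac=0.5, rng=random):
    """One propose / weight / (optionally) resample round on toy models.

    particles: list of N token tuples (prefixes); log_w: their log-weights.
    p, q: callables mapping a prefix tuple to {token: prob} (target, draft).
    ess_frac: hypothetical threshold; resample when ESS < ess_frac * N.
    """
    N = len(particles)
    new_particles, new_log_w = [], []
    for x, lw in zip(particles, log_w):
        for _ in range(K):  # draft K tokens from q, accumulating log p - log q
            qd = q(x)
            toks, probs = zip(*qd.items())
            y = rng.choices(toks, weights=probs)[0]
            p_y = p(x).get(y, 0.0)
            lw += (math.log(p_y) if p_y > 0 else -math.inf) - math.log(qd[y])
            x = x + (y,)
        new_particles.append(x)
        new_log_w.append(lw)

    # Normalize and compute ESS (assumes at least one particle has nonzero weight).
    m = max(new_log_w)
    w = [math.exp(lw - m) for lw in new_log_w]
    total = sum(w)
    w = [wi / total for wi in w]
    ess = 1.0 / sum(wi * wi for wi in w)

    if ess < ess_frac * N:  # resample ancestors, reset weights to uniform
        idx = rng.choices(range(N), weights=w, k=N)
        new_particles = [new_particles[i] for i in idx]
        new_log_w = [0.0] * N

    # Bonus token: one extra draw from the target for every particle.
    out = []
    for x in new_particles:
        toks, probs = zip(*p(x).items())
        out.append(x + (rng.choices(toks, weights=probs)[0],))
    return out, new_log_w, ess
```

Every particle leaves the round exactly `K + 1` tokens longer regardless of how well the draft matched the target; only the weights, and hence survival under resampling, reflect the mismatch.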

3. Approximation theory and error control

The theoretical analysis of SMC-SD is framed through the importance resampling distribution. Under two stated assumptions on the draft–target pair, the paper establishes per-round control of the approximation error in the form of three bounds, covering bias and mean-square error.

The interpretation given is direct: per-step approximation error shrinks as the particle count $N$ grows, including in total variation, with constants governed by the draft–target mismatch. Increasing the number of particles tightens the approximation, while better proposal quality reduces the divergence term (Emara et al., 17 Apr 2026).

The same analysis relates effective sample size to draft length. For block length $K$, if $p_K$ and $q_K$ denote the corresponding block distributions, then in the large-$N$ limit

$$\frac{\mathrm{ESS}}{N} \to \frac{1}{1 + \chi^2(p_K \,\|\, q_K)}.$$

In an i.i.d.-token toy model,

$$1 + \chi^2(p_K \,\|\, q_K) = \bigl(1 + \chi^2(p_1 \,\|\, q_1)\bigr)^K,$$

so the ESS fraction behaves like $\bigl(1 + \chi^2(p_1 \,\|\, q_1)\bigr)^{-K}$. This yields the main practical heuristic: for a fixed draft–target mismatch, larger draft blocks degrade ESS exponentially, so maintaining a stable resampling regime requires either smaller $K$ or larger $N$ (Emara et al., 17 Apr 2026).
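The exponential decay is easy to see numerically. The sketch below assumes the standard large-$N$ identity $\mathrm{ESS}/N \approx 1/(1+\chi^2(p\,\|\,q))$ and the fact that $1+\chi^2$ multiplies across independent tokens; the two-token distributions are made up for illustration.

```python
def chi2(p, q):
    """Chi-square divergence chi^2(p || q) = sum_y p(y)^2 / q(y) - 1."""
    return sum(p[y] ** 2 / q[y] for y in p if p[y] > 0) - 1.0


def ess_fraction(p, q, K):
    """Large-N ESS/N for a block of K i.i.d. tokens: (1 + chi^2(p || q))^(-K)."""
    return (1.0 + chi2(p, q)) ** (-K)


# Hypothetical per-token target and draft distributions with a mild mismatch.
p = {"a": 0.6, "b": 0.4}
q = {"a": 0.5, "b": 0.5}
for K in (1, 4, 8, 16, 32):
    print(K, round(ess_fraction(p, q, K), 3))
```

Even a per-token divergence of 0.04, as in this toy pair, compounds over the block, which is why draft length and particle count have to be tuned together.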

The distinction from classical speculative decoding follows immediately. Classical speculative decoding has an exactness guarantee inherited from rejection sampling. SMC-SD instead provides controlled approximation. If $N = 1$, the method degenerates to a single trajectory with importance weighting, but it is not exact speculative decoding. Full multi-round analysis is explicitly described as more complex because resampling introduces correlations across rounds (Emara et al., 17 Apr 2026).

4. Computational structure and hardware alignment

The systems argument for SMC-SD is that LLM inference is memory-bandwidth-bound, so the arithmetic needed to draft particles and to score them in parallel comes nearly for free. Verification becomes a fixed-size batched operation whose token count is set by the number of requests, the number of particles per request $N$, and the draft length $K$. The paper gives arithmetic-intensity expressions for the target pass and for the full draft-and-verify cycle, the latter depending on the proposal-to-target parameter ratio. Compared with standard speculative decoding, SMC-SD increases arithmetic intensity by a factor that grows with the particle count, because many more tokens are scored per target weight load (Emara et al., 17 Apr 2026).

The roofline analysis separates two regimes. In the memory-bound regime, where the batched token count stays below the ridge-point batch size, a single SMC-SD step produces $K+1$ tokens per request and the resulting speed-up is independent of $N$, which formalizes the claim that particles are nearly free while the system remains below the ridge. In the compute-bound regime, speed degrades linearly in $N$ after the ridge is crossed (Emara et al., 17 Apr 2026).
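A toy roofline model reproduces both regimes qualitatively. The cost model here is an assumption made for illustration and is not the paper's expression: a memory-bound pass costs one weight load regardless of batch size, cost scales linearly with batched tokens past the ridge, and the draft model costs a fraction `r` of the target per pass. All parameter names are hypothetical.

```python
def smc_sd_speedup(K, r, tokens_in_batch, ridge_tokens):
    """Toy speed-up over autoregressive decoding, in units of one target weight load.

    K: draft length; r: assumed draft-to-target cost ratio per pass.
    tokens_in_batch: tokens scored together in one pass (grows with particle count).
    ridge_tokens: assumed batch size at which the pass becomes compute-bound.
    """
    scale = max(1.0, tokens_in_batch / ridge_tokens)  # 1.0 while memory-bound
    step_time = (1.0 + K * r) * scale  # K draft passes plus one target pass
    return (K + 1) / step_time  # K + 1 tokens per step vs. one token per AR pass


below = [smc_sd_speedup(4, 0.1, n, 64) for n in (8, 16, 32, 64)]
above = [smc_sd_speedup(4, 0.1, n, 64) for n in (128, 256)]
print(below, above)
```

In this model the speed-up is flat while the batch stays below the assumed ridge and halves each time the batch doubles beyond it.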

The implementation described for SMC-SD uses dynamic batching together with PagedAttention and RadixAttention. Resampling does not move KV-cache tensors; it swaps page-level pointers and reference counts, making the operation essentially metadata manipulation rather than tensor copying. The reported KV-cache reduction for the Llama-1B $\to$ 8B pair is 72.3% at the particle count and draft length used for that measurement. The experiments vary the particle count $N$, the draft length $K$, and the ESS threshold (Emara et al., 17 Apr 2026).
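The pointer-swapping idea can be illustrated with a small reference-counting sketch. It is a simplified stand-in, not the PagedAttention or RadixAttention API: page tables are plain lists of integer page ids, and copy-on-write for the page a particle is currently appending to is omitted.

```python
from collections import Counter


def resample_page_tables(page_tables, refcount, ancestors):
    """Resample particles by copying page-id lists, never KV tensors.

    page_tables: list of per-particle lists of page ids (hypothetical layout).
    refcount: Counter mapping page id -> number of particles referencing it.
    ancestors: list of ancestor indices chosen by the resampling step.
    Returns the new page tables and the page ids that became free.
    """
    new_tables = [list(page_tables[a]) for a in ancestors]
    for table in new_tables:  # children take their references first
        for pid in table:
            refcount[pid] += 1
    freed = []
    for table in page_tables:  # then the old generation releases its own
        for pid in table:
            refcount[pid] -= 1
            if refcount[pid] == 0:
                freed.append(pid)
                del refcount[pid]
    return new_tables, freed
```

Pages belonging only to particles that were not selected drop to a zero count and can be recycled, while shared prefix pages are never copied.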

5. Empirical performance and operating regimes

The main empirical claim is that SMC-SD can accelerate inference substantially while remaining close to target-model quality. On a Llama-1B $\to$ Llama-70B pair, the reported throughputs are approximately 65 tok/s for autoregressive decoding, 141 tok/s for optimized speculative decoding, 225 tok/s for speculative speculative decoding, and 342 tok/s for SMC-SD. From these figures, SMC-SD is roughly $5.3\times$ faster than autoregressive decoding and roughly $2.4\times$ faster than optimized speculative decoding. On Llama-1B $\to$ 8B and Qwen-0.5/3B $\to$ 14B model pairs, SMC-SD lies above and to the right of speculative decoding on speed–accuracy Pareto plots for GSM8K, MATH500, AlpacaEval, and DS1000 (Emara et al., 17 Apr 2026).

At an iso-accuracy operating point defined as within 3 percentage points of speculative decoding accuracy, SMC-SD achieves 1.1–2.5$\times$ throughput relative to speculative decoding. At a more aggressive point defined as within 10 percentage points, the reported gain is up to 3.4$\times$. Despite being approximate, it remains within about 3% of target accuracy on reasoning, instruction-following, and coding benchmarks, while more aggressive configurations stay within 10–15% (Emara et al., 17 Apr 2026).

Temperature behavior distinguishes SMC-SD from rejection-based speculative decoding. Because speculative decoding’s acceptance rate falls as temperature increases, its throughput deteriorates when the draft and target diverge. SMC-SD, by contrast, always produces $K+1$ tokens per round, so throughput is described as essentially temperature-invariant. Empirically, at low temperature it is roughly 1.5–2.3$\times$ faster than speculative decoding; at higher temperature it is 2–3$\times$ faster, while negative log-likelihood under the target model increases by only about 5% (Emara et al., 17 Apr 2026).

These results make the central trade-off explicit. Larger $K$ yields more tokens per target call but heavier importance weights and worse ESS. Larger $N$ improves approximation quality, but only until the hardware crosses from memory-bound to compute-bound execution. The reported Pareto frontiers are therefore parameterized jointly by proposal quality, particle count, draft length, and hardware utilization (Emara et al., 17 Apr 2026).

6. Relation to adjacent SMC-style decoding frameworks

The most direct conceptual precursor outside autoregressive speculative decoding is self-rewarding sequential Monte Carlo for masked diffusion LLMs. In that setting, the reverse diffusion path is treated as a latent trajectory, the proposal is the learned reverse kernel, and the Feynman–Kac potential is the joint probability of newly accepted tokens. Under a bootstrap proposal, the incremental SMC weight becomes exactly the product of token-level confidences for the newly accepted tokens, a quantity called trajectory-level confidence. That work is explicitly not about speculative decoding per se, but it gives a clean SMC formulation of multi-trajectory decoding and suggests a direct speculative analogue in which a draft model serves as proposal and the target model defines the potential through blockwise likelihood ratios or target scores (Luo et al., 2 Feb 2026).

A second line of related work studies speculative decoding for multi-sample inference without an auxiliary draft model. "Speculative Decoding for Multi-Sample Inference" builds a draft pool from parallel reasoning paths of the same LLM, organizes candidate continuations as a weighted DAG, and extracts a consensus draft sequence by a probabilistic aggregation mechanism. The method is not named SMC-SD, but the paper explicitly maps its ingredients to SMC-like ideas: trajectories behave like particles, cross-path suffix matches define local state neighborhoods, and DAG edge weights combine model likelihood with empirical support from the particle ensemble. Verification remains standard speculative decoding, so the final distribution is exact (Li et al., 7 Mar 2025).

Rollback-aware branch methods provide a different comparison point. SpecBranch introduces branch parallelism by identifying low-confidence branch points, spawning multiple candidate continuations via Top-$k$ sampling, and selecting among them after target verification. The paper models accepted draft length with a truncated geometric distribution, analyzes latency under rollback, and reports speedups against auto-regressive decoding together with a 50% reduction in rollback tokens for poorly aligned models. Its branches are analogous to particles, but the procedure uses hard selection rather than explicit particle weights or multi-particle survival (Shen et al., 16 May 2025).
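For intuition about the truncated geometric model, the sketch below assumes a constant, independent per-token acceptance rate `alpha` and a draft length `K`; SpecBranch's own parameterization may differ, so this is only the textbook form of the distribution.

```python
def accepted_length_pmf(alpha, K):
    """P(L = l) for l = 0..K: the first rejection occurs at position l + 1, or never."""
    pmf = [(alpha ** l) * (1.0 - alpha) for l in range(K)]
    pmf.append(alpha ** K)  # all K draft tokens accepted
    return pmf


def expected_accepted(alpha, K):
    """Mean number of accepted draft tokens under the truncated geometric model."""
    return sum(l * pr for l, pr in enumerate(accepted_length_pmf(alpha, K)))


for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(expected_accepted(alpha, 8), 2))
```

The mean saturates well below `K` when `alpha` is small, which is the rollback cost that branch-based methods and SMC-SD try to avoid in different ways.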

Speculative speculative decoding (SSD) provides yet another neighboring design. SSD and its optimized instantiation Saguaro run draft and target on separate devices, predict likely verification outcomes while verification is still in progress, and precompute speculative continuations in a cache keyed by those outcomes. The method is exact because it preserves standard speculative decoding verification. Although it is not formulated in SMC language, it is structurally close to transient particle branching: multiple future hypotheses are propagated in parallel, only the branch matching the realized verification outcome is retained, and the rest are discarded. The paper reports speed-ups over both optimized speculative decoding baselines and autoregressive decoding (Kumar et al., 3 Mar 2026).

Taken together, these lines of work suggest a broader interpretation of SMC-SD as one member of a family of multi-trajectory decoding methods. The distinctive feature of SMC-SD is that it replaces exact rejection or greedy branch commitment with explicit importance weighting and resampling over a particle population.

7. Limitations, misconceptions, and open directions

The primary limitation of SMC-SD is the one stated in its definition: it is approximate rather than exact. The theoretical guarantees are per-round bias and mean-square-error bounds, not a complete end-to-end characterization over many resampling rounds. Full multi-round analysis is described as more complex because of resampling-induced correlations. A second limitation is weight degeneracy. If the proposal and target are very poorly aligned, ESS decays quickly with $K$; practical remedies are to decrease draft length or increase particle count, but both adjustments affect the speed–accuracy trade-off (Emara et al., 17 Apr 2026).

A related misconception is that the availability of $p/q$ ratios automatically makes the scheme exact. In SMC-SD, the ratios are used to define importance weights, not to implement token-level rejection sampling. The method therefore preserves controlled approximation rather than exact sampling from the target. Conversely, it would also be mistaken to view SMC-SD as merely beam search under another name. Beam search is deterministic top-$k$ pruning by accumulated log-probability, whereas SMC-SD is a stochastic particle system with normalized weights and ESS-triggered resampling (Emara et al., 17 Apr 2026).
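The contrast with beam search can be stated in a few lines. This is a schematic comparison with illustrative function names; neither function is taken from the paper's code.

```python
import random


def beam_prune(candidates, log_scores, k):
    """Deterministic: keep the k candidates with the highest accumulated score."""
    order = sorted(range(len(candidates)), key=lambda i: log_scores[i], reverse=True)
    return [candidates[i] for i in order[:k]]


def smc_resample(particles, weights, rng=random):
    """Stochastic: draw N ancestors with probability proportional to weight."""
    return rng.choices(particles, weights=weights, k=len(particles))
```

Beam pruning always returns the same survivors for the same scores, whereas resampling can keep a low-weight particle or duplicate a high-weight one, which is what lets the population approximate a distribution instead of its mode.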

Open design questions arise from the neighboring literature. Consensus-based multi-sample drafting notes that probabilistic aggregation over a large number of parallel paths incurs non-negligible overhead, so scaling internal particle proposals remains a systems problem (Li et al., 7 Mar 2025). Rollback-aware branch parallelism shows that branch management overhead and offline adaptation modules can matter, especially when draft–target alignment changes across tasks or temperatures (Shen et al., 16 May 2025). Asynchronous multi-branch caching in SSD shows that custom attention masks, fan-out allocation, and fallback strategies become decisive at larger batch sizes (Kumar et al., 3 Mar 2026). A plausible implication is that future SMC-SD systems will need to combine statistical controls such as ESS and divergence-aware weighting with hardware-aware scheduling, prefix sharing, and adaptive branching policies rather than treating decoding quality and serving efficiency as separable concerns.
