
Speculative Sampling with Token-Level Drafting

Updated 5 February 2026
  • The paper introduces a novel acceleration technique that leverages a lightweight draft model and optimal transport for direct token-level verification, maintaining output fidelity.
  • It employs an algorithmic reduction from an intractable OT linear program to a tractable convex optimization, attaining high acceptance rates with low latency per token.
  • The approach is scalable and compatible with hardware optimizations, making it practical for production LLM pipelines with minimal statistical deviation.

Speculative sampling with direct token-level drafting is a family of acceleration techniques for LLM autoregressive decoding, wherein a lightweight draft model proposes multiple future token candidates per iteration, and the full target LLM verifies which candidates align with its own distribution. This paradigm combines efficient approximate drafting with exact or nearly-exact marginalization over the target distribution, yielding significant throughput improvements while provably maintaining output fidelity. The “multi-draft” extension formalizes token-level candidate verification as an instance of optimal transport (OT), and recent breakthroughs have reduced the intractable OT formulations to tractable convex or combinatorial optimization. This approach is now central to modern LLM inference at scale, as it addresses both computational and quality bottlenecks for large-vocabulary, long-context, and production settings.
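As a concrete baseline, classical single-draft speculative sampling ($n = 1$) accepts a drafted token $x \sim q$ with probability $\min(1, p(x)/q(x))$ and otherwise resamples from the normalized residual $(p - q)_+$, which yields an exact sample from $p$. A minimal sketch with toy distributions (the distributions are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p, q):
    """One single-draft speculative sampling step (n = 1).

    Draws x ~ q, accepts with probability min(1, p[x] / q[x]);
    on rejection, resamples from the normalized residual (p - q)_+.
    Either way the returned token is an exact sample from p.
    """
    x = rng.choice(len(q), p=q)
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True                        # draft token accepted
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False

# Toy target and draft distributions over a 3-token vocabulary.
p = np.array([0.8, 0.1, 0.1])
q = np.array([0.2, 0.4, 0.4])

accepted = sum(speculative_step(p, q)[1] for _ in range(10_000))
print(f"empirical acceptance ~ {accepted / 10_000:.3f}")  # exact: 1 - TV(p, q) = 0.4
```

Multi-draft schemes raise this acceptance probability by giving the verifier $n$ chances per step; the OT formulation below characterizes how far that can be pushed.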

1. Problem Formulation and Optimal Transport Foundations

At each decoding step, speculative sampling introduces a small draft model $q$ which, given the current context, proposes an $n$-tuple of candidate tokens $y = (y_1, \ldots, y_n) \sim q^{(n)}$. The main objective is to design a verification distribution $\pi(x \mid y)$ with two key properties: (a) it couples the draft and target distributions, in that the joint $\pi(x, y) = q^{(n)}(y)\,\pi(x \mid y)$ satisfies $\sum_{y} \pi(x, y) = p(x)$ (target marginal) and $\sum_{x} \pi(x, y) = q^{(n)}(y)$ (draft marginal); and (b) it maximizes the probability of accepting a draft token, i.e., that $x$ is one of $(y_1, \ldots, y_n)$.

This framework can be stated as an optimal transport linear program (OT-LP):

$$\begin{aligned} \max_{C_{x,y}} \quad & \sum_{y \in V^n} \sum_{x \in \mathrm{set}(y)} C_{x,y} \\ \text{s.t.} \quad & \sum_{y \in V^n} C_{x,y} = p(x) \quad \forall x \in V \\ & \sum_{x \in V} C_{x,y} = q^{(n)}(y) \quad \forall y \in V^n \\ & C_{x,y} \geq 0 \end{aligned}$$

where $V$ is the vocabulary and $\mathrm{set}(y) = \{y_1, \ldots, y_n\}$. The optimal value yields the maximal multi-draft acceptance rate. However, this LP involves $O(|V|^{n+1})$ variables and becomes computationally infeasible for realistic $|V|$ and moderate $n$ (Thomas et al., 19 Nov 2025; Khisti et al., 2024; Hu et al., 26 Feb 2025; Sun et al., 2023).
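For concreteness, the sketch below solves this OT-LP with a generic solver on a toy three-token vocabulary and $n = 2$ i.i.d. drafts. It instantiates all $O(|V|^{n+1})$ variables explicitly, which is exactly why the brute-force LP only works at toy scale; the distributions are illustrative.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

V, n = 3, 2
p = np.array([0.8, 0.1, 0.1])                     # target distribution
q = np.array([0.2, 0.4, 0.4])                     # draft distribution
tuples = list(itertools.product(range(V), repeat=n))
q_n = np.array([q[a] * q[b] for a, b in tuples])  # i.i.d. draft tuples q^(2)

# Objective: maximize mass on pairs (x, y) with x in set(y).
num_vars = V * len(tuples)
c = np.zeros(num_vars)
for xi in range(V):
    for yi, y in enumerate(tuples):
        if xi in y:
            c[xi * len(tuples) + yi] = -1.0       # linprog minimizes

# Marginal constraints: sum_y C[x, y] = p(x) and sum_x C[x, y] = q^(n)(y).
A_eq, b_eq = [], []
for xi in range(V):
    row = np.zeros(num_vars)
    row[xi * len(tuples):(xi + 1) * len(tuples)] = 1.0
    A_eq.append(row); b_eq.append(p[xi])
for yi in range(len(tuples)):
    row = np.zeros(num_vars)
    row[yi::len(tuples)] = 1.0
    A_eq.append(row); b_eq.append(q_n[yi])

res = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
print(f"optimal two-draft acceptance: {-res.fun:.4f}")  # 0.56 for this toy case
```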

2. Algorithmic Reduction: From OT-LP to Efficient Token-Level Verification

Substantial progress has been made in reducing the complexity of the OT-LP for practical deployment:

  • Relaxed OT-LP and Submodular Dual Reduction: By relaxing the equality constraints to inequalities ($\leq$) and dualizing (using total unimodularity), the dual problem reduces to binary variables associated with token subsets. The acceptance rate becomes $1 + \min_{H \subset V} \psi(H)$ with $\psi(H) = \sum_{x \in H} p(x) - \sum_{y \in H^n} q^{(n)}(y)$, and the minimization leverages the submodular structure of $\psi$ under i.i.d. draft sampling (Thomas et al., 19 Nov 2025; Hu et al., 26 Feb 2025; Khisti et al., 2024); a prefix-scan sketch of this minimization appears after this list.
  • Max-Flow and Polymatroid Interpretation: The relaxed LP is equivalent to a max-flow problem on a bipartite graph linking tokens and draft tuples, and further analysis maps the problem to a convex minimization in $O(|V|)$ variables. Polymatroid theory enables efficient calculation of residuals and blockwise subset structure, which forms the core of the “Global Resolution” algorithm (Thomas et al., 19 Nov 2025).
  • Convex Optimization Block Structure: The optimal solution exhibits block structure: for $x \in H^*$ and $y \in (H^*)^n$, a convex solver (Levenberg–Marquardt or L-BFGS) computes log-space parameters over small truncation subsets $T$, leading to fast inference under practical runtime budgets (Thomas et al., 19 Nov 2025).
  • Approximate Schemes: For larger $n$ or large $|V|$, earlier work proposed efficient greedy or sequential algorithms with multiplicative or additive approximation guarantees; e.g., the $k$-Seq algorithm in SpecTr guarantees at least a $(1 - 1/e)$-fraction of the optimal acceptance probability (Sun et al., 2023).
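Under i.i.d. drafting the second sum in $\psi$ factorizes, $\sum_{y \in H^n} q^{(n)}(y) = (\sum_{x \in H} q(x))^n$, and per step 1 of the workflow in Section 4 the minimizing subset can be found by scanning prefixes of the tokens sorted by decreasing $q(x)/p(x)$. A minimal sketch of that scan, assuming strictly positive $p$:

```python
import numpy as np

def acceptance_rate_iid(p, q, n):
    """Acceptance rate 1 + min_H psi(H) for n i.i.d. drafts, where
    psi(H) = sum_{x in H} p(x) - (sum_{x in H} q(x))**n.
    Scans prefixes of tokens sorted by decreasing q(x)/p(x); the empty
    set contributes psi = 0, capping the rate at 1."""
    order = np.argsort(-(q / p))       # requires p > 0 everywhere
    P = np.cumsum(p[order])            # prefix target mass
    Q = np.cumsum(q[order])            # prefix draft mass
    psi = P - Q**n
    return 1.0 + min(0.0, psi.min())

p = np.array([0.8, 0.1, 0.1])
q = np.array([0.2, 0.4, 0.4])
for n in (1, 2, 4):
    print(f"n = {n}: acceptance = {acceptance_rate_iid(p, q, n):.4f}")
# n = 1 recovers 1 - TV(p, q) = 0.4; n = 2 gives 0.56, matching the toy LP above.
```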

3. Empirical Performance and Trade-offs

The Global Resolution algorithm demonstrates substantial empirical gains:

  • Acceptance Rate Scaling: For typical i.i.d. drafts from a base model with top-$k$ truncation, increasing $n$ from 2 to 5 boosts acceptance rates by 2–4 percentage points per additional draft. Acceptance rates $> 90\%$ are observed with $n = 4$, $k = 1000$ (Thomas et al., 19 Nov 2025).
  • Runtime: General LP/max-flow solvers are intractable beyond $k \sim 100$; the convex reduction approach (with $|T| \sim 50$ for $n = 2$, $|T| \sim 10$ for $n = 4$) gives $< 100$ ms per token, and as low as $< 10$ ms for moderate $n, k$.
  • Quality: Deviation from the target distribution is negligible ($L_1$ error $\lesssim 0.015$ at tolerance $\tau = 10^{-3}$). Hardware measurements confirm full compatibility with LLM serving pipelines.

A selection of empirical results is summarized below:

| Task/Setting | $n$ | $k$  | Acceptance | Overhead (ms/tok) |
|--------------|-----|------|------------|-------------------|
| Llama-3 70B  | 4   | 1000 | 95%        | < 100             |
| Gemma-2 27B  | 2   | 50   | 90%+       | < 10              |

Source: (Thomas et al., 19 Nov 2025)

Increasing $k$ (the number of draft tokens considered per step) initially raises acceptance but exhibits diminishing returns beyond $k \gtrsim 1000$. Tuning $n$ and $k$ for the model and hardware budget is essential to optimize these trade-offs.

4. Practical Implementation and Recommendations

The efficient speculative sampler via Global Resolution follows this workflow (a simplified baseline sampler is sketched after the list):

  1. Sort tokens by decreasing $q(x)/p(x)$ and scan to find the minimizing subset $H^*$.
  2. Compute prefix minima of $\psi$ (for the residuals $p_x$) using the polymatroid structure.
  3. Blockwise convex solve:
    • Inner system: if $y$ lies wholly within $H^*$, minimize $\Theta(\beta)$ over a truncation $T \subset H^*$.
    • Outer system: if $y$ lies outside $H^*$, minimize $\Phi(\alpha)$ over $T \subset V \setminus H^*$.
  4. Build $C_{x,y}$ from the softmax transport with the fitted $\alpha$ or $\beta$.
  5. Sample and accept: for each draft $y$, compute and normalize $\pi(x \mid y)$; accept if $x \in \mathrm{set}(y)$.
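The convex solves in step 3 are specific to Global Resolution and are not reproduced here. As a self-contained reference point, the sketch below implements the simpler sequential-rejection baseline for $n$ i.i.d. drafts: it also returns an exact sample from $p$, but generally accepts less often than the OT-optimal transport.

```python
import numpy as np

rng = np.random.default_rng(1)

def sequential_multidraft_step(p, q, n):
    """Sequential-rejection verification of n i.i.d. drafts from q.

    Each draft is tested against the current residual target with the
    standard min(1, residual/q) rule; on rejection the residual is
    renormalized. A baseline scheme, not the OT-optimal transport."""
    drafts = rng.choice(len(q), size=n, p=q)
    residual = p.astype(float).copy()
    for i, x in enumerate(drafts):
        if rng.random() < min(1.0, residual[x] / q[x]):
            return x, i + 1                       # accepted the (i+1)-th draft
        residual = np.maximum(residual - q, 0.0)
        residual /= residual.sum()
    return rng.choice(len(p), p=residual), 0      # all n drafts rejected

p = np.array([0.8, 0.1, 0.1])
q = np.array([0.2, 0.4, 0.4])
trials = [sequential_multidraft_step(p, q, n=2) for _ in range(20_000)]
print(f"baseline acceptance ~ {np.mean([t[1] > 0 for t in trials]):.3f}")
```

For these toy distributions the baseline accepts with probability $0.4 + 0.6 \cdot 0.2 = 0.52$, versus the OT optimum of $0.56$; closing that gap is what the transport plan $C_{x,y}$ buys.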

Tuning guidelines (Thomas et al., 19 Nov 2025), collected in an illustrative configuration sketch after this list:

  • $n = 3$–$5$ achieves the best overhead/acceptance trade-off in most settings.
  • $k = 100$–$1000$ yields near-saturating acceptance for moderate $n$.
  • A solver tolerance of $\tau = 10^{-3}$ gives $< 100$ ms latency with negligible statistical deviation.
  • Empirically, $\geq 90\%$ of tokens can be sampled in milliseconds on modern hardware.
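A hypothetical configuration object collecting these knobs (the names are illustrative, not from a released implementation):

```python
from dataclasses import dataclass

@dataclass
class MultiDraftConfig:
    """Illustrative defaults following the tuning guidelines above."""
    n_drafts: int = 4          # n = 3-5: best overhead/acceptance trade-off
    top_k: int = 500           # k = 100-1000: near-saturating acceptance
    solver_tol: float = 1e-3   # tau: < 100 ms/token, negligible L1 deviation
```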

The method is compatible with hardware optimizations (quantization, multi-query attention), multi-token block or tree drafting, and can be integrated into both research and production LLM pipelines.

5. Connections to Prior Work and Theoretical Limits

The multi-draft OT framework encompasses and extends previous speculative sampling schemes: classical single-draft speculative sampling is the $n = 1$ special case, and greedy or sequential multi-draft verifiers correspond to feasible but generally suboptimal transport plans.

The achievable acceptance rate is a function of both the overlap of $p$ and $q$ and the diversity of the draft candidates. The maximal theoretical acceptance for a given $p, q, n$ is determined by the optimal transport solution; practical algorithms now achieve acceptance rates within 1–2 points of this optimum (Thomas et al., 19 Nov 2025; Hu et al., 26 Feb 2025).
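In the single-draft case, the OT optimum reduces to the classical maximal-coupling acceptance probability, which makes the dependence on overlap explicit:

$$\alpha^*(p, q) \;=\; \sum_{x \in V} \min\bigl(p(x), q(x)\bigr) \;=\; 1 - d_{\mathrm{TV}}(p, q),$$

so all multi-draft gains ($n > 1$) come from covering the total-variation gap with additional, diverse candidates.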

6. Limitations and Future Directions

Despite advances, several open challenges and limitations remain:

  • Scaling to ultra-large vocabularies: While the convex reduction is efficient, very large $|V|$ or large $n$ can still challenge memory and compute budgets.
  • Adaptive/heterogeneous draft schemes: Current theoretical guarantees assume i.i.d. draft samples; heterogeneous or context-dependent drafting requires further generalization, possibly combining context clustering as in DynaSpec or output-space pruning (Zhang et al., 11 Oct 2025).
  • Integration with hardware-aware pipelines: Realizing the theoretical gains in end-to-end systems requires parallelization and efficient kernel implementation, especially for draft model heads and verification (Zhao et al., 20 Feb 2025, Zhang et al., 11 Oct 2025).
  • Multi-step/tree-topology decoding: Extending OT-based optimality beyond single-step token-level drafting to tree or sequence-level speculative sampling is an active direction (Huang et al., 4 Jun 2025, Li et al., 7 Mar 2025).
  • Sparsity and sample efficiency: For extreme shortlists or pruned drafters, out-of-vocabulary token handling is required; recent methods employing token-affinity redistribution address this (Timor et al., 2 Jun 2025).

Continued progress is expected from further convex reductions, adaptive candidate selection, improved draft modeling (e.g., feature-coherent or position-specialist drafts), and flexible integration with production LLM architectures.


For foundational detail and state-of-the-art methodology, see "Global Resolution: Optimal Multi-Draft Speculative Sampling via Convex Minimization" (Thomas et al., 19 Nov 2025), as well as theoretical analyses and efficient subset-reduction in (Hu et al., 26 Feb 2025, Khisti et al., 2024, Sun et al., 2023).
