
Speculative Sampling with Token-Level Drafting

Updated 5 February 2026
  • The paper introduces a novel acceleration technique that leverages a lightweight draft model and optimal transport for direct token-level verification, maintaining output fidelity.
  • It employs an algorithmic reduction from an intractable OT linear program to a tractable convex optimization, attaining high acceptance rates with low latency per token.
  • The approach is scalable and compatible with hardware optimizations, making it practical for production LLM pipelines with minimal statistical deviation.

Speculative sampling with direct token-level drafting is a family of acceleration techniques for LLM autoregressive decoding, wherein a lightweight draft model proposes multiple future token candidates per iteration, and the full target LLM verifies which candidates align with its own distribution. This paradigm combines efficient approximate drafting with exact or nearly-exact marginalization over the target distribution, yielding significant throughput improvements while provably maintaining output fidelity. The “multi-draft” extension formalizes token-level candidate verification as an instance of optimal transport (OT), and recent breakthroughs have reduced the intractable OT formulations to tractable convex or combinatorial optimization. This approach is now central to modern LLM inference at scale, as it addresses both computational and quality bottlenecks for large-vocabulary, long-context, and production settings.
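As a concrete baseline, classical single-draft speculative sampling ($n = 1$) accepts a drafted token $x \sim q$ with probability $\min(1, p(x)/q(x))$ and otherwise resamples from the normalized residual $(p - q)_+$, which yields an exact sample from $p$. A minimal sketch with toy distributions (the distributions are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p, q):
    """One single-draft speculative sampling step (n = 1).

    Draws x ~ q, accepts with probability min(1, p[x] / q[x]);
    on rejection, resamples from the normalized residual (p - q)_+.
    Either way the returned token is an exact sample from p.
    """
    x = rng.choice(len(q), p=q)
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True                        # draft token accepted
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False

# Toy target and draft distributions over a 3-token vocabulary.
p = np.array([0.8, 0.1, 0.1])
q = np.array([0.2, 0.4, 0.4])

accepted = sum(speculative_step(p, q)[1] for _ in range(10_000))
print(f"empirical acceptance ~ {accepted / 10_000:.3f}")  # exact: 1 - TV(p, q) = 0.4
```

Multi-draft schemes raise this acceptance probability by giving the verifier $n$ chances per step; the OT formulation below characterizes how far that can be pushed.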

1. Problem Formulation and Optimal Transport Foundations

At each decoding step, speculative sampling introduces a small draft model $q$ which, given the current context, proposes an $n$-tuple of candidate tokens $y = (y_1, \ldots, y_n) \sim q^{(n)}$. The main objective is to design a verification distribution $\pi(x \mid y)$ with two key properties: (a) it couples the draft and target distributions, in that the joint $\pi(x, y) = q^{(n)}(y)\,\pi(x \mid y)$ satisfies $\sum_{y} \pi(x, y) = p(x)$ (target marginal) and $\sum_{x} \pi(x, y) = q^{(n)}(y)$ (draft marginal); and (b) it maximizes the probability of accepting a draft token, i.e., that $x$ is one of $(y_1, \ldots, y_n)$.

This framework can be stated as an optimal transport linear program (OT-LP):

$$\begin{aligned} \max_{C_{x,y}} \quad & \sum_{y \in V^n} \sum_{x \in \mathrm{set}(y)} C_{x,y} \\ \text{s.t.} \quad & \sum_{y \in V^n} C_{x,y} = p(x) \quad \forall x \in V \\ & \sum_{x \in V} C_{x,y} = q^{(n)}(y) \quad \forall y \in V^n \\ & C_{x,y} \geq 0 \end{aligned}$$

where $V$ is the vocabulary and $\mathrm{set}(y) = \{y_1, \ldots, y_n\}$. The optimal value yields the maximal multi-draft acceptance rate. However, this LP involves $O(|V|^{n+1})$ variables and becomes computationally infeasible for realistic $|V|$ and moderate $n$ (Thomas et al., 19 Nov 2025; Khisti et al., 2024; Hu et al., 26 Feb 2025; Sun et al., 2023).
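For concreteness, the sketch below solves this OT-LP with a generic solver on a toy three-token vocabulary and $n = 2$ i.i.d. drafts. It instantiates all $O(|V|^{n+1})$ variables explicitly, which is exactly why the brute-force LP only works at toy scale; the distributions are illustrative.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

V, n = 3, 2
p = np.array([0.8, 0.1, 0.1])                     # target distribution
q = np.array([0.2, 0.4, 0.4])                     # draft distribution
tuples = list(itertools.product(range(V), repeat=n))
q_n = np.array([q[a] * q[b] for a, b in tuples])  # i.i.d. draft tuples q^(2)

# Objective: maximize mass on pairs (x, y) with x in set(y).
num_vars = V * len(tuples)
c = np.zeros(num_vars)
for xi in range(V):
    for yi, y in enumerate(tuples):
        if xi in y:
            c[xi * len(tuples) + yi] = -1.0       # linprog minimizes

# Marginal constraints: sum_y C[x, y] = p(x) and sum_x C[x, y] = q^(n)(y).
A_eq, b_eq = [], []
for xi in range(V):
    row = np.zeros(num_vars)
    row[xi * len(tuples):(xi + 1) * len(tuples)] = 1.0
    A_eq.append(row); b_eq.append(p[xi])
for yi in range(len(tuples)):
    row = np.zeros(num_vars)
    row[yi::len(tuples)] = 1.0
    A_eq.append(row); b_eq.append(q_n[yi])

res = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
print(f"optimal two-draft acceptance: {-res.fun:.4f}")  # 0.56 for this toy case
```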

2. Algorithmic Reduction: From OT-LP to Efficient Token-Level Verification

Substantial progress has been made in reducing the complexity of the OT-LP for practical deployment:

  • Relaxed OT-LP and Submodular Dual Reduction: By relaxing the equality constraints to inequalities ($\leq$) and dualizing (using total unimodularity), the dual problem reduces to binary variables associated with token subsets. The acceptance rate becomes $1 + \min_{H \subset V} \psi(H)$ with $\psi(H) = \sum_{x \in H} p(x) - \sum_{y \in H^n} q^{(n)}(y)$, and the minimization leverages the submodular structure of $\psi$ under i.i.d. draft sampling (Thomas et al., 19 Nov 2025; Hu et al., 26 Feb 2025; Khisti et al., 2024); a prefix-scan sketch of this minimization appears after this list.
  • Max-Flow and Polymatroid Interpretation: The relaxed LP is equivalent to a max-flow problem on a bipartite graph linking tokens and draft tuples, and further analysis maps the problem to a convex minimization in $O(|V|)$ variables. Polymatroid theory enables efficient calculation of residuals and blockwise subset structure, which forms the core of the “Global Resolution” algorithm (Thomas et al., 19 Nov 2025).
  • Convex Optimization Block Structure: The optimal solution exhibits block structure: for $x \in H^*$ and $y \in (H^*)^n$, a convex solver (Levenberg–Marquardt or L-BFGS) computes log-space parameters over small truncation subsets $T$, leading to fast inference under practical runtime budgets (Thomas et al., 19 Nov 2025).
  • Approximate Schemes: For larger $n$ or large $|V|$, earlier work proposed efficient greedy or sequential algorithms with multiplicative or additive approximation guarantees; e.g., the $k$-Seq algorithm in SpecTr guarantees at least a $(1 - 1/e)$-fraction of the optimal acceptance probability (Sun et al., 2023).
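Under i.i.d. drafting the second sum in $\psi$ factorizes, $\sum_{y \in H^n} q^{(n)}(y) = (\sum_{x \in H} q(x))^n$, and per step 1 of the workflow in Section 4 the minimizing subset can be found by scanning prefixes of the tokens sorted by decreasing $q(x)/p(x)$. A minimal sketch of that scan, assuming strictly positive $p$:

```python
import numpy as np

def acceptance_rate_iid(p, q, n):
    """Acceptance rate 1 + min_H psi(H) for n i.i.d. drafts, where
    psi(H) = sum_{x in H} p(x) - (sum_{x in H} q(x))**n.
    Scans prefixes of tokens sorted by decreasing q(x)/p(x); the empty
    set contributes psi = 0, capping the rate at 1."""
    order = np.argsort(-(q / p))       # requires p > 0 everywhere
    P = np.cumsum(p[order])            # prefix target mass
    Q = np.cumsum(q[order])            # prefix draft mass
    psi = P - Q**n
    return 1.0 + min(0.0, psi.min())

p = np.array([0.8, 0.1, 0.1])
q = np.array([0.2, 0.4, 0.4])
for n in (1, 2, 4):
    print(f"n = {n}: acceptance = {acceptance_rate_iid(p, q, n):.4f}")
# n = 1 recovers 1 - TV(p, q) = 0.4; n = 2 gives 0.56, matching the toy LP above.
```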

3. Empirical Performance and Trade-offs

The Global Resolution algorithm demonstrates substantial empirical gains:

  • Acceptance Rate Scaling: For typical i.i.d. drafts from a base model with top-$k$ truncation, increasing $n$ from 2 to 5 boosts acceptance rates by 2–4 percentage points per additional draft. Acceptance rates $> 90\%$ are observed with $n = 4$, $k = 1000$ (Thomas et al., 19 Nov 2025).
  • Runtime: General LP/max-flow solvers are intractable beyond $k \sim 100$; the convex reduction approach (with $|T| \sim 50$ for $n = 2$, $|T| \sim 10$ for $n = 4$) gives $< 100$ ms per token, and as low as $< 10$ ms for moderate $n, k$.
  • Quality: Deviation from the target distribution is negligible ($L_1$ error $\lesssim 0.015$ at tolerance $\tau = 10^{-3}$). Hardware measurements confirm full compatibility with LLM serving pipelines.

A selection of empirical results is summarized below:

| Task/Setting | $n$ | $k$  | Acceptance | Overhead (ms/tok) |
|--------------|-----|------|------------|-------------------|
| Llama-3 70B  | 4   | 1000 | 95%        | < 100             |
| Gemma-2 27B  | 2   | 50   | 90%+       | < 10              |

Source: (Thomas et al., 19 Nov 2025)

Increasing $k$ (the number of draft tokens considered per step) initially raises acceptance but exhibits diminishing returns beyond $k \gtrsim 1000$. Tuning $n$ and $k$ for the model and hardware budget is essential to optimize these trade-offs.

4. Practical Implementation and Recommendations

The efficient speculative sampler via Global Resolution follows this workflow (a simplified baseline sampler is sketched after the list):

  1. Sort tokens by decreasing $q(x)/p(x)$ and scan to find the minimizing subset $H^*$.
  2. Compute prefix minima of $\psi$ (for the residuals $p_x$) using the polymatroid structure.
  3. Blockwise convex solve:
    • Inner system: if $y$ lies wholly within $H^*$, minimize $\Theta(\beta)$ over a truncation $T \subset H^*$.
    • Outer system: if $y$ lies outside $H^*$, minimize $\Phi(\alpha)$ over $T \subset V \setminus H^*$.
  4. Build $C_{x,y}$ from the softmax transport with the fitted $\alpha$ or $\beta$.
  5. Sample and accept: for each draft $y$, compute and normalize $\pi(x \mid y)$; accept if $x \in \mathrm{set}(y)$.
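The convex solves in step 3 are specific to Global Resolution and are not reproduced here. As a self-contained reference point, the sketch below implements the simpler sequential-rejection baseline for $n$ i.i.d. drafts: it also returns an exact sample from $p$, but generally accepts less often than the OT-optimal transport.

```python
import numpy as np

rng = np.random.default_rng(1)

def sequential_multidraft_step(p, q, n):
    """Sequential-rejection verification of n i.i.d. drafts from q.

    Each draft is tested against the current residual target with the
    standard min(1, residual/q) rule; on rejection the residual is
    renormalized. A baseline scheme, not the OT-optimal transport."""
    drafts = rng.choice(len(q), size=n, p=q)
    residual = p.astype(float).copy()
    for i, x in enumerate(drafts):
        if rng.random() < min(1.0, residual[x] / q[x]):
            return x, i + 1                       # accepted the (i+1)-th draft
        residual = np.maximum(residual - q, 0.0)
        residual /= residual.sum()
    return rng.choice(len(p), p=residual), 0      # all n drafts rejected

p = np.array([0.8, 0.1, 0.1])
q = np.array([0.2, 0.4, 0.4])
trials = [sequential_multidraft_step(p, q, n=2) for _ in range(20_000)]
print(f"baseline acceptance ~ {np.mean([t[1] > 0 for t in trials]):.3f}")
```

For these toy distributions the baseline accepts with probability $0.4 + 0.6 \cdot 0.2 = 0.52$, versus the OT optimum of $0.56$; closing that gap is what the transport plan $C_{x,y}$ buys.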

Tuning guidelines (Thomas et al., 19 Nov 2025), collected in an illustrative configuration sketch after this list:

  • $n = 3$–$5$ achieves the best overhead/acceptance trade-off in most settings.
  • $k = 100$–$1000$ yields near-saturating acceptance for moderate $n$.
  • A solver tolerance of $\tau = 10^{-3}$ gives $< 100$ ms latency with negligible statistical deviation.
  • Empirically, $\geq 90\%$ of tokens can be sampled in milliseconds on modern hardware.
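A hypothetical configuration object collecting these knobs (the names are illustrative, not from a released implementation):

```python
from dataclasses import dataclass

@dataclass
class MultiDraftConfig:
    """Illustrative defaults following the tuning guidelines above."""
    n_drafts: int = 4          # n = 3-5: best overhead/acceptance trade-off
    top_k: int = 500           # k = 100-1000: near-saturating acceptance
    solver_tol: float = 1e-3   # tau: < 100 ms/token, negligible L1 deviation
```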

The method is compatible with hardware optimizations (quantization, multi-query attention), multi-token block or tree drafting, and can be integrated into both research and production LLM pipelines.

5. Connections to Prior Work and Theoretical Limits

The multi-draft OT framework encompasses and extends previous speculative sampling schemes: classical single-draft speculative sampling is the $n = 1$ special case, and greedy or sequential multi-draft verifiers correspond to feasible but generally suboptimal transport plans.

The achievable acceptance rate is a function of both the overlap of $p$ and $q$ and the diversity of the draft candidates. The maximal theoretical acceptance for a given $p, q, n$ is determined by the optimal transport solution; practical algorithms now achieve acceptance rates within 1–2 points of this optimum (Thomas et al., 19 Nov 2025; Hu et al., 26 Feb 2025).
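In the single-draft case, the OT optimum reduces to the classical maximal-coupling acceptance probability, which makes the dependence on overlap explicit:

$$\alpha^*(p, q) \;=\; \sum_{x \in V} \min\bigl(p(x), q(x)\bigr) \;=\; 1 - d_{\mathrm{TV}}(p, q),$$

so all multi-draft gains ($n > 1$) come from covering the total-variation gap with additional, diverse candidates.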

6. Limitations and Future Directions

Despite advances, several open challenges and limitations remain:

  • Scaling to ultra-large vocabularies: While the convex reduction is efficient, very large $|V|$ or large $n$ can still challenge memory and compute budgets.
  • Adaptive/heterogeneous draft schemes: Current theoretical guarantees assume i.i.d. draft samples; heterogeneous or context-dependent drafting requires further generalization, possibly combining context clustering as in DynaSpec or output-space pruning (Zhang et al., 11 Oct 2025).
  • Integration with hardware-aware pipelines: Realizing the theoretical gains in end-to-end systems requires parallelization and efficient kernel implementation, especially for draft model heads and verification (Zhao et al., 20 Feb 2025, Zhang et al., 11 Oct 2025).
  • Multi-step/tree-topology decoding: Extending OT-based optimality beyond single-step token-level drafting to tree or sequence-level speculative sampling is an active direction (Huang et al., 4 Jun 2025, Li et al., 7 Mar 2025).
  • Sparsity and sample efficiency: For extreme shortlists or pruned drafters, out-of-vocabulary token handling is required; recent methods employing token-affinity redistribution address this (Timor et al., 2 Jun 2025).

Continued progress is expected from further convex reductions, adaptive candidate selection, improved draft modeling (e.g., feature-coherent or position-specialist drafts), and flexible integration with production LLM architectures.


For foundational detail and state-of-the-art methodology, see "Global Resolution: Optimal Multi-Draft Speculative Sampling via Convex Minimization" (Thomas et al., 19 Nov 2025), as well as theoretical analyses and efficient subset-reduction in (Hu et al., 26 Feb 2025, Khisti et al., 2024, Sun et al., 2023).
