Staged Speculative Decoding in LLMs

Updated 8 October 2025
  • Staged speculative decoding is a high-efficiency inference strategy for autoregressive LLMs that decomposes generation into coordinated speculation and verification stages.
  • It uses a tree-structured candidate batching method to enhance parallelism, reduce memory bandwidth, and accelerate token generation.
  • Empirical results show up to 3.16× speedup with greedy decoding while preserving output fidelity, making it especially effective for on-device applications.

Staged speculative decoding is a high-efficiency inference strategy for autoregressive LLMs that decomposes generation into multiple coordinated speculation and verification phases, with the goal of maximizing parallelism, minimizing memory bandwidth consumption, and preserving model output quality. This framework generalizes classical speculative decoding by adding structural and algorithmic innovations—such as tree-structured candidate generation and multi-level speculation—yielding substantial gains in throughput and hardware utilization, particularly in small-batch and on-device scenarios (Spector et al., 2023).

1. Core Algorithmic Components and Process

In staged speculative decoding, the inference workflow consists of several key phases, each involving distinct model components:

  1. Draft Candidate Generation (Stage 1): A fast, lightweight draft model proposes multiple future candidate tokens based on the current context. Rather than emitting a single sequential candidate, the draft organizes candidates in a tree structure, exploring several plausible continuations per decoding step.
  2. Speculation on Draft Model (Stage 2, “Staged Speculation”): The draft model itself is accelerated using an even simpler model (e.g., a Katz backoff n-gram model) via another round of speculative decoding. This recursive application (the “staged” structure) lets the draft produce its proposals in a few batched passes instead of many slow, memory-bound, token-by-token computations.
  3. Oracle Verification: The original, large LLM (the “oracle”) checks all drafted tokens in a single batched forward pass. Drafted tokens that are consistent with the oracle’s predictions (exact matches under greedy decoding, or tokens accepted by the standard speculative-sampling rule when sampling) are accepted in bulk, reducing the number of sequential oracle calls while preserving the target model's output distribution. A minimal code sketch of one such round follows this list.
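The sketch below is illustrative rather than the paper's implementation: it shows one greedy speculate-and-verify round with a linear (non-tree) draft, where `draft_next` and `oracle_next` are hypothetical callables standing in for the draft and oracle models.

```python
# Minimal sketch of one speculate-and-verify round (greedy case).
# `draft_next` and `oracle_next` are hypothetical stand-ins for the two models:
# each maps a token sequence to that model's greedy next token.

def speculate_and_verify(context, draft_next, oracle_next, lookahead=4):
    # Stage 1: the cheap draft model proposes a short continuation, token by token.
    proposed, seq = [], list(context)
    for _ in range(lookahead):
        tok = draft_next(seq)
        proposed.append(tok)
        seq.append(tok)

    # Oracle verification: conceptually a single batched forward pass over all
    # proposals; emulated position by position here for clarity.
    accepted = []
    for tok in proposed:
        oracle_tok = oracle_next(list(context) + accepted)
        accepted.append(oracle_tok)   # the oracle's token is always kept
        if oracle_tok != tok:         # first disagreement ends the round
            break
    return accepted

# Toy usage with stand-in "models" that just look at sequence length.
print(speculate_and_verify([1, 2, 3],
                           draft_next=lambda s: len(s) % 7,
                           oracle_next=lambda s: len(s) % 5))
```

When the draft and oracle agree often, each oracle call emits several tokens instead of one, which is the source of the speedup discussed below.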

The following schematic illustrates the tree structure:

        (root)
        /    \
      t1      t2
     /  \    /  \
  t1a  t1b t2a  t2b

Each branch expands with multiple alternatives, forming a batch of candidate continuations that can be efficiently verified in parallel.
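As an illustration of how such a tree might be represented, the sketch below builds a two-level candidate tree from a hypothetical `draft_topk(prefix, k)` callable that returns the draft model's k most likely next tokens; the data structure is an assumption for exposition, not the paper's implementation.

```python
# Minimal sketch of tree-structured candidate generation.
# `draft_topk(prefix, k)` is a hypothetical callable returning the draft model's
# k most likely next tokens for the given prefix.

from dataclasses import dataclass, field

@dataclass
class Node:
    token: object                       # None at the root
    children: list = field(default_factory=list)

def build_candidate_tree(prefix, draft_topk, branching=(2, 2)):
    """Expand branching[d] alternatives at depth d (breadth over depth)."""
    root = Node(token=None)
    frontier = [(root, list(prefix))]
    for width in branching:
        next_frontier = []
        for node, path in frontier:
            for tok in draft_topk(path, width):
                child = Node(token=tok)
                node.children.append(child)
                next_frontier.append((child, path + [tok]))
        frontier = next_frontier
    return root

# Toy usage: a fake draft model that proposes the k "next integers".
tree = build_candidate_tree([0], lambda path, k: [path[-1] + i + 1 for i in range(k)])
print([c.token for c in tree.children])   # first-level alternatives, e.g. [1, 2]
```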

2. Tree-Structured Speculative Batching

A principal innovation is restructuring the speculative batch as a tree. Traditional speculative decoding extends a single candidate sequence, but because the probability that the draft and the oracle agree on every token decays exponentially with sequence length, long flat sequences yield diminishing returns. The tree-structured batch addresses this by:

  • Increasing expected yield per batch: By branching alternatives at earlier positions (breadth over depth), the system gains more correct tokens per verification cycle, boosting accepted batch size.
  • Efficient computation allocation: Processing within the tree allows inner nodes to be computed once, while branching early covers more plausible output regions without redundant draft model executions for each extension.
  • Parallelism and hardware compatibility: Multiple leaf nodes (candidate sequences) can be verified together, improving compatibility with hardware accelerators that prefer wider, batched operations.
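To make the parallel verification concrete, the following sketch flattens a candidate tree (using the `Node` structure from the previous sketch) into one batch with parent indices and a tree attention mask; the specific batching and masking scheme here is an assumption for illustration, not necessarily how the paper's kernels are organized.

```python
# Minimal sketch of verifying a whole candidate tree in one batch.
# Works on any node objects exposing `.token` and `.children` (e.g. the Node class
# from the previous sketch). Each node appears once in the batch, and `parents`
# lets us build a mask so each position attends only to itself and its ancestors.

def flatten_tree(root):
    tokens, parents = [], []
    stack = [(child, -1) for child in reversed(root.children)]
    while stack:
        node, parent_idx = stack.pop()
        idx = len(tokens)
        tokens.append(node.token)
        parents.append(parent_idx)
        for child in reversed(node.children):
            stack.append((child, idx))
    return tokens, parents

def tree_attention_mask(parents):
    # mask[i][j] is True iff batch position i may attend to position j
    # (i.e. j is i itself or one of i's ancestors in the candidate tree).
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask
```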

The expected number of tokens accepted per batch for a tree with branching factor $b$ and depth $d$ can be approximated as:

$$E[T] \approx \sum_{k=1}^{d} P(\text{correct up to level } k) \cdot \big(\text{number of leaf nodes at level } k\big)$$

using independence assumptions for token agreement at each branch point.
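As a toy numerical illustration (with an assumed, made-up per-token agreement probability, not a number from the paper), the snippet below contrasts the expected accepted tokens for a flat six-token chain against a small two-level tree with the same number of candidate positions:

```python
# Toy calculation only: assumes each drafted token independently agrees with the
# oracle with probability p (an arbitrary illustrative value, not from the paper).
# Both shapes below put 6 candidate tokens into the verification batch.

p = 0.5

# Flat chain of 6 drafted tokens: expected accepted = sum_k P(first k all agree).
chain = sum(p ** k for k in range(1, 7))

# Two-level tree with 2 alternatives per level (2 + 4 = 6 nodes): a level "survives"
# if at least one of its 2 branches agrees, i.e. with probability q = 1 - (1 - p)**2.
q = 1 - (1 - p) ** 2
tree = q + q * q

print(f"flat chain: {chain:.3f} expected tokens, tree: {tree:.3f} expected tokens")
```

Under these assumed numbers the tree yields roughly 1.31 expected tokens versus about 0.98 for the flat chain, illustrating the "breadth over depth" argument above.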

3. Recursive (“Staged”) Speculation and Draft Model Acceleration

A further advancement is the application of speculative decoding recursively on the draft model (“staged” speculation):

  • Draft Acceleration: As batch size or tree width grows, the speculative step can become dominated by the cost of running the draft model rather than the oracle. To overcome this, a second speculative stage is introduced in which the draft model is itself accelerated by an even simpler draft (e.g., a statistical n-gram language model), yielding speculative candidates that populate the upper levels of the tree.
  • Batch-wise execution: In practice, this allows the draft to perform a small number of batched forward passes (equal to tree depth), instead of sequentially processing every candidate sequence.

This hierarchical approach enables deeper candidate exploration without linearly increasing compute or memory bandwidth usage.
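A schematic of the recursion, assuming hypothetical greedy next-token callables `ngram_next`, `draft_next`, and `oracle_next` for the three models (an expository sketch, not the paper's code), could look like:

```python
# Minimal sketch of two-level ("staged") speculation with greedy decoding.
# `ngram_next`, `draft_next`, and `oracle_next` are hypothetical callables mapping
# a token sequence to each model's greedy next token.

def speculative_round(context, propose_next, verify_next, lookahead):
    """Generic greedy speculate-and-verify round; returns the accepted tokens."""
    proposed, seq = [], list(context)
    for _ in range(lookahead):
        tok = propose_next(seq)
        proposed.append(tok)
        seq.append(tok)
    accepted = []
    for tok in proposed:
        check = verify_next(list(context) + accepted)
        accepted.append(check)         # the verifier's token is always kept
        if check != tok:
            break
    return accepted

def staged_round(context, ngram_next, draft_next, oracle_next):
    # Stage A: the n-gram model drafts ahead; the small draft model verifies,
    # producing a block of draft tokens cheaply (batched in a real system).
    draft_block = speculative_round(context, ngram_next, draft_next, lookahead=8)
    # Stage B: that cheaply produced block is then verified by the oracle.
    proposals = iter(draft_block)
    return speculative_round(context,
                             propose_next=lambda seq: next(proposals),
                             verify_next=oracle_next,
                             lookahead=len(draft_block))
```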

4. Quantitative Performance Impact and Trade-offs

Staged speculative decoding demonstrates strong empirical gains:

Setting                   Speedup vs. Baseline   Relative Bandwidth Usage
Greedy (deterministic)    3.16×                  0.23
Top-k sampling (k=50)     1.98×                  -
  • Throughput: For a 762M parameter GPT-2 Large oracle model, speedups of 3.16× (greedy) and 1.98× (top-k sampling) are achieved, with output quality perfectly preserved.
  • Bandwidth efficiency: Memory bandwidth demand is notably reduced versus both baseline and conventional speculative decoding (standard speculation uses 0.31 bandwidth relative to baseline, tree restructuring reduces this to 0.23).
  • Output quality: Every output token is verified by the oracle, guaranteeing that output distribution matches standard autoregressive decoding.
  • Scalability: The approach can potentially be extended to even more stages (e.g., 20B → 1B → 50M → n-gram) as model sizes increase.
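To see why accepting several tokens per oracle call translates into speedups of this kind, a rough cost model (with assumed, illustrative constants rather than the paper's measurements) is sketched below:

```python
# Back-of-envelope latency model, illustrative only: all constants are assumptions,
# not measurements from the paper.

c_oracle = 1.0          # cost of one memory-bound oracle forward pass (arbitrary units)
c_draft = 0.05          # cost of one draft forward pass
draft_passes = 4        # draft forward passes per verification round (≈ tree depth)
tokens_per_round = 2.5  # average tokens accepted per oracle verification

baseline_cost_per_token = c_oracle                      # one oracle pass per token
staged_cost_per_token = (c_oracle + draft_passes * c_draft) / tokens_per_round

print(f"estimated speedup ≈ {baseline_cost_per_token / staged_cost_per_token:.2f}x")
# -> roughly 2x under these made-up numbers; the paper reports up to 3.16x.
```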

5. Implementation Considerations and System Design

To implement staged speculative decoding efficiently:

  • Tree construction: Candidates should be scored and batched according to the joint probability of agreement up to each level; branching factor and tree depth must be chosen to match hardware and memory constraints.
  • Stage coordination: Each speculative stage must ensure that downstream stages (and the final oracle check) can deterministically verify whether candidate tokens conform to the oracle’s next-token distribution.
  • Memory management: For on-device scenarios, memory use is a primary consideration; the method is effective at reducing bandwidth even for single-batch sizes.
  • Orthogonality: Staged speculative decoding is compatible with other acceleration methods (e.g., weight quantization, FlashAttention), and such combinations could yield further performance gains.
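One way to realize the "score and batch by joint probability" point above (a heuristic sketch under assumed interfaces, not the paper's exact procedure) is a best-first expansion of the candidate tree, where `draft_topk_with_probs` is a hypothetical callable returning (token, probability) pairs from the draft model:

```python
# Heuristic sketch: grow the candidate set best-first by joint draft probability
# until a node budget (the verification batch size) is exhausted.
# `draft_topk_with_probs(prefix, k)` is a hypothetical callable returning the
# draft model's top-k (token, probability) pairs for the given prefix.

import heapq

def select_candidates(prefix, draft_topk_with_probs, budget=8, max_depth=3, k=3):
    frontier = [(-1.0, [])]           # (negative joint probability, token path)
    kept = []
    while frontier and len(kept) < budget:
        neg_joint, path = heapq.heappop(frontier)
        if path:                      # skip the empty root path
            kept.append((path, -neg_joint))
        if len(path) < max_depth:
            for tok, prob in draft_topk_with_probs(prefix + path, k):
                heapq.heappush(frontier, (neg_joint * prob, path + [tok]))
    return kept                       # candidate paths with joint draft probabilities
```

Expanding best-first in this way spends the batch budget on high-probability early branches, matching the "breadth over depth" behavior motivated in Section 2.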

6. Application Domains and Broader Implications

Staged speculative decoding is particularly well suited to:

  • On-device LLM inference: The reduction in latency and memory footprint enables real-time, local execution of large models (e.g., on consumer GPUs).
  • Personalized/private models: Supports personalization and private inference scenarios where cloud-based solutions are infeasible.
  • General sequential generation: The approach generalizes to other tasks with autoregressive structure, such as code completion and dialogue.

Longer term, the tree-structured approach and staged speculation open new research directions for:

  • Adaptive and progressive candidate generation: Motivating the design of multi-stage, multi-model candidate generation, potentially with learned or dynamic branch selection.
  • Efficient verification protocols: Inspiring similar hierarchical or tree-based speculation in other sequential decision domains.

7. Summary

Staged speculative decoding is a hierarchical, recursive enhancement of speculative decoding for LLMs, introducing tree-based candidate batching and draft model acceleration. Its design achieves large reductions in decoding latency and memory bandwidth without compromising output fidelity, and is especially impactful for on-device and memory-constrained inference. The framework generalizes well to multi-stage model cascades, offers strong quantitative improvements, and points toward future acceleration strategies for large-scale autoregressive generation (Spector et al., 2023).

References

  1. Spector, B. and Ré, C. (2023). Accelerating LLM Inference with Staged Speculative Decoding.