Sequoia introduces a drop-in replacement for the standard “loop-over-tokens” inference pattern of LLMs. It accelerates generation while preserving the output distribution by (1) speculating a tree of candidate continuations, (2) verifying that tree in one target-model forward pass, and (3) accepting the longest prefix that provably follows the target model's own output distribution. Compared with prior speculative-decoding work (sequence-based drafts, independent sequences, SpecInfer, SpecTr, Medusa, etc.), Sequoia scales to larger speculation budgets, stays robust across sampling temperatures, and chooses tree shapes that map well to real hardware.
Why this matters in practice
- Single-GPU serving: 3–4× end-to-end latency reduction for 7–13 B models on A100/L40 without quality loss.
- CPU↔GPU offloading: up to 10× speed-up (0.56 s/token for Llama-2-70B on an L40 with system memory offload).
- No retraining, distillation, quantization, or model surgery. Sequoia wraps any Hugging Face model.
1. Core ideas & algorithms
1.1 Optimal token-tree construction (dynamic programming)
Given a speculation budget of n nodes and a depth limit d, Sequoia maximises the expected number of accepted tokens per decoding step, where p_i denotes the acceptance probability of a node's i-th ranked child (empirically measured once per draft/target model pair).
Dynamic-programming recurrence (unbounded depth):
c(n) = 1 + max_{a_1 + ... + a_{n-1} = n-1} Σ_{i=1}^{n-1} p_i · c(a_i)    (Eq. 1)

Depth-bounded variant:

c(n, d) = 1 + max_{a_1 + ... + a_{n-1} = n-1} Σ_{i=1}^{n-1} p_i · c(a_i, d-1)
Key takeaway – the expected number of accepted tokens keeps growing as the budget n increases (unbounded growth), whereas previously handcrafted trees plateau.
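The recurrence is cheap to evaluate directly. Below is a minimal memoised sketch of the depth-bounded version; the toy acceptance vector p and the helper names c/split are illustrative assumptions, not Sequoia's actual implementation.

```python
from functools import lru_cache

# Hypothetical acceptance vector: p[i] is the measured probability that a node's
# i-th ranked child is accepted (step ❶ in Section 2 shows one way to estimate it).
p = (0.85, 0.45, 0.20, 0.10, 0.05)

@lru_cache(maxsize=None)
def c(n: int, d: int) -> float:
    """Expected accepted tokens for a subtree with n nodes and depth budget d."""
    if n <= 0:
        return 0.0
    if n == 1 or d <= 1:
        return 1.0                       # a lone node contributes exactly one token
    return 1.0 + split(n - 1, 0, d - 1)  # "+1" is the node itself; the rest goes to children

@lru_cache(maxsize=None)
def split(m: int, i: int, d: int) -> float:
    """Best expected gain from distributing m nodes among children ranked i, i+1, ..."""
    if m <= 0 or i >= len(p):
        return 0.0
    return max(p[i] * c(a, d) + split(m - a, i + 1, d)   # give a nodes to child i
               for a in range(m + 1))

print(c(64, 6))   # expected accepted tokens for a 64-node, depth-6 tree
```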
1.2 Robust sampling + verification
At low temperature, SpecInfer/SpecTr can repeatedly resample the same rejected draft token and stall. Sequoia tweaks the SpecInfer verification loop to sample without replacement:
```python
draft_dist = Q.copy()
rejected = set()
for i in range(k):
    x = sample(draft_dist, without_replacement=True)
    if accept(x, target=P, draft=draft_dist):
        return x                          # accepted
    rejected.add(x)
    # Update the residual target and zero out the rejected token in the draft
    P = normalize(relu(P - draft_dist))
    draft_dist[x] = 0
    if draft_dist.sum() == 0:             # exhausted Q support
        draft_dist = uniform_over(V - rejected)
return sample(P)
```
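For intuition, here is a self-contained NumPy sketch of the same loop on toy distributions; the function name, the helper logic, and the toy P/Q values are illustrative assumptions rather than Sequoia's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_without_replacement(P, Q, k):
    """Toy sketch: propose up to k draft tokens without replacement and verify each."""
    P, Q = P.astype(float).copy(), Q.astype(float).copy()
    rejected, V = set(), len(P)
    for _ in range(k):
        q = Q / Q.sum()                            # renormalised current draft distribution
        x = rng.choice(V, p=q)
        if rng.random() < min(1.0, P[x] / q[x]):   # standard speculative accept test
            return x, True                         # draft token accepted
        rejected.add(x)
        P = np.maximum(P - q, 0.0)                 # residual target distribution
        P = P / P.sum()
        Q[x] = 0.0                                 # never propose a rejected token again
        if Q.sum() == 0.0:                         # draft support exhausted
            Q = np.ones(V)
            Q[list(rejected)] = 0.0
    return rng.choice(V, p=P), False               # fall back to sampling the residual

P = np.array([0.60, 0.25, 0.10, 0.05])             # toy target distribution
Q = np.array([0.10, 0.70, 0.15, 0.05])             # toy draft distribution
print(verify_without_replacement(P, Q, k=3))
```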
Properties
- Optimal-transport property (best possible accept-rate when k=1).
- Cover property (accepts within ≤ k steps when the draft support ⊇ the target support). ⇒ Robust from T = 0.2 to 1.0 and top-p 0.8–1.0.
1.3 Hardware-aware tree optimiser
Verifying n tokens in one forward pass is not constant-time on real GPUs: beyond some point, latency grows with the verified-token count. Measure this once:
```python
def t(n):
    """Relative time of verifying n tokens on the target model."""
    return empirical_measure(n) / empirical_measure(1)
```
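empirical_measure is left abstract above. One simple way to obtain it (a sketch, assuming target is a Hugging Face causal LM already on the GPU) is to time dummy forward passes over a range of token counts:

```python
import time
import torch

@torch.no_grad()
def empirical_measure(n, n_warmup=3, n_repeat=10):
    """Median wall-clock time of one target forward pass over n tokens (toy benchmark)."""
    ids = torch.randint(0, target.config.vocab_size, (1, n), device="cuda")
    for _ in range(n_warmup):                 # warm-up: allocator, kernels, autotuning
        target(ids)
    torch.cuda.synchronize()
    times = []
    for _ in range(n_repeat):
        start = time.perf_counter()
        target(ids)
        torch.cuda.synchronize()              # forward launches are async; wait first
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]

# profile t(n) on your hardware:
# relative_t = {n: empirical_measure(n) / empirical_measure(1)
#               for n in (1, 8, 16, 32, 64, 128, 256)}
```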
Speed-up model: Speedup(n, d) = G(n, d) / (t(n) + d·c), where G(n, d) is the expected accepted-token count from Section 1.1 and c = (draft time per token) / (verify time per token). Grid-search (n, d) using the measured t(·); the sweet spot differs by setup, e.g. n ≈ 64–128 on an A100 vs n ≈ 768 with offloading.
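The grid search itself is a few lines. A sketch, assuming G is the c(n, d) function from Section 1.1 and t is a dict of measured relative verification times (both names are assumptions):

```python
def best_tree_shape(G, t, c_ratio,
                    sizes=(16, 32, 64, 128, 256, 512, 768),
                    depths=range(2, 12)):
    """Pick the (size, depth) pair maximising the modelled speed-up."""
    best = (None, None, 0.0)
    for n in sizes:
        for d in depths:
            speedup = G(n, d) / (t[n] + d * c_ratio)
            if speedup > best[2]:
                best = (n, d, speedup)
    return best

# n_opt, d_opt, predicted = best_tree_shape(G=c, t=relative_t, c_ratio=0.08)
```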
2. Implementing Sequoia step-by-step
❶ Collect acceptance vector
```python
import numpy as np
import torch

def estimate_p(draft, target, dataset, k_max=512, N=200):
    """Estimate the per-rank acceptance probabilities p_1 .. p_{k_max}."""
    counts = np.zeros(k_max)
    totals = np.zeros(k_max)
    for prompt in dataset[:N]:
        toks = tokenizer(prompt)['input_ids']
        with torch.no_grad():
            q = draft(...); p = target(...)          # next-token distributions (elided)
        # compute α(x) = min(1, p/q) per vocab entry; project onto draft ranks
        ratio = torch.minimum(torch.ones_like(p), p / q)
        sorted_idx = torch.argsort(q, descending=True)
        for rank, i in enumerate(sorted_idx[:k_max]):
            totals[rank] += 1
            counts[rank] += ratio[i].item()
    return counts / totals
```
❷ Dynamic-programming tree builder
Pre-compute best_size_depth[n][d] and store the child-size splits so the tree topology can be regenerated once at runtime.
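Building on the memoised c/split sketch from Section 1.1, one way to record and replay the optimal splits (the function names and the nested-dict topology format are illustrative assumptions, not Sequoia's data layout):

```python
def best_split(n, d):
    """Recover the child-subtree sizes achieving c(n, d): returns [a_1, a_2, ...]."""
    if n <= 1 or d <= 1:
        return []
    sizes, m = [], n - 1
    for i in range(len(p)):
        if m <= 0:
            break
        # allocation for child i that reproduces the memoised optimum
        # (ties prefer larger allocations so child ranks stay contiguous)
        best_a = max(range(m, -1, -1),
                     key=lambda a: p[i] * c(a, d - 1) + split(m - a, i + 1, d - 1))
        if best_a == 0:
            break
        sizes.append(best_a)
        m -= best_a
    return sizes

def build_topology(n, d):
    """Nested dict describing the optimal tree: child rank -> subtree."""
    return {rank: build_topology(a, d - 1) for rank, a in enumerate(best_split(n, d))}

# topology = build_topology(n=64, d=6)   # compute once, reuse at every decoding step
```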
❸ CUDA Graphs for deterministic shapes
Once (size, depth) is fixed, capture the target forward pass as a single CUDA graph and pre-allocate the KV cache for the maximum depth.
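A minimal capture/replay sketch with PyTorch's CUDA-graph API, assuming target is a causal LM whose forward runs on static buffers (KV-cache and attention-mask plumbing omitted):

```python
import torch

tree_size = 64                                   # fixed by the tree optimiser
static_ids = torch.zeros(1, tree_size, dtype=torch.long, device="cuda")

# Warm up on a side stream before capture (required for CUDA graphs).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        target(static_ids)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_logits = target(static_ids).logits    # shapes frozen at capture time

def verify(tree_tokens):
    static_ids.copy_(tree_tokens)                # write into the captured input buffer
    graph.replay()                               # re-run the recorded kernels
    return static_logits                         # read from the captured output buffer
```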
❹ Sampling without replacement at GPU speed
Use the exponential-sort (Gumbel-top-k) trick to draw k unique token ids from a categorical distribution in an O(k + V) kernel; wrap it in a CUDA graph as well.
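The trick itself is a few lines in PyTorch: perturb the logits with Gumbel noise and take the top-k, which is distributed as sampling k tokens without replacement (a sketch; the function name is an assumption):

```python
import torch

def gumbel_topk(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Draw k distinct token ids, distributed as sampling without replacement."""
    u = torch.rand_like(logits).clamp_min(1e-10)   # U ~ Uniform(0, 1)
    gumbel = -torch.log(-torch.log(u))             # Gumbel(0, 1) noise
    return torch.topk(logits + gumbel, k).indices

# e.g. draw 8 distinct child candidates from the draft's next-token logits:
# child_ids = gumbel_topk(draft_logits[0, -1], k=8)
```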
❺ Integration into HF generation loop
```python
while not finished:
    if tree_empty:
        # build a fresh tree of candidates via the draft model(s)
        tree_logits = draft.forward(...)
        tree_tokens = sample_k_ary(...)
    # verify the root-to-leaf path with the target model
    t_logits = target.forward(...)
    accept_len = accept_prefix(tree_tokens, t_logits)
    output.extend(tree_tokens[:accept_len])
    # prune the tree; if the entire depth was accepted, clear the tree
```
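accept_prefix is where the verification from Section 1.2 runs. For stochastic sampling it is the without-replacement loop above; for the greedy (temperature-0) case it reduces to an argmax comparison along the chosen branch. A sketch, assuming path_tokens are the drafted tokens along one root-to-leaf path and path_logits[i] is the target's distribution for position i:

```python
import torch

def accept_prefix(path_tokens: torch.Tensor, path_logits: torch.Tensor) -> int:
    """Greedy verification: length of the leading run where draft == target argmax."""
    target_choice = path_logits.argmax(dim=-1)     # target's token at each position
    matches = path_tokens == target_choice
    mismatch = (~matches).nonzero()
    return int(mismatch[0]) if mismatch.numel() else int(matches.numel())
```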
3. Performance & scaling tips
GPU (A100/L40)
- Use fp16 for the draft, bf16/fp16 for the target; draft batch ≈ tree size / depth.
- max_depth ≈ 6–10 keeps draft latency hidden behind target verification.
Offload (FlexGen, vLLM-paged-attention)
- Tree size should saturate PCIe; measure t(n) on your link (often roughly linear for n ≥ 128).
- Keep the draft model resident in GPU memory so its weights do not cross PCIe as well.
Multi-request batching
- Verify pass runs on (batch_size × tree_size) tokens – re-measure t(n).
- Tree optimiser naturally shrinks n when batch grows.
4. Practical trade-offs
Pros
✓ Exact output distribution; safe when deterministic or exact generation is required.
✓ Compatible with quantised / pruned / MoE targets (only logits are needed).
✓ Orthogonal to distillation or activation sparsity.
Cons / considerations
✗ Extra engineering to collect the acceptance vector and tune the DP.
✗ Draft-model memory (~10–20 % of the total) hurts on small GPUs.
✗ Very low-entropy sampling (temperature < 0.1) yields only modest gains (fewer branches accepted).
✗ Heavily quantised (< 4-bit) drafts may diverge from the target → lower acceptance.
5. When to use Sequoia
- Latency-critical single-user chatbots (depth≈8, size≈64).
- Edge/offload servers where model weights stream from CPU/NVMe.
- Batch-inference pipelines (speech-to-text, doc-summaries) needing exactness.
- Foundation-model evaluation harnesses (identical outputs required).
6. Limitations & future work
- The current DP assumes positional independence of acceptance; extending it to per-token, context-dependent rates could close the remaining gap.
- Tree optimiser grid-search may miss global optimum for exotic hardware (TPUs with matmul start-up costs).
- Combining Sequoia with lenient acceptance (Medusa-style multiple heads) could push speed >10× while controlling KL-divergence.
Bottom line
If you already run speculative decoding, swapping in Sequoia’s DP tree + non-replacement sampling gives you ~30–60 % extra accepted tokens and more predictable gains across temperatures. With the hardware-aware optimiser, you can reliably hit 3–4× latency speed-ups on GPU and an order of magnitude on CPU↔GPU offload, all without touching model weights.