Sequoia introduces a drop-in replacement for the standard “loop-over-tokens” inference pattern of LLMs. It accelerates generation while preserving the output distribution by (1) speculating a tree of candidate continuations, (2) verifying that tree in one target-model forward pass, and (3) accepting the longest prefix that provably follows the target model's own output distribution. Compared with prior speculative-decoding work (sequence-based drafts, independent sequences, SpecInfer, SpecTr, Medusa, etc.), Sequoia scales to larger speculation budgets, stays robust across sampling temperatures, and chooses tree shapes that map well to real hardware.
Why this matters in practice
- Single-GPU serving: 3–4× end-to-end latency reduction for 7–13 B models on A100/L40 without quality loss.
- CPU↔GPU offloading: up to 10× speed-up (0.56 s/token for Llama-2-70B on an L40 with system memory offload).
- No retraining, distillation, quantization, or model surgery. Sequoia wraps any Hugging Face model.
1. Core ideas & algorithms
1.1 Optimal token-tree construction (dynamic programming)
Given a speculation budget of n nodes and a depth limit d, Sequoia maximises the expected number of accepted tokens per decoding step, where p_i denotes the acceptance probability of a node's i-th ranked child (empirically measured once per draft/target model pair).
Dynamic-programming recurrence (unbounded depth):
c(n) = 1 + max_{a_1 + ... + a_{n-1} = n-1} Σ_{i=1}^{n-1} p_i · c(a_i)    (Eq. 1)

Depth-bounded variant:

c(n, d) = 1 + max_{a_1 + ... + a_{n-1} = n-1} Σ_{i=1}^{n-1} p_i · c(a_i, d-1)
Key takeaway – the expected number of accepted tokens keeps growing as the budget n increases (unbounded growth), whereas previously handcrafted trees plateau.
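The recurrence is cheap to evaluate directly. Below is a minimal memoised sketch of the depth-bounded version; the toy acceptance vector p and the helper names c/split are illustrative assumptions, not Sequoia's actual implementation.

```python
from functools import lru_cache

# Hypothetical acceptance vector: p[i] is the measured probability that a node's
# i-th ranked child is accepted (step ❶ in Section 2 shows one way to estimate it).
p = (0.85, 0.45, 0.20, 0.10, 0.05)

@lru_cache(maxsize=None)
def c(n: int, d: int) -> float:
    """Expected accepted tokens for a subtree with n nodes and depth budget d."""
    if n <= 0:
        return 0.0
    if n == 1 or d <= 1:
        return 1.0                       # a lone node contributes exactly one token
    return 1.0 + split(n - 1, 0, d - 1)  # "+1" is the node itself; the rest goes to children

@lru_cache(maxsize=None)
def split(m: int, i: int, d: int) -> float:
    """Best expected gain from distributing m nodes among children ranked i, i+1, ..."""
    if m <= 0 or i >= len(p):
        return 0.0
    return max(p[i] * c(a, d) + split(m - a, i + 1, d)   # give a nodes to child i
               for a in range(m + 1))

print(c(64, 6))   # expected accepted tokens for a 64-node, depth-6 tree
```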
1.2 Robust sampling + verification
At low temperature, SpecInfer/SpecTr can repeatedly resample the same rejected draft token and stall. Sequoia tweaks the SpecInfer verification loop to sample without replacement:
```python
draft_dist = Q.copy()
rejected = set()
for i in range(k):
    x = sample(draft_dist, without_replacement=True)
    if accept(x, target=P, draft=draft_dist):
        return x                          # accepted
    rejected.add(x)
    # Update the residual target and zero out the rejected token in the draft
    P = normalize(relu(P - draft_dist))
    draft_dist[x] = 0
    if draft_dist.sum() == 0:             # exhausted Q support
        draft_dist = uniform_over(V - rejected)
return sample(P)
```
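For intuition, here is a self-contained NumPy sketch of the same loop on toy distributions; the function name, the helper logic, and the toy P/Q values are illustrative assumptions rather than Sequoia's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_without_replacement(P, Q, k):
    """Toy sketch: propose up to k draft tokens without replacement and verify each."""
    P, Q = P.astype(float).copy(), Q.astype(float).copy()
    rejected, V = set(), len(P)
    for _ in range(k):
        q = Q / Q.sum()                            # renormalised current draft distribution
        x = rng.choice(V, p=q)
        if rng.random() < min(1.0, P[x] / q[x]):   # standard speculative accept test
            return x, True                         # draft token accepted
        rejected.add(x)
        P = np.maximum(P - q, 0.0)                 # residual target distribution
        P = P / P.sum()
        Q[x] = 0.0                                 # never propose a rejected token again
        if Q.sum() == 0.0:                         # draft support exhausted
            Q = np.ones(V)
            Q[list(rejected)] = 0.0
    return rng.choice(V, p=P), False               # fall back to sampling the residual

P = np.array([0.60, 0.25, 0.10, 0.05])             # toy target distribution
Q = np.array([0.10, 0.70, 0.15, 0.05])             # toy draft distribution
print(verify_without_replacement(P, Q, k=3))
```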
Properties
- Optimal-transport property (best possible accept-rate when k=1).
- Cover property (accepts within ≤ k steps when the draft support ⊇ the target support). ⇒ Robust from T = 0.2 to 1.0 and top-p 0.8–1.0.
1.3 Hardware-aware tree optimiser
Verifying n tokens in one forward pass is not constant-time on real GPUs: beyond some point, latency grows with the verified-token count. Measure this once:
```python
def t(n):
    """Relative time of verifying n tokens on the target model."""
    return empirical_measure(n) / empirical_measure(1)
```
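empirical_measure is left abstract above. One simple way to obtain it (a sketch, assuming target is a Hugging Face causal LM already on the GPU) is to time dummy forward passes over a range of token counts:

```python
import time
import torch

@torch.no_grad()
def empirical_measure(n, n_warmup=3, n_repeat=10):
    """Median wall-clock time of one target forward pass over n tokens (toy benchmark)."""
    ids = torch.randint(0, target.config.vocab_size, (1, n), device="cuda")
    for _ in range(n_warmup):                 # warm-up: allocator, kernels, autotuning
        target(ids)
    torch.cuda.synchronize()
    times = []
    for _ in range(n_repeat):
        start = time.perf_counter()
        target(ids)
        torch.cuda.synchronize()              # forward launches are async; wait first
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]

# profile t(n) on your hardware:
# relative_t = {n: empirical_measure(n) / empirical_measure(1)
#               for n in (1, 8, 16, 32, 64, 128, 256)}
```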
Speed-up model: Speedup(n, d) = G(n, d) / (t(n) + d·c), where G(n, d) is the expected accepted-token count from Section 1.1 and c = (draft time per token) / (verify time per token). Grid-search (n, d) using the measured t(·); the sweet spot differs by setup, e.g. n ≈ 64–128 on an A100 vs n ≈ 768 with offloading.
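The grid search itself is a few lines. A sketch, assuming G is the c(n, d) function from Section 1.1 and t is a dict of measured relative verification times (both names are assumptions):

```python
def best_tree_shape(G, t, c_ratio,
                    sizes=(16, 32, 64, 128, 256, 512, 768),
                    depths=range(2, 12)):
    """Pick the (size, depth) pair maximising the modelled speed-up."""
    best = (None, None, 0.0)
    for n in sizes:
        for d in depths:
            speedup = G(n, d) / (t[n] + d * c_ratio)
            if speedup > best[2]:
                best = (n, d, speedup)
    return best

# n_opt, d_opt, predicted = best_tree_shape(G=c, t=relative_t, c_ratio=0.08)
```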
2. Implementing Sequoia step-by-step
❶ Collect acceptance vector
```python
import numpy as np
import torch

def estimate_p(draft, target, dataset, k_max=512, N=200):
    """Estimate the per-rank acceptance probabilities p_1 .. p_{k_max}."""
    counts = np.zeros(k_max)
    totals = np.zeros(k_max)
    for prompt in dataset[:N]:
        toks = tokenizer(prompt)['input_ids']
        with torch.no_grad():
            q = draft(...); p = target(...)          # next-token distributions (elided)
        # compute α(x) = min(1, p/q) per vocab entry; project onto draft ranks
        ratio = torch.minimum(torch.ones_like(p), p / q)
        sorted_idx = torch.argsort(q, descending=True)
        for rank, i in enumerate(sorted_idx[:k_max]):
            totals[rank] += 1
            counts[rank] += ratio[i].item()
    return counts / totals
```
❷ Dynamic-programming tree builder
Pre-compute best_size_depth[n][d] and store the child-size splits so the tree topology can be regenerated once at runtime.
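Building on the memoised c/split sketch from Section 1.1, one way to record and replay the optimal splits (the function names and the nested-dict topology format are illustrative assumptions, not Sequoia's data layout):

```python
def best_split(n, d):
    """Recover the child-subtree sizes achieving c(n, d): returns [a_1, a_2, ...]."""
    if n <= 1 or d <= 1:
        return []
    sizes, m = [], n - 1
    for i in range(len(p)):
        if m <= 0:
            break
        # allocation for child i that reproduces the memoised optimum
        # (ties prefer larger allocations so child ranks stay contiguous)
        best_a = max(range(m, -1, -1),
                     key=lambda a: p[i] * c(a, d - 1) + split(m - a, i + 1, d - 1))
        if best_a == 0:
            break
        sizes.append(best_a)
        m -= best_a
    return sizes

def build_topology(n, d):
    """Nested dict describing the optimal tree: child rank -> subtree."""
    return {rank: build_topology(a, d - 1) for rank, a in enumerate(best_split(n, d))}

# topology = build_topology(n=64, d=6)   # compute once, reuse at every decoding step
```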
❸ CUDA Graphs for deterministic shapes
Once (size, depth) is fixed, capture the target forward pass as a single CUDA graph and pre-allocate the KV cache for the maximum depth.
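A minimal capture/replay sketch with PyTorch's CUDA-graph API, assuming target is a causal LM whose forward runs on static buffers (KV-cache and attention-mask plumbing omitted):

```python
import torch

tree_size = 64                                   # fixed by the tree optimiser
static_ids = torch.zeros(1, tree_size, dtype=torch.long, device="cuda")

# Warm up on a side stream before capture (required for CUDA graphs).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        target(static_ids)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_logits = target(static_ids).logits    # shapes frozen at capture time

def verify(tree_tokens):
    static_ids.copy_(tree_tokens)                # write into the captured input buffer
    graph.replay()                               # re-run the recorded kernels
    return static_logits                         # read from the captured output buffer
```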
❹ Sampling without replacement at GPU speed
Use the exponential-sort (Gumbel-top-k) trick to draw k unique token ids from a categorical distribution in an O(k + V) kernel; wrap it in a CUDA graph as well.
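The trick itself is a few lines in PyTorch: perturb the logits with Gumbel noise and take the top-k, which is distributed as sampling k tokens without replacement (a sketch; the function name is an assumption):

```python
import torch

def gumbel_topk(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Draw k distinct token ids, distributed as sampling without replacement."""
    u = torch.rand_like(logits).clamp_min(1e-10)   # U ~ Uniform(0, 1)
    gumbel = -torch.log(-torch.log(u))             # Gumbel(0, 1) noise
    return torch.topk(logits + gumbel, k).indices

# e.g. draw 8 distinct child candidates from the draft's next-token logits:
# child_ids = gumbel_topk(draft_logits[0, -1], k=8)
```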
❺ Integration into HF generation loop
```python
while not finished:
    if tree_empty:
        # build a fresh tree of candidates via the draft model(s)
        tree_logits = draft.forward(...)
        tree_tokens = sample_k_ary(...)
    # verify the root-to-leaf path with the target model
    t_logits = target.forward(...)
    accept_len = accept_prefix(tree_tokens, t_logits)
    output.extend(tree_tokens[:accept_len])
    # prune the tree; if the entire depth was accepted, clear the tree
```
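accept_prefix is where the verification from Section 1.2 runs. For stochastic sampling it is the without-replacement loop above; for the greedy (temperature-0) case it reduces to an argmax comparison along the chosen branch. A sketch, assuming path_tokens are the drafted tokens along one root-to-leaf path and path_logits[i] is the target's distribution for position i:

```python
import torch

def accept_prefix(path_tokens: torch.Tensor, path_logits: torch.Tensor) -> int:
    """Greedy verification: length of the leading run where draft == target argmax."""
    target_choice = path_logits.argmax(dim=-1)     # target's token at each position
    matches = path_tokens == target_choice
    mismatch = (~matches).nonzero()
    return int(mismatch[0]) if mismatch.numel() else int(matches.numel())
```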
3. Performance & scaling tips
GPU (A100/L40)
- Use fp16 for the draft, bf16/fp16 for the target; draft batch ≈ tree size / depth.
- max_depth ≈ 6–10 keeps draft latency hidden behind target verification.
Offload (FlexGen, vLLM-paged-attention)
- Tree size should saturate PCIe; measure t(n) on your link (often roughly linear for n ≥ 128).
- Keep the draft model resident in GPU memory so its weights do not cross PCIe as well.
Multi-request batching
- Verify pass runs on (batch_size × tree_size) tokens – re-measure t(n).
- Tree optimiser naturally shrinks n when batch grows.
4. Practical trade-offs
Pros
✓ Exact output distribution; safe when deterministic or exact generation is required.
✓ Compatible with quantised / pruned / MoE targets (only logits are needed).
✓ Orthogonal to distillation or activation sparsity.
Cons / considerations
✗ Extra engineering to collect the acceptance vector and tune the DP.
✗ Draft-model memory (~10–20 % of the total) hurts on small GPUs.
✗ Very low-entropy sampling (temperature < 0.1) yields only modest gains (fewer branches accepted).
✗ Heavily quantised (< 4-bit) drafts may diverge from the target → lower acceptance.
5. When to use Sequoia
- Latency-critical single-user chatbots (depth≈8, size≈64).
- Edge/offload servers where model weights stream from CPU/NVMe.
- Batch-inference pipelines (speech-to-text, doc-summaries) needing exactness.
- Foundation-model evaluation harnesses (identical outputs required).
6. Limitations & future work
- The current DP assumes positional independence of acceptance; extending it to per-token, context-dependent rates could close the remaining gap.
- Tree optimiser grid-search may miss global optimum for exotic hardware (TPUs with matmul start-up costs).
- Combining Sequoia with lenient acceptance (Medusa-style multiple heads) could push speed >10× while controlling KL-divergence.
Bottom line
If you already run speculative decoding, swapping in Sequoia’s DP tree + non-replacement sampling gives you ~30–60 % extra accepted tokens and more predictable gains across temperatures. With the hardware-aware optimiser, you can reliably hit 3–4× latency speed-ups on GPU and an order of magnitude on CPU↔GPU offload, all without touching model weights.