SwiftSpec: Ultra-Low Latency Decoding for LLMs
- SwiftSpec is an ultra-low latency decoding system for LLMs that employs asynchronous, parallel tree generation and optimized GPU kernels.
- It decouples draft and verification phases to enable concurrent candidate generation and token validation, eliminating sequential decoding bottlenecks.
- The system achieves a 1.75× speedup over previous methods and delivers high throughput, exemplified by 348 tokens/s on Llama3-70B with 8 Nvidia Hopper GPUs.
SwiftSpec is an ultra-low latency decoding system for LLMs that advances the state of the art in speculative decoding through a redesign of the decoding pipeline. By introducing asynchronous and disaggregated processing, parallel tree-based generation of candidate outputs, tree-aware key–value (KV) cache management, and a set of fused, latency-optimized GPU kernels, SwiftSpec overcomes key bottlenecks present in previous approaches, most notably the sequential dependencies of draft and verification stages and inefficiencies arising from the divergent computational profiles of small draft models and large target models. SwiftSpec achieves an average 1.75× speedup against prior systems and delivers high-throughput LLM serving (e.g., Llama3-70B at 348 tokens/s on 8 Nvidia Hopper GPUs), making it the fastest known system for low-latency LLM inference at this scale (Zhang et al., 12 Jun 2025).
1. Motivation and Problem Setting
Interactive applications of LLMs, such as chatbots and code assistants, demand rapid generation of long token sequences while minimizing end-to-end latency, even for single incoming requests. Traditional speculative decoding leverages a smaller draft model to generate multiple tokens and then has the larger target model verify the proposed sequence. However, previous approaches are inherently sequential, making the draft phase a bottleneck, and do not efficiently harness tensor-parallel hardware, especially when model sizes and compute needs differ significantly between the draft and target models. Additionally, they face cache inconsistency and communication challenges under small-batch tensor parallelism. SwiftSpec addresses these interrelated constraints by fully decoupling the draft and verification phases and tightly coordinating both with parallel algorithms and hardware-efficient primitives.
2. Asynchronous Speculative Decoding Pipeline
SwiftSpec implements a fundamentally asynchronous and disaggregated decoding pipeline. Instead of generating a candidate sequence with the draft model and then verifying it with the target model in lock-step, SwiftSpec organizes GPU resources into draft and target groups:
- The draft group generates a tree of candidate sequences speculatively, employing maximum-likelihood expansion to maximize the probability that generated tokens match the target model.
- Simultaneously, the target group verifies the prior batch of candidate tokens. Once verification is complete, the validated tokens are handed back to the draft group.
- The pipeline thus removes the draft phase from the critical path. Both generation and verification occur concurrently, maximizing utilization across asymmetric workloads.
This asynchronous, non-blocking architecture allows each group to scale independently, which is crucial for matching the divergent scaling behavior of small draft models and large target models under tensor parallelism.
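The decoupling can be illustrated with a minimal sketch in Python, using two threads and queues to stand in for the draft and target GPU groups; `draft_expand` and `verify` are placeholder functions, not SwiftSpec's actual interfaces.

```python
# Minimal sketch of the disaggregated pipeline: the draft "group" proposes
# speculative token trees while the target "group" verifies the previous
# batch. Queues stand in for GPU-to-GPU hand-off; draft_expand() and
# verify() are illustrative placeholders.
import queue
import threading

to_target = queue.Queue()   # draft -> target: speculative token trees
to_draft = queue.Ueue() if False else queue.Queue()  # target -> draft: verified prefixes

def draft_expand(prefix):
    # Stand-in for tree expansion with the small draft model.
    return [prefix + [t] for t in (1, 2)]

def verify(tree):
    # Stand-in for a single target-model forward pass over the whole tree.
    return tree[0]

def draft_group(rounds):
    prefix = [0]
    for _ in range(rounds):
        to_target.put(draft_expand(prefix))
        # While the target verifies, the draft group is free to keep
        # expanding; this sketch simply waits for the verified prefix.
        prefix = to_draft.get()
    to_target.put(None)  # signal completion

def target_group():
    while (tree := to_target.get()) is not None:
        to_draft.put(verify(tree))  # runs concurrently with the next draft round

d = threading.Thread(target=draft_group, args=(4,))
v = threading.Thread(target=target_group)
d.start(); v.start(); d.join(); v.join()
```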
3. Parallel Tree Generation Method
A central technical contribution of SwiftSpec is parallel tree generation in the draft phase. Rather than speculatively expanding a single linear sequence, the draft model generates a search tree where:
- Each node represents a candidate token (or sequence).
- The value of a node is the draft model's log-probability of its token given the path from the root, $v(n) = \log p_{\text{draft}}(t_n \mid t_{<n})$.
- The sum of the values along a path from the root to a node therefore gives the cumulative log-likelihood of that candidate sequence.
- A priority queue keyed on cumulative log-likelihood (with $O(\log n)$ insertion and extraction over $n$ leaves) is used to select and expand the highest-likelihood leaves in parallel.
This parallel speculative expansion facilitates a higher "compression ratio"—the proportion of proposed tokens accepted in each round—thus reducing the frequency of expensive synchronizations with the verification phase and maximizing GPU utilization.
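The expansion policy can be sketched with a standard max-heap over cumulative log-probabilities; the draft-model call (`draft_topk`) and the `budget`/`width` parameters below are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of maximum-likelihood tree expansion: leaves live in a priority
# queue keyed by cumulative log-probability, and the best leaves are
# expanded each step. draft_topk() is a placeholder for the draft model's
# top-k next-token distribution.
import heapq
import math
import random

def draft_topk(tokens, k=2):
    # Placeholder: return k (token, prob) pairs for the next position.
    probs = [random.random() for _ in range(k)]
    z = sum(probs)
    return [(100 + i, p / z) for i, p in enumerate(probs)]

def expand_tree(root_tokens, budget=8, width=2):
    # Heap entries: (-cumulative_logprob, tie_breaker, token_path)
    heap = [(0.0, 0, list(root_tokens))]
    counter = 1
    nodes = []
    while heap and len(nodes) < budget:
        neg_score, _, path = heapq.heappop(heap)   # O(log n) extraction
        nodes.append((path, -neg_score))
        for token, prob in draft_topk(path, width):
            child_score = -neg_score + math.log(prob)  # path sum of log-probs
            heapq.heappush(heap, (-child_score, counter, path + [token]))
            counter += 1
    return nodes  # candidate paths with their cumulative log-likelihoods

print(expand_tree([0], budget=4))
```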
4. Tree-Aware KV Cache Management
Handling key–value caches efficiently is critical for high-throughput LLM inference, especially when speculative branches and cache reuse are involved. SwiftSpec introduces a two-part KV cache architecture:
- Prefix Cache: Stores KV pairs for the prefix of the sequence that has been accepted (i.e., verified by the target model).
- Tree Cache: Holds KV pairs for speculative branches in the tree that originate from the last verified prefix.
After each verification round, cache management logic retains only those speculative branches consistent with the verified token sequence and appends them to the prefix cache. This avoids costly recomputation and ensures cache coherence between parallel draft and target operations.
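A minimal sketch of the two-part cache is shown below, assuming a branch-keyed layout; the Python lists stand in for per-token KV tensors, and the class and method names are hypothetical rather than SwiftSpec's actual API.

```python
# Illustrative sketch of the two-part cache: a prefix cache for verified
# tokens and a tree cache keyed by speculative branch. Strings stand in for
# per-token KV tensors; the real system manages GPU memory, not Python lists.
class SpeculativeKVCache:
    def __init__(self):
        self.prefix = []          # KV entries for the verified prefix
        self.tree = {}            # branch id -> list of speculative KV entries

    def add_branch(self, branch_id, kv_entries):
        self.tree[branch_id] = kv_entries

    def commit(self, accepted_branch, n_accepted):
        """Merge the accepted branch's first n_accepted entries into the
        prefix cache and drop every other speculative branch."""
        self.prefix.extend(self.tree[accepted_branch][:n_accepted])
        self.tree.clear()         # stale branches are pruned, not recomputed
        return self.prefix

cache = SpeculativeKVCache()
cache.add_branch("a", ["kv_t5", "kv_t6", "kv_t7"])
cache.add_branch("b", ["kv_t5_alt", "kv_t6_alt"])
cache.commit("a", n_accepted=2)   # keeps kv_t5, kv_t6; discards branch "b"
print(cache.prefix)
```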
5. Latency-Optimized and Fused GPU Kernels
SwiftSpec incorporates custom kernel-level optimizations tailored for low-latency LLM serving with small batch sizes:
- Fused GEMM-AllReduce: Matrix multiply (GEMM) and AllReduce (collective communication across GPU nodes) are fused into a single operation, minimizing memory movement, synchronization, and kernel launch overhead. The implementation makes use of Nvidia NCCL’s Low Latency (LL) protocol, with fine-grained synchronization using storeLL and readLL primitives.
- Optimized Attention Operators: The masked attention implementation supports non-square masks required for irregular tree structures and fuses positional embedding logic.
- Fused SwiGLU Operator: The SwiGLU activation, $\mathrm{SwiGLU}(x) = \mathrm{SiLU}(x W_{\text{gate}}) \odot (x W_{\text{up}})$ with $\mathrm{SiLU}(z) = z\,\sigma(z)$, is implemented in a single kernel to minimize data transfer and synchronization costs.
These fused kernels are essential for maintaining hardware efficiency when the batch sizes are small—as is typical in streaming or interactive settings—and when global synchronization would otherwise dominate latency.
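As a point of reference, the computation that the fused SwiGLU kernel collapses into one pass can be written out in NumPy; the weight names `w_gate` and `w_up` follow common Llama-style conventions and are assumptions rather than identifiers from the paper.

```python
# Reference computation of SwiGLU, i.e. what the fused kernel produces in a
# single pass instead of separate ops (two GEMMs plus elementwise gating).
# Weight naming follows Llama-style conventions; shapes are illustrative.
import numpy as np

def swiglu(x, w_gate, w_up):
    gate = x @ w_gate                            # GEMM 1
    up = x @ w_up                                # GEMM 2
    silu = gate * (1.0 / (1.0 + np.exp(-gate)))  # SiLU(z) = z * sigmoid(z)
    return silu * up                             # elementwise gating

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4096))               # single-token, small-batch regime
w_gate = rng.standard_normal((4096, 11008))
w_up = rng.standard_normal((4096, 11008))
print(swiglu(x, w_gate, w_up).shape)             # (1, 11008)
```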
6. Performance Results
Extensive evaluation over five model families (including Llama and Alpaca) and six standard datasets (e.g., MT-Bench, GSM8K, HumanEval) demonstrates:
- On average, SwiftSpec achieves a 1.75× speedup compared to state-of-the-art speculative decoding systems across all datasets and models tested.
- For Llama3-70B served on 8 Nvidia Hopper GPUs, SwiftSpec delivers a peak throughput of 348 tokens/sec.
- Compression ratios and speedups are maintained even as model size and length of generated output grow, establishing practical scalability for both real-time applications and larger deployments.
7. Implementation Details and Technical Specifications
SwiftSpec employs configuration choices informed by empirical profiling and model scaling laws:
| Component | Algorithm/Value | Rationale |
|---|---|---|
| Node scoring | Cumulative log-probability (sum of $\log p_{\text{draft}}$ along each path) | Enables maximum-likelihood expansion |
| Batch size | 8 tokens (typical) | Aligns with the minimum efficient GPU kernel size |
| GPU allocation | e.g., 6:2 target:draft on an 8-GPU node | Matches scaling benefit to model size |
| Priority queue | $O(\log n)$ insertion/extraction over leaves | Parallel expansion of high-likelihood leaves |
- Cache Management: Verified tokens are merged into the prefix cache; non-matching speculative branches are pruned.
- Fused Kernels: Synchronized using storeLL/readLL primitives at the thread level, avoiding explicit barriers.
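These choices can be summarized in a hypothetical configuration object; the field names and structure below are illustrative and do not correspond to SwiftSpec's actual configuration schema.

```python
# Hypothetical configuration mirroring the table above; field names are
# illustrative, not SwiftSpec's real config schema.
from dataclasses import dataclass

@dataclass
class SwiftSpecConfig:
    target_gpus: int = 6     # larger share for the large target model
    draft_gpus: int = 2      # small draft model needs less tensor parallelism
    tree_budget: int = 8     # speculative tokens proposed per round (typical)

cfg = SwiftSpecConfig()
assert cfg.target_gpus + cfg.draft_gpus == 8   # one 8-GPU Hopper node
```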
8. Significance and Implications
SwiftSpec’s architectural and algorithmic innovations address the limitations of prior speculative decoding strategies by decoupling the computation of candidate tokens and their verification, enabling both to proceed in parallel and at full hardware efficiency. These techniques are particularly relevant for:
- Latency-sensitive LLM applications, including live chat and code completion.
- Scenarios requiring high throughput under single or small-batch query loads.
- Deployments on large, heterogeneous multi-GPU infrastructure, where resource contention and communication overheads were previously limiting factors.
The system’s design is well-positioned to influence future ultra-low latency LLM infrastructure and may inform similar optimizations in other domains requiring parallel speculative execution with dynamic cache management.
SwiftSpec thus constitutes a substantial advancement in the field of high-performance LLM serving and decoding, setting new empirical and practical benchmarks for both latency and throughput (Zhang et al., 12 Jun 2025).