SSV: Sparse Speculative Verification for Efficient LLM Inference

Published 19 May 2026 in cs.OS | (2605.19893v2)

Abstract: Speculative decoding and dynamic sparse attention are two complementary approaches for accelerating long-context LLM inference: the former amortizes target-model execution across multiple verifier queries, while the latter reduces each query's KV-cache working set. Directly combining them, however, exposes a structural mismatch: speculative verification relies on cross-query commonality, whereas dynamic sparse attention assigns query-specific sparse layouts. This mismatch limits KV-block reuse, amplifies NSA's branch-wise overheads, and makes verification strategy selection input- and regime-dependent. We present SSV, a sparse speculative-verification framework that turns dynamic sparse attention into a verification-oriented workload. SSV combines overlap-aware grouped-query execution, refresh/reuse-based NSA kernel fusion, and profile-guided prompt-adaptive orchestration to improve cross-query reuse, reduce selected-index and branch-fusion overheads, and select effective draft-verification strategies under user-specified precision classes. Experiments on NVIDIA H100 GPUs show that SSV achieves up to 3.49x end-to-end throughput over autoregressive NSA decoding and up to 6.86x kernel speedups for sparse speculative verification.

Summary

  • The paper introduces a novel SSV framework that integrates speculative decoding with dynamic sparse attention to achieve up to 3.49x throughput improvements.
  • It employs overlap-aware kernel design along with refresh/reuse-based NSA fusion and prompt-adaptive orchestration to reduce latency and kernel overhead.
  • Empirical evaluations on NVIDIA H100 GPUs demonstrate significant gains, including 6.86x kernel speedups, while preserving model accuracy.

Motivation and Structural Challenges

SSV addresses the compound efficiency limits in long-context autoregressive LLM inference by jointly leveraging speculative decoding and dynamic sparse attention. Speculative decoding amortizes KV-cache access across a batch of verifier queries by employing lightweight draft-model generations and parallel verification with the target model. Dynamic sparse attention—specifically, Native Sparse Attention (NSA)—reduces the per-query KV working set by routing each query to only the most relevant blocks and maintaining local sliding windows.
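The per-query routing described above can be pictured with a small sketch. It is a simplified illustration, not NSA's actual kernel: the function name `select_kv_blocks`, the scalar per-block scores, and all parameter values are assumptions made for this example.

```python
def select_kv_blocks(block_scores, query_pos, block_size, top_k, window):
    """Return the sorted KV-block ids one query attends to.

    Hypothetical helper: combines the top-k highest-scoring blocks with the
    blocks that cover a local sliding window ending at the query position.
    """
    # Causal constraint: only blocks up to the query's own block are visible.
    num_visible = query_pos // block_size + 1
    ranked = sorted(range(num_visible), key=lambda b: block_scores[b], reverse=True)
    selected = set(ranked[:top_k])

    # Sliding window: add every block overlapping the last `window` tokens.
    first_window_token = max(0, query_pos - window + 1)
    for block in range(first_window_token // block_size, num_visible):
        selected.add(block)
    return sorted(selected)


# Illustrative scores for 8 blocks of 16 tokens each (made-up values).
scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.05, 0.4, 0.6]
print(select_kv_blocks(scores, query_pos=127, block_size=16, top_k=2, window=32))
# -> [0, 2, 6, 7]
```

In this toy case the query touches 4 of 8 blocks, which is the working-set reduction that sparse attention aims for. Because the scores depend on the query, two verifier queries may end up with different block lists.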

The direct coupling of these two techniques, however, exposes a fundamental structural mismatch: speculative verification relies on cross-query regularity (shared committed prefix and batch KV accesses), whereas sparse attention introduces query-specific routing (unique block selection per query). This results in diminished KV-block reuse, amplified branch-wise kernel overhead, and input-dependent verification strategy requirements, thus limiting practical acceleration.

SSV Framework and Core Mechanisms

SSV proposes three principal optimizations to rectify the aforementioned mismatch and maximize hardware efficiency in speculative verification with dynamic sparse attention:

  1. Overlap-Aware Kernel Design: SSV exploits the observed high overlap in selected blocks between adjacent verifier queries. By grouping up to C adjacent queries and deduplicating their block schedules, one can either preserve exact per-query sparse-selection semantics or deploy an approximate shared-index layout. This grouping reduces redundant KV-block loads and scheduling overhead.
  2. Refresh/Reuse-Based NSA Kernel Fusion: SSV restructures NSA kernel execution into refresh (routing-aware) and reuse (index-inherited) layers. Refresh layers recompute selected indices and fuse downstream selection and sliding-window branches, while reuse layers inherit selection layouts and deploy fully fused kernels. This minimizes kernel launches and intermediate data movement, transforming the index-selection step from a dominant overhead into a reusable execution plan.
  3. Profile-Guided Prompt-Adaptive Orchestration: SSV integrates offline profile-guided planning and online prompt-adaptive refinement for verification strategy tuples. The planner ranks configurations offline by throughput, partitioned by context regime and precision class, and enables dynamic adjustment during early decoding based on observed acceptance rates. Precision classes facilitate trade-offs between accuracy and verification speed.
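The first optimization can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's kernel: `group_block_schedules` and `count_block_loads` are hypothetical names, and the shared-index layout is assumed here to be the union of the group's selections, which the summary does not specify.

```python
def group_block_schedules(per_query_blocks, group_size, shared_index=False):
    """Group adjacent queries and deduplicate their selected KV blocks.

    per_query_blocks: list of block-id lists, one per verifier query.
    group_size: maximum number of adjacent queries per group (C in the text).
    shared_index: False keeps exact per-query selection through masks;
                  True lets every query in a group attend to all blocks of
                  the group (ASSUMPTION: shared layout = union of selections).
    """
    groups = []
    for start in range(0, len(per_query_blocks), group_size):
        members = per_query_blocks[start:start + group_size]
        union = sorted(set().union(*members))  # each block is loaded once
        if shared_index:
            masks = [[True] * len(union) for _ in members]
        else:
            member_sets = [set(m) for m in members]
            masks = [[b in s for b in union] for s in member_sets]
        groups.append({"blocks": union, "masks": masks})
    return groups


def count_block_loads(per_query_blocks, groups):
    """Return (loads without grouping, loads with grouping)."""
    ungrouped = sum(len(blocks) for blocks in per_query_blocks)
    grouped = sum(len(g["blocks"]) for g in groups)
    return ungrouped, grouped


# Four adjacent verifier queries with heavily overlapping selections
# (made-up block ids for illustration).
per_query = [[0, 2, 6, 7], [0, 2, 6, 7], [0, 3, 6, 7], [0, 3, 7, 8]]
groups = group_block_schedules(per_query, group_size=4)
print(count_block_loads(per_query, groups))  # -> (16, 6)
```

In exact mode the masks restore each query's own selection after the shared load, so the attention result is unchanged; in shared-index mode the masks are all true and the result becomes an approximation.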

Empirical Evaluation

The efficacy of SSV is substantiated through extensive experiments on NVIDIA H100 GPUs using Llama-based NSA models and integration in EAGLE-3. Key results include:

  • End-to-End Throughput: SSV achieves up to 3.49x generation throughput over autoregressive NSA decoding and substantially outperforms vanilla NSA integrated with speculative decoding frameworks. Maximum throughput is attained with reuse layers and grouped-query execution at draft-tree depth D=6 and branching width k=4.
  • Verification-Stage Latency: Combining reuse layers with approximate shared-index grouping yields the peak verification speedups (up to 1.45x on NSA-1B, 1.23x on NSA-8B) over the baseline, with gains scaling with draft length and cross-query overlap.
  • Kernel Microbenchmarks: Kernel speedups reach 6.86x by eliminating repeated index construction and fusing execution across layers and queries. The index-selection overhead, accounting for up to 56% of kernel runtime, is effectively eliminated via reuse-layer designs.
  • Throughput-Aware Planning: Joint planning and prompt-adaptive refinement (Best+R) deliver a 14.4% throughput improvement on held-out prompts, with up to 33.2% accepted-token gain in the best context regime. Runtime refinement corrects profile choice mismatches with negligible overhead, confirming practical adaptability.
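The planning result above combines an offline ranking with an online correction. The sketch below shows one way such a loop could look; it is not the paper's planner. The `Strategy` fields, the fallback rule in `refine_online`, the `tolerance` threshold, and every number are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """Hypothetical verification strategy tuple with its profiled statistics."""
    depth: int               # draft-tree depth
    width: int               # branching width
    group_size: int          # grouped-query size
    expected_accept: float   # profiled accepted tokens per verification step
    throughput: float        # profiled tokens per second


def plan_offline(profiles):
    """Rank candidates by profiled throughput for each (regime, precision) key."""
    return {
        key: sorted(cands, key=lambda s: s.throughput, reverse=True)
        for key, cands in profiles.items()
    }


def refine_online(ranked, current_index, observed_accept, tolerance=0.8):
    """Keep the current strategy if the acceptance observed during early
    decoding is close to its profile; otherwise fall back to the next-ranked
    candidate. ASSUMPTION: this simple threshold rule is illustrative only."""
    current = ranked[current_index]
    if observed_accept >= tolerance * current.expected_accept:
        return current_index
    return min(current_index + 1, len(ranked) - 1)


# Made-up profile entries for one context regime and precision class.
profiles = {
    ("long", "approx"): [
        Strategy(4, 2, 2, expected_accept=3.0, throughput=90.0),
        Strategy(6, 4, 4, expected_accept=4.0, throughput=120.0),
    ]
}
ranked = plan_offline(profiles)[("long", "approx")]
choice = refine_online(ranked, current_index=0, observed_accept=2.0)
print(ranked[choice])  # falls back to the depth-4 strategy
```

The point of the design is that the expensive search happens offline; at run time only a cheap comparison against the profile is needed.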

Accuracy Preservation and Approximation Modes

SSV's approximation mechanisms (grouped-query shared-index execution and cross-layer index reuse) show negligible degradation on standard accuracy benchmarks (PIQA, HellaSwag, ARC-Easy, ARC-Challenge) even under an aggressive coarsening factor (C=4), which supports their use in high-throughput settings. Precision classes let users specify acceptable tolerances, choosing between accuracy-preserving and controlled-approximation modes.
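Cross-layer index reuse, one of the approximation mechanisms named above, can be sketched as a layer schedule in which only some layers recompute the selected indices. The fixed `refresh_every` interval and the `select_indices` callback are assumptions for illustration; the summary does not say how refresh layers are chosen.

```python
def build_layer_plan(num_layers, refresh_every, select_indices):
    """Return a list of (layer, kind, indices) entries.

    Refresh layers call `select_indices(layer)` to recompute the selection;
    reuse layers inherit the most recent selection unchanged.
    ASSUMPTION: refresh layers occur at a fixed interval.
    """
    plan = []
    current = None
    for layer in range(num_layers):
        if layer % refresh_every == 0:
            current = select_indices(layer)
            plan.append((layer, "refresh", current))
        else:
            plan.append((layer, "reuse", current))
    return plan


# Count how often the (hypothetical) index-selection step actually runs.
calls = []

def fake_select(layer):
    calls.append(layer)
    return [0, 2, 6, 7]  # made-up block ids

plan = build_layer_plan(num_layers=8, refresh_every=4, select_indices=fake_select)
print(len(calls), "selections for", len(plan), "layers")  # -> 2 selections for 8 layers
```

Reuse layers skip the selection step entirely, which is why the approximation trades a small accuracy risk for fewer index constructions.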

Implications and Future Directions

SSV's architectural design and empirical validation provide evidence for efficient integration of blockwise sparse attention into speculative decoding workflows with practical hardware-optimized kernels. The framework illustrates that overlap-aware grouping, cross-layer index reuse, and prompt-adaptive orchestration can collectively unlock substantial throughput gains in LLM inference, bridging algorithmic sparsity with hardware realization.

Theoretically, SSV's mechanisms generalize to other dynamic sparse attention backends given explicit routing metadata and sufficient cross-query overlap, opening avenues for further backend-specific kernel fusion designs. Practically, the prompt-adaptive planning paradigm establishes a robust methodology for inference-time adaptation without costly online search.

Future extensions include:

  • Adapting SSV to sparse attention variants such as DeepSeek Sparse Attention, once the backend requirements (explicit routing metadata and sufficient cross-query overlap) are met.
  • Studying the impact of SSV in multi-user and distributed inference scenarios where cross-query overlap statistics may be more heterogeneous.
  • Investigating hardware-specific optimizations for emerging accelerators, particularly those with advanced memory hierarchies and explicit fusion capabilities.

Conclusion

SSV effectively reconciles the structural mismatch between speculative decoding and dynamic sparse attention, delivering efficient long-context LLM inference through overlap-aware kernel design, fused execution, and throughput-aware orchestration. The framework demonstrates strong empirical improvements in end-to-end throughput and kernel efficiency without compromising accuracy, underpinning scalable deployment in practical LLM serving environments (2605.19893).
