
Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling

Published 5 Apr 2026 in cs.LG, cs.AI, math.OC, and stat.ML | (2604.04987v1)

Abstract: Speculative sampling (SpS) has been successful in accelerating the decoding throughput of auto-regressive LLMs by leveraging smaller draft models. SpS strictly enforces the generated distribution to match that of the verifier LLM. This is unnecessarily restrictive as slight variations of the verifier's distribution, such as sampling with top-$k$ or temperature, would also be acceptable. Typical acceptance sampling (TAS) alleviates this issue by accepting more tokens using entropy-based heuristics. However, this approach distorts the verifier distribution, potentially degrading output quality when the verifier encodes critical information. In this work, we formalize the speculative sampling algorithm through the lens of constrained optimization. Based on this formulation, we propose Cactus (constrained acceptance speculative sampling), a method that guarantees controlled divergence from the verifier distribution and increasing acceptance rates. Empirical results across a wide range of benchmarks confirm the effectiveness of our approach.

Authors (2)

Summary

  • The paper introduces a constrained optimization framework for speculative sampling to balance verifier fidelity and increased token acceptance.
  • It employs a second-order Taylor approximation of the KL divergence to derive an efficient update to the acceptance rule, improving decoding throughput.
  • Experimental results show up to 1.9× speedup across multiple LLMs while maintaining high accuracy and robust output quality.

Constrained Acceptance Speculative Sampling for Accelerated Auto-Regressive Decoding

Summary and Theoretical Contributions

The paper "Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling" (2604.04987) introduces a principled, training-free approach for speeding up auto-regressive decoding in LLMs, addressing limitations of both standard speculative sampling (SpS) and typical acceptance sampling (TAS). The work establishes a constrained optimization framework for speculative sampling, enabling a precise trade-off between distributional fidelity to the verifier and increased acceptance rates of the drafter's proposals.

The core insight is that strict equivalence between the generated distribution and the verifier's distribution, as enforced by SpS, is unnecessarily restrictive. The constrained framework allows for controlled divergence from the verifier (measured by an $f$-divergence, in particular the KL divergence), which is shown to increase decoding efficiency while maintaining output quality. The resulting Cactus algorithm modifies the acceptance rates and the recovery strategy so that the output distribution remains within a provable divergence bound $\delta$ of the verifier, while empirically achieving superior acceptance rates and throughput.
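For reference, the standard SpS rule that Cactus relaxes can be sketched as follows. This is a minimal illustration of the well-known single-token accept/recover step, not code from the paper; the function name `sps_verify_token` and the list-based representation of the draft distribution `p_draft` and verifier distribution `q_verifier` are assumptions made for the example.

```python
import random

def sps_verify_token(token, p_draft, q_verifier, rng=random):
    """Standard speculative sampling check for one drafted token.

    p_draft and q_verifier are probability lists over the same vocabulary,
    and `token` is assumed to have been sampled from p_draft.
    Returns (output_token, accepted).
    """
    # Accept the drafted token with probability min(1, q(x) / p(x)).
    accept_prob = min(1.0, q_verifier[token] / p_draft[token])
    if rng.random() < accept_prob:
        return token, True
    # On rejection, recover by sampling from the residual distribution
    # proportional to max(q - p, 0); this keeps the output distributed as q.
    residual = [max(q - p, 0.0) for q, p in zip(q_verifier, p_draft)]
    recovered = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return recovered, False
```

Because the accept step and the recovery step are designed together, the output token is distributed exactly as the verifier's distribution. Any relaxation that accepts more often therefore has to change both steps, which is why the summary mentions the acceptance rates and the recovery strategy together.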

Algorithmic Advances

The paper formalizes speculative sampling as a constrained optimization problem. For each sequence step, rather than strictly matching the verifier's distribution, Cactus seeks a distribution $h$ maximizing acceptance of draft tokens, under the constraint $D_f(h \| q) \leq \delta$. The solution for $h$, specialized to KL divergence (yielding the Cactus algorithm), is obtained via a second-order Taylor approximation centered at the verifier's probability $q$. This results in an efficient, element-wise update to the acceptance rule, accommodating large vocabulary sizes without additional memory or computational burden.
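Written out, the formulation might look like the sketch below. The notation is assumed for illustration: $p$ is the draft distribution, $\mathcal{V}$ the vocabulary, and $\Delta(\mathcal{V})$ the probability simplex. The objective shown is the acceptance probability of the standard SpS rule when the target is $h$, which is one natural reading of "maximizing acceptance of draft tokens"; the paper's exact objective and closed-form solution may differ. The second expression is the standard second-order expansion of the KL divergence around $h = q$.

```latex
% Sketch only: the objective is one reading of "maximizing acceptance
% of draft tokens", not a quotation from the paper.
\begin{aligned}
\max_{h \in \Delta(\mathcal{V})} \quad & \sum_{x \in \mathcal{V}} \min\bigl(p(x),\, h(x)\bigr) \\
\text{s.t.} \quad & D_f(h \,\|\, q) \le \delta
\end{aligned}

% Second-order expansion of the KL case around h = q, writing
% \varepsilon = h - q with \sum_x \varepsilon(x) = 0:
D_{\mathrm{KL}}(h \,\|\, q)
  = \sum_{x} h(x) \log \frac{h(x)}{q(x)}
  \approx \frac{1}{2} \sum_{x} \frac{\bigl(h(x) - q(x)\bigr)^2}{q(x)}
```

The first-order terms cancel because $h$ and $q$ both sum to one. The remaining quadratic form is a sum of independent per-token terms, which is consistent with the element-wise update described above.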

Unlike TAS, which heuristically increases acceptance at the risk of semantic drift, Cactus rigorously enforces a divergence constraint, ensuring that the generated outputs do not deviate excessively from the verifier's intended output distribution. The theoretical analysis includes optimality guarantees for the achieved acceptance rate under the divergence budget, a generalization to other $f$-divergences, and the suboptimality of cross-entropy-constrained variants (as in TAS).
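To make the idea of a divergence budget concrete, the sketch below shifts probability mass from the verifier distribution toward a drafted token and uses bisection to find the largest shift that keeps the KL divergence within $\delta$. This is an illustrative construction, not the paper's closed-form rule: the mixing scheme and the names `kl_divergence`, `shift_toward_token`, and `max_shift_within_budget` are assumptions, and the verifier probabilities are assumed to be strictly positive.

```python
import math

def kl_divergence(h, q):
    """KL(h || q) for probability lists; terms with h_i == 0 contribute 0."""
    return sum(hi * math.log(hi / qi) for hi, qi in zip(h, q) if hi > 0.0)

def shift_toward_token(q, token, t):
    """Mix q with a point mass on `token`: h = (1 - t) * q + t * e_token."""
    return [(1.0 - t) * qi + (t if i == token else 0.0)
            for i, qi in enumerate(q)]

def max_shift_within_budget(q, token, delta, iters=60):
    """Largest mixing weight t in [0, 1] with KL(h_t || q) <= delta.

    KL(h_t || q) is convex in t and equals 0 at t = 0, so it is
    non-decreasing on [0, 1] and bisection applies.
    """
    if kl_divergence(shift_toward_token(q, token, 1.0), q) <= delta:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_divergence(shift_toward_token(q, token, mid), q) <= delta:
            lo = mid  # still within budget, try a larger shift
        else:
            hi = mid  # over budget, shrink the shift
    return lo

# Hypothetical example values.
q = [0.6, 0.3, 0.1]
t = max_shift_within_budget(q, token=2, delta=0.05)
h = shift_toward_token(q, 2, t)
```

In this toy setting `h` gives the drafted token more probability than `q` does, so it would be accepted more often, while `kl_divergence(h, q)` never exceeds the budget. A heuristic threshold offers no comparable bound on how far the output distribution moves.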

Experimental Results

Empirical evaluation spans multiple open LLM series (Qwen 3, Gemma, DeepSeek R1, LLaMA), diverse benchmarks (GSM8K, IFEval, GPQA), and incorporates both strict-match accuracy and throughput metrics (acceptance length, rejection rate, wall-time speedup). The analysis shows:

  • Acceptance Efficiency: Cactus significantly elevates mean accepted token length (AL) compared to SpS (e.g., 7.61 vs. 5.44 on GSM8K with $m=20$, Qwen 3 14B/0.6B, for $\delta=1.0$), and outperforms or matches TAS, while maintaining lower or equivalent rejection rates.
  • Quality-Robustness Tradeoff: Unlike TAS, whose increased acceptance can cause measurable accuracy drops on tasks sensitive to distributional shifts (e.g., GPQA), Cactus preserves and often improves strict-match accuracy, illustrating effective fidelity control.
  • Scalability: Cactus generalizes well across model sizes (up to 32B verifier), model architectures, and increasingly capable drafters (with throughput gains scaling as draft model quality improves).
  • Wall-Time Speedups: On realistic hardware (A100 GPUs), Cactus approaches a $1.9\times$ speedup over baseline decoding and systematically beats SpS and TAS in all tested settings.
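The link between acceptance rate, accepted length, and wall-time speedup can be illustrated with the standard idealized model from the speculative decoding literature. The sketch assumes each drafted token is accepted independently with probability `alpha`, a draft length `m`, and a hypothetical `cost_ratio` (time of one draft step divided by time of one verifier step). It is not fitted to the paper's numbers, and the paper's AL metric may be defined differently, for example by excluding the extra token emitted after each verification.

```python
def expected_tokens_per_round(alpha, m):
    """Expected tokens produced per verification round with draft length m,
    assuming i.i.d. acceptance with probability alpha (idealized model).
    The count includes the one extra token emitted by the verifier."""
    if alpha >= 1.0:
        return float(m + 1)
    return (1.0 - alpha ** (m + 1)) / (1.0 - alpha)

def idealized_speedup(alpha, m, cost_ratio):
    """Idealized wall-time improvement over plain decoding.

    Each round costs m draft steps plus one verifier step, measured in
    units of one verifier step.
    """
    return expected_tokens_per_round(alpha, m) / (m * cost_ratio + 1.0)

# Raising the acceptance probability raises both quantities.
low = expected_tokens_per_round(0.8, 20)
high = expected_tokens_per_round(0.9, 20)
```

Under this model, raising the acceptance probability increases the tokens produced per round and therefore the speedup, provided the drafter is cheap relative to the verifier.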

The work provides further ablations: comparison with alternative speculative sampling variants (e.g., Mentored, SpecCas, top-$k$ acceptance), analysis of the acceptance-quality frontiers, and large-scale evaluation on Spec-Bench, revealing consistently favorable speedup/quality tradeoffs.

Implications and Future Directions

Cactus advances practical large-scale LLM deployment by making high-fidelity speculative sampling directly accessible without additional model-specific training. Theoretically, the constrained acceptance framework provides a unifying lens for analyzing lossless and lossy acceleration schemes. Practically, Cactus can be immediately integrated into inference engines (e.g., vLLM, HuggingFace Transformers), further accelerating LLM operations in both latency- and throughput-bound deployment scenarios.

The separation of acceptance control from drafter design allows future work to combine Cactus with other hardware- and memory-optimization techniques (FlashAttention, quantization, model distillation). Immediate research avenues include extending divergence control to more expressive families, incorporating multi-drafter/multi-verifier cascades, and leveraging ensemble effects to enhance reliability.

The empirical utility of Cactus for applications demanding both speed and strict adherence to verifier semantics (e.g., clinical text generation, legal analysis, code synthesis) is significant. As LLM scaling laws continue to drive model size and cost upwards, flexible, robust speculative sampling will become increasingly critical for sustainable, high-performance AI inference.

Conclusion

Cactus introduces a theoretically sound, training-free method to improve speculative sampling for LLM auto-regressive decoding. The method increases acceptance rate and decoding throughput by appropriately relaxing strict distributional matching, under explicit divergence constraints, thereby reconciling output fidelity with efficiency. Experimentally, Cactus achieves state-of-the-art throughput/quality trade-offs across models, tasks, and hardware settings. This work sets a rigorous foundation for future research and development in efficient, scalable LLM deployment.
