- The paper introduces a constrained optimization framework for speculative sampling that trades fidelity to the verifier against higher token acceptance.
- It applies a second-order Taylor approximation of the KL divergence to obtain an efficient update to the acceptance rule, improving decoding throughput.
- Experimental results show up to 1.9× speedup across multiple LLMs while maintaining high accuracy and robust output quality.
Constrained Acceptance Speculative Sampling for Accelerated Auto-Regressive Decoding
Summary and Theoretical Contributions
The paper "Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling" (2604.04987) introduces a principled, training-free approach for speeding up auto-regressive decoding in LLMs, addressing limitations of both standard speculative sampling (SpS) and typical acceptance sampling (TAS). The work establishes a constrained optimization framework for speculative sampling, enabling a precise trade-off between distributional fidelity to the verifier and increased acceptance rates of the drafter's proposals.
The core insight is that strict equivalence between the output distribution and the verifier's distribution, as enforced by SpS, is unnecessarily restrictive. The constrained framework allows a controlled divergence from the verifier (measured by an f-divergence, in particular the KL divergence), which is shown to increase decoding efficiency while maintaining output quality. The resulting Cactus algorithm modifies the acceptance rule and the recovery strategy so that the output distribution stays within a provable divergence bound δ of the verifier, while empirically achieving higher acceptance rates and throughput.
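For reference, the strict-matching baseline that this relaxation starts from can be sketched as follows. This is a minimal illustration of the standard speculative sampling accept/reject step, not code from the paper; the function names, the toy three-token distributions, and the convention that p is the drafter and q is the verifier are assumptions made here.

```python
import numpy as np

def expected_acceptance(p_draft, q_verifier):
    """Probability that a drafted token is accepted: sum_x min(p(x), q(x))."""
    return float(np.minimum(p_draft, q_verifier).sum())

def sps_step(p_draft, q_verifier, rng):
    """One token of standard speculative sampling. Returns (token, accepted)."""
    x = rng.choice(len(p_draft), p=p_draft)            # drafter proposes x ~ p
    accept_prob = min(1.0, q_verifier[x] / p_draft[x])  # strict-matching rule
    if rng.random() < accept_prob:
        return int(x), True
    # On rejection, resample from the normalized positive part of (q - p).
    residual = np.maximum(q_verifier - p_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual)), False

# Toy distributions over a 3-token vocabulary (illustrative values only).
p = np.array([0.5, 0.3, 0.2])   # drafter
q = np.array([0.2, 0.3, 0.5])   # verifier
rng = np.random.default_rng(0)
samples = [sps_step(p, q, rng)[0] for _ in range(20000)]
empirical = np.bincount(samples, minlength=3) / len(samples)
```

In this sketch the empirical output frequencies track q regardless of how poor the drafter is, which is the lossless property; the cost is that the acceptance probability is capped by the overlap between p and q.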
Algorithmic Advances
The paper formalizes speculative sampling as a constrained optimization problem. At each decoding step, rather than strictly matching the verifier's distribution, Cactus seeks a distribution h that maximizes acceptance of draft tokens subject to the constraint D_f(h ∥ q) ≤ δ, where q is the verifier's distribution. The solution for h, specialized to the KL divergence (which yields the Cactus algorithm), is obtained via a second-order Taylor approximation centered at the verifier's probability q. This results in an efficient, element-wise update to the acceptance rule that accommodates large vocabulary sizes without additional memory or computational burden.
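The effect of the second-order approximation can be illustrated with a small sketch. It assumes the only goal is to raise the probability t = h(x) of a single drafted token x; the KL-minimal way to do that rescales all other verifier probabilities proportionally, so the constraint collapses to a binary KL between t and q(x). Expanding that binary KL to second order around t = q(x) gives (t − q(x))² / (2 q(x)(1 − q(x))) ≤ δ. This reduction, the closed form below, and all function names are a derivation made here for illustration and may differ from the paper's exact update.

```python
import numpy as np

def binary_kl(t, q):
    """KL(Bernoulli(t) || Bernoulli(q)) for q strictly between 0 and 1."""
    t = np.clip(t, 1e-12, 1 - 1e-12)   # avoid log(0) at the endpoints
    return t * np.log(t / q) + (1 - t) * np.log((1 - t) / (1 - q))

def relaxed_target_taylor(q, delta):
    """Element-wise second-order estimate of the largest t with KL <= delta."""
    q = np.asarray(q, dtype=float)
    return np.minimum(1.0, q + np.sqrt(2.0 * delta * q * (1.0 - q)))

def relaxed_target_exact(q_x, delta, iters=60):
    """Bisection for the largest t in [q_x, 1] with binary_kl(t, q_x) <= delta."""
    lo, hi = q_x, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if binary_kl(mid, q_x) <= delta:
            lo = mid      # constraint satisfied, move up
        else:
            hi = mid      # constraint violated, move down
    return lo

def relaxed_accept_prob(p_x, q_x, delta):
    """Hypothetical acceptance rule: min(1, t / p(x)) with the relaxed target t."""
    return min(1.0, float(relaxed_target_taylor(q_x, delta)) / p_x)

# Illustrative values: verifier assigns 0.2 to the drafted token, drafter 0.5.
t_taylor = float(relaxed_target_taylor(0.2, 0.05))
t_exact = relaxed_target_exact(0.2, 0.05)
strict = relaxed_accept_prob(0.5, 0.2, 0.0)     # delta = 0 recovers min(1, q/p)
relaxed = relaxed_accept_prob(0.5, 0.2, 0.05)   # a larger budget accepts more often
```

With δ = 0 the relaxed target equals q(x) and the rule reduces to the standard acceptance probability. The second-order estimate can undershoot or overshoot the exact boundary depending on q(x), which is why the bisection is included as a check; only the closed form is element-wise and vectorizes over a full vocabulary.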
Unlike TAS, which heuristically increases acceptance at the risk of semantic drift, Cactus rigorously enforces a divergence constraint so that generated outputs do not deviate excessively from the verifier's output distribution. The theoretical analysis includes optimality guarantees for the acceptance rate achieved under the divergence budget, a generalization to other f-divergences, and a proof that cross-entropy-constrained variants (as in TAS) are suboptimal.
Experimental Results
Empirical evaluation spans multiple open LLM series (Qwen 3, Gemma, DeepSeek R1, LLaMA), diverse benchmarks (GSM8K, IFEval, GPQA), and incorporates both strict-match accuracy and throughput metrics (acceptance length, rejection rate, wall-time speedup). The analysis shows:
- Acceptance Efficiency: Cactus raises mean accepted token length (AL) relative to SpS (e.g., 7.61 vs. 5.44, about 40% longer, on GSM8K with m=20, Qwen 3 14B/0.6B, δ=1.0) and matches or outperforms TAS, while maintaining lower or equivalent rejection rates.
- Quality-Robustness Tradeoff: Unlike TAS, whose increased acceptance can cause measurable accuracy drops on tasks sensitive to distributional shifts (e.g., GPQA), Cactus preserves and often improves strict-match accuracy, illustrating effective fidelity control.
- Scalability: Cactus generalizes well across model sizes (up to 32B verifier), model architectures, and increasingly capable drafters (with throughput gains scaling as draft model quality improves).
- Wall-Time Speedups: On A100 GPUs, Cactus approaches a 1.9× speedup over baseline decoding and consistently outperforms SpS and TAS in all tested settings.
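The throughput metrics in the list above can be made concrete with a small helper. The definitions used here are assumptions, since this summary does not spell them out: accepted length is taken as the mean number of draft tokens accepted per verification round, and rejection rate as the fraction of rounds in which at least one of the m draft tokens was rejected. The function name and the sample log are hypothetical.

```python
def acceptance_metrics(accepted_per_round, draft_len):
    """Return (mean accepted length, rejection rate) under the assumed definitions.

    accepted_per_round: number of draft tokens accepted in each verification round.
    draft_len: number of tokens m proposed by the drafter per round.
    """
    rounds = len(accepted_per_round)
    if rounds == 0:
        raise ValueError("need at least one verification round")
    mean_accepted = sum(accepted_per_round) / rounds
    # A round counts as rejected when fewer than draft_len tokens were accepted.
    rejected = sum(1 for a in accepted_per_round if a < draft_len)
    return mean_accepted, rejected / rounds

# Hypothetical log of four rounds with m = 20 draft tokens each.
mean_al, rej_rate = acceptance_metrics([20, 5, 8, 0], draft_len=20)
```

Other conventions exist, for example counting the extra token emitted by the verifier in each round, so reported numbers are only comparable when the same definition is used.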
The work provides further ablations: comparison with alternative speculative sampling variants (e.g., Mentored, SpecCas, top-δ0 acceptance), analysis of the acceptance-quality frontiers, and large-scale evaluation on Spec-Bench, revealing consistently favorable speedup/quality tradeoffs.
Implications and Future Directions
Cactus advances practical large-scale LLM deployment by making high-fidelity speculative sampling available without additional model-specific training. Theoretically, the constrained acceptance framework provides a unifying lens for analyzing lossless and lossy acceleration schemes. Practically, because Cactus is training-free, it can be integrated into inference engines (e.g., vLLM, Hugging Face Transformers) to reduce decoding cost in both latency- and throughput-bound deployment scenarios.
The separation of acceptance control from drafter design allows future work to combine Cactus with other hardware- and memory-optimization techniques (FlashAttention, quantization, model distillation). Immediate research avenues include extending divergence control to more expressive families, incorporating multi-drafter/multi-verifier cascades, and leveraging ensemble effects to enhance reliability.
The empirical utility of Cactus for applications demanding both speed and strict adherence to verifier semantics (e.g., clinical text generation, legal analysis, code synthesis) is significant. As LLM scaling laws continue to drive model size and cost upwards, flexible, robust speculative sampling will become increasingly critical for sustainable, high-performance AI inference.
Conclusion
Cactus introduces a theoretically sound, training-free method to improve speculative sampling for LLM auto-regressive decoding. The method increases acceptance rate and decoding throughput by appropriately relaxing strict distributional matching, under explicit divergence constraints, thereby reconciling output fidelity with efficiency. Experimentally, Cactus achieves state-of-the-art throughput/quality trade-offs across models, tasks, and hardware settings. This work sets a rigorous foundation for future research and development in efficient, scalable LLM deployment.