
Reward-Guided Speculative Decoding for Efficient LLM Reasoning (2501.19324v3)

Published 31 Jan 2025 in cs.CL and cs.AI

Abstract: We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in LLMs. RSD synergistically combines a lightweight draft model with a more powerful target model, incorporating a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD employs a process reward model to evaluate intermediate decoding steps and dynamically decide whether to invoke the target model, optimizing the trade-off between computational cost and output quality. We theoretically demonstrate that a threshold-based mixture strategy achieves an optimal balance between resource utilization and performance. Extensive evaluations on challenging reasoning benchmarks, including Olympiad-level tasks, show that RSD delivers significant efficiency gains against decoding with the target model only (up to 4.4x fewer FLOPs), while achieving significantly better accuracy than parallel decoding methods on average (up to +3.5). These results highlight RSD as a robust and cost-effective approach for deploying LLMs in resource-intensive scenarios. The code is available at https://github.com/BaohaoLiao/RSD.

Summary

  • The paper introduces Reward-Guided Speculative Decoding (RSD) to combine draft and target models using reward evaluations for efficient LLM reasoning.
  • It employs a threshold-based mixture strategy that reduces computational cost by up to 4.4x while boosting accuracy by up to 3.5 points on reasoning benchmarks.
  • Experimental results on tasks like MATH500 demonstrate the approach's scalability and robustness in managing resource-intensive inference scenarios.

Reward-Guided Speculative Decoding for Efficient LLM Reasoning

Introduction

The paper introduces Reward-Guided Speculative Decoding (RSD), a framework that enhances efficiency in LLM inference by combining a draft model and a target model with a focus on reward signals. This approach contrasts with standard speculative decoding (SD), which relies on strict token matching, often leading to computational inefficiency when tokens are mismatched. RSD employs a reward model to evaluate draft outputs and decide whether the target model should be invoked, optimizing the trade-off between computational cost and output quality. Extensive evaluations demonstrate that RSD provides significant efficiency gains over traditional methods, achieving up to 4.4x fewer FLOPs while improving accuracy by up to 3.5 points on reasoning benchmarks.

Figure 1: Reward-Guided Speculative Decoding (RSD) improves efficiency by refining draft outputs based on reward signals.

Methodology

RSD integrates a lightweight draft model with a more capable target model, prioritizing high-reward outputs through a controlled bias. Unlike traditional SD methods that enforce strict unbiasedness, RSD employs a reward model to adaptively select high-value draft outputs.

The process begins with a draft model generating candidate steps, which are evaluated using a reward function. If a step's reward score is sufficiently high, it is accepted to continue the reasoning trajectory. Otherwise, the target model is invoked to refine the outputs. This method allows for greater flexibility in exploring diverse completions, reducing the overhead typically associated with strict token matching.

The paper also presents a theoretical framework for RSD, demonstrating that a threshold-based mixture strategy achieves an optimal balance between efficiency and performance. The mixture distribution $\mathbf{P}_{\text{RSD}}$ balances contributions from both models, guided by a dynamically adjusted weighting function.
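
Written out under the notation above, one schematic form of this mixture is shown below; the symbols $P_m$ (draft model) and $P_M$ (target model) are introduced here for illustration and may differ from the paper's exact notation:

$$
\mathbf{P}_{\text{RSD}}(z \mid x, z_{<t}) \;=\; \omega(r)\, P_{m}(z \mid x, z_{<t}) \;+\; \bigl(1 - \omega(r)\bigr)\, P_{M}(z \mid x, z_{<t}),
$$

where $x$ is the prompt, $z_{<t}$ are the previously accepted steps, and $r$ is the reward assigned to the draft's candidate step. Choosing a binary $\omega(r)$ recovers the threshold-based mixture strategy analyzed in the paper.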

RSD Algorithm

The RSD algorithm involves the following steps (a minimal code sketch appears after the list):

  1. Draft Step Generation: The draft model generates a candidate step given the prompt and previous outputs.
  2. Reward Computation: The reward function evaluates the quality of this candidate step.
  3. Adaptive Acceptance: If the candidate's reward score meets a predefined criterion, the step is accepted. Otherwise, the target model generates a substitute step.
  4. Iteration: This process iterates until an end-of-sequence (EOS) token is generated or the maximum sequence length is reached.
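
The loop below is a minimal Python sketch of these four steps. The names `draft_model`, `target_model`, `reward_fn`, the threshold value, and the end-of-sequence convention are illustrative placeholders rather than the paper's actual implementation; see the linked repository for the real code.

```python
# Minimal sketch of the reward-guided speculative decoding loop described above.
# All objects are hypothetical placeholders; steps are assumed to be strings.

def rsd_decode(prompt, draft_model, target_model, reward_fn,
               threshold=0.7, max_steps=64):
    """Generate a reasoning trajectory step by step.

    Each candidate step comes from the cheap draft model; the expensive
    target model is only invoked when the draft step's reward is too low.
    """
    trajectory = []
    for _ in range(max_steps):
        # 1. Draft step generation: propose a candidate step.
        candidate = draft_model.generate_step(prompt, trajectory)

        # 2. Reward computation: score the candidate with the process reward model.
        reward = reward_fn(prompt, trajectory, candidate)

        # 3. Adaptive acceptance: keep the draft step if it scores high enough,
        #    otherwise fall back to the target model for this step.
        if reward >= threshold:
            step = candidate
        else:
            step = target_model.generate_step(prompt, trajectory)

        trajectory.append(step)

        # 4. Iteration: stop at end-of-sequence or when max_steps is reached.
        if step.endswith("<eos>"):
            break
    return trajectory
```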

Technical Implementation

The weighting function $\omega(r)$ determines the contribution of the draft model within the mixture distribution. Several variants of $\omega(r)$ are proposed, including constant, binary, clipping, sigmoidal, and logistic transformations. A key insight is that the optimal weighting function maximizes the reward by selectively using outputs of the draft model for high-reward regions, while the target model serves as a fallback for low-quality outputs.

Figure 2: Comparison of reward scores and winning rates between draft and target models within the RSD framework.
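
As a rough illustration of the variants listed above, the snippet below sketches a few possible shapes for $\omega(r)$. The exact parameterizations used in the paper may differ, so treat these as assumptions.

```python
import math

# Illustrative shapes for the weighting-function variants named in the text.
# Parameter values (delta, temperature, clipping band) are assumptions.

def omega_constant(r, c=0.5):
    # Constant mixing weight, independent of the reward.
    return c

def omega_binary(r, delta=0.7):
    # Threshold rule: fully trust the draft step only when its reward clears delta.
    return 1.0 if r >= delta else 0.0

def omega_clipping(r, low=0.3, high=0.9):
    # Linearly map rewards in [low, high] to [0, 1] and clip outside that band.
    return min(1.0, max(0.0, (r - low) / (high - low)))

def omega_sigmoid(r, delta=0.7, temperature=0.1):
    # Smooth threshold: a logistic curve centered at delta.
    return 1.0 / (1.0 + math.exp(-(r - delta) / temperature))
```

A binary $\omega$ corresponds to the threshold-based acceptance rule in the algorithm above, while the sigmoidal and logistic variants soften that decision.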

Performance Evaluation

RSD outperforms traditional speculative decoding and Best-of-$N$ techniques across multiple reasoning benchmarks such as MATH500 and Olympiad-level tasks. The framework not only improves reasoning accuracy but also significantly reduces computational costs. The approach demonstrates versatility across different model configurations, showcasing robustness and scalability.

Efficiency Gains

Experimental results highlight that RSD with a 7B draft model and a 72B target model achieves higher accuracy than the 72B target model alone, while reducing computational cost by up to 4.4x, demonstrating the efficiency and practicality of RSD in real-world applications.

Figure 3: FLOPs vs. accuracy on MATH500, illustrating RSD's superior efficiency and performance.
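
As a back-of-the-envelope check (not from the paper), assume per-token compute scales linearly with parameter count and that the 7B draft runs on every step while the 72B target is invoked on a fraction $p$ of steps:

$$
\frac{72}{7 + 72\,p} \approx 4.4 \quad\Longrightarrow\quad p \approx \frac{72/4.4 - 7}{72} \approx 0.13,
$$

so the reported 4.4x FLOPs reduction is consistent with calling the target model on only about 13% of steps under this simplified cost model.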

Conclusion

Reward-Guided Speculative Decoding presents a promising advancement in LLM inference, offering a scalable solution that balances computational efficiency with reasoning accuracy. This framework is well-suited for resource-intensive scenarios, providing substantial performance improvements while minimizing overhead. Future work may explore the integration of more sophisticated reward models and the extension of RSD to other complex tasks and domains.
