Spurious Rewards: Rethinking Training Signals in RLVR (2506.10947v1)

Published 12 Jun 2025 in cs.AI and cs.LG

Abstract: We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR improves MATH-500 performance for Qwen2.5-Math-7B in absolute points by 21.4% (random reward), 13.8% (format reward), 24.1% (incorrect label), 26.0% (1-shot RL), and 27.1% (majority voting) -- nearly matching the 29.1% gained with ground truth rewards. However, the spurious rewards that work for Qwen often fail to yield gains with other model families like Llama3 or OLMo2. In particular, we find code reasoning -- thinking in code without actual code execution -- to be a distinctive Qwen2.5-Math behavior that becomes significantly more frequent after RLVR, from 65% to over 90%, even with spurious rewards. Overall, we hypothesize that, given the lack of useful reward signal, RLVR must somehow be surfacing useful reasoning representations learned during pretraining, although the exact mechanism remains a topic for future work. We suggest that future RLVR research should possibly be validated on diverse models rather than a single de facto choice, as we show that it is easy to get significant performance gains on Qwen models even with completely spurious reward signals.

Summary

  • The paper demonstrates that RLVR with spurious reward signals, including random, format, and incorrect rewards, can significantly improve performance in Qwen2.5-Math models.
  • It reveals that increased reliance on code reasoning—rising from 65% to over 90%—drives accuracy gains on benchmarks like MATH-500.
  • The study warns that these benefits are highly model-dependent, underscoring the need for evaluation across diverse model families.

This paper, "Spurious Rewards: Rethinking Training Signals in RLVR" (2506.10947), investigates the effectiveness of reinforcement learning with verifiable rewards (RLVR) when using weak or "spurious" reward signals, specifically focusing on their impact on LLMs for mathematical reasoning tasks. The central, counterintuitive finding is that RLVR can significantly improve the performance of certain models, particularly the Qwen2.5-Math family, even when the rewards are random, based on incorrect labels, or only check for output format.

The authors design a series of binary (0-1) reward functions that progressively weaken the signal quality (a minimal sketch of each appears after the list):

  • Ground Truth Rewards: Standard RLVR, rewarding verifiably correct answers. Serves as a baseline.
  • Majority Vote Rewards: Rewards based on whether the model's output matches the majority answer from multiple samples on the training data (pseudo-labeling).
  • Format Rewards: Rewards only based on whether the output contains the expected formatting element (e.g., \boxed{}), irrespective of mathematical correctness.
  • Random Rewards: Rewards assigned randomly (Bernoulli distribution) independent of the output content.
  • Incorrect Rewards: Rewards based on whether the output matches pre-computed incorrect labels for the training data.
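
A minimal sketch of these reward functions, assuming answers are read from a \boxed{...} span (the `extract_boxed_answer` helper and the exact matching logic are illustrative, not the paper's implementation):

```python
import random
import re
from collections import Counter

def extract_boxed_answer(response: str) -> str | None:
    """Hypothetical helper: pull the content of the last \\boxed{...} span."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else None

def ground_truth_reward(response: str, gold: str) -> int:
    """1 if the extracted answer matches the ground-truth label."""
    return int(extract_boxed_answer(response) == gold)

def majority_vote_reward(response: str, sampled_answers: list[str]) -> int:
    """1 if the answer matches the majority answer over multiple samples (pseudo-label)."""
    pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
    return int(extract_boxed_answer(response) == pseudo_label)

def format_reward(response: str) -> int:
    """1 if the response merely contains a \\boxed{...} element, correct or not."""
    return int("\\boxed{" in response)

def random_reward(response: str, p: float = 0.5) -> int:
    """Bernoulli(p) reward, independent of the response content."""
    return int(random.random() < p)

def incorrect_reward(response: str, incorrect_label: str) -> int:
    """1 if the answer matches a pre-computed *incorrect* label for the problem."""
    return int(extract_boxed_answer(response) == incorrect_label)
```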

Using the GRPO (Group Relative Policy Optimization) algorithm for RLVR, the paper shows that training Qwen2.5-Math models (1.5B and 7B parameters) with these spurious rewards yields substantial performance gains on benchmarks like MATH-500, AMC, and AIME (detailed results in Appendix A). For instance, on MATH-500, training Qwen2.5-Math-7B with incorrect rewards resulted in a 24.1% absolute accuracy gain, compared to 29.1% with ground truth rewards. Even random rewards produced a 21.4% gain. This suggests that, at least for Qwen2.5-Math, RLVR might be eliciting useful pre-existing capabilities rather than teaching new ones based on precise reward information.
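
GRPO turns such per-response rewards into group-relative advantages by normalizing within the set of responses sampled for the same prompt. A minimal sketch of that step (the full GRPO objective, including clipping and KL regularization, is omitted):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: z-score each sampled response's reward
    against the other responses drawn for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 rollouts for one prompt, binary rewards from some verifier.
group_rewards = np.array([1, 0, 0, 1, 1, 0, 0, 0], dtype=float)
print(grpo_advantages(group_rewards))
```

Note that an information-free signal such as the random reward still yields nonzero (but zero-mean, noisy) advantages whenever the sampled rewards within a group differ, which is why the clipping analysis later in this summary matters.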

However, this effect does not generalize to other model families. The paper extends the experiments to general-purpose Qwen2.5 variants, Llama3.1/3.2 models, and OLMo2 models. These models show minimal or even negative performance changes when trained with spurious rewards; only training with ground truth or high-quality pseudo-labels (majority vote) consistently improves non-Qwen models. This highlights that the effectiveness of spurious rewards is highly model-dependent, likely due to differences in pretraining distributions and learned behaviors. The authors issue a practical warning based on this finding: RLVR research validated solely on Qwen models might not generalize, and evaluation should be conducted across diverse model families.

To understand the discrepancy, the paper analyzes the reasoning strategies employed by different models, focusing on Qwen2.5-Math-7B. They identify "code reasoning"—the model generating Python code to aid its calculations, despite not using a real interpreter—as a prevalent and highly effective strategy in Qwen2.5-Math models before RLVR training. Responses using code reasoning were significantly more accurate than those using natural language alone (60.9% vs. 35.0% on MATH-500 for Qwen2.5-Math-7B).
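
The paper's classifier for this behavior is not reproduced here; a keyword-based heuristic along these lines (hypothetical) illustrates how responses might be bucketed into code reasoning versus natural language and their accuracies compared:

```python
import re

# Hypothetical markers of Python-like text in a model response.
CODE_MARKERS = re.compile(r"```python|def |import |print\(")

def uses_code_reasoning(response: str) -> bool:
    """Treat a response as 'code reasoning' if it contains Python-like text,
    even though no interpreter is ever actually run."""
    return bool(CODE_MARKERS.search(response))

def accuracy_by_strategy(responses: list[str], correct: list[bool]) -> dict[str, float]:
    """Compare accuracy of code-reasoning vs. natural-language responses."""
    buckets: dict[str, list[bool]] = {"code": [], "language": []}
    for resp, ok in zip(responses, correct):
        buckets["code" if uses_code_reasoning(resp) else "language"].append(ok)
    return {k: sum(v) / len(v) for k, v in buckets.items() if v}
```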

Tracing the training process, the authors found that RLVR with spurious rewards strongly correlates with an increase in code reasoning frequency (from 65% to over 90%) and overall accuracy. This supports the hypothesis that spurious rewards in Qwen2.5-Math primarily work by upweighting this beneficial pre-existing behavior. A detailed analysis using "reasoning strategy switches" shows that a large portion of the performance gain comes from problems where the model switches from natural language to code reasoning after RLVR (58.3% of the gain for Qwen2.5-Math-7B).
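
A sketch of such a decomposition, assuming per-problem strategy labels and correctness are available before and after RLVR (the paper's exact accounting may differ):

```python
from collections import Counter

def switch_gain_decomposition(before, after):
    """before / after: per-problem (strategy, is_correct) pairs for the same
    problems pre- and post-RLVR, with strategy in {'code', 'language'}.
    Returns the net accuracy change (newly solved minus newly failed)
    attributed to each strategy transition, e.g. 'language->code'."""
    gains = Counter()
    for (s_before, ok_before), (s_after, ok_after) in zip(before, after):
        gains[f"{s_before}->{s_after}"] += int(ok_after) - int(ok_before)
    return dict(gains)

# The 'language->code' bucket would capture problems solved only after the
# model switched from natural-language to code reasoning.
```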

To further validate this, the authors intervened on code reasoning frequency (a sketch of these intervention rewards follows the list):

  • Inducing code reasoning: Prompting models to use Python or training them with a reward for generating "python" significantly increased accuracy on Qwen2.5-Math models but often degraded performance on others.
  • Inhibiting code reasoning: Training with compound rewards (e.g., Format + no Python) reduced gains on Qwen2.5-Math-7B when compared to the original spurious reward, though some gains persisted (potentially due to eliciting other beneficial patterns like reduced repetition, see Appendix D). For models where code reasoning was ineffective ("Bad-Code" models like OLMo2-7B-SFT), inhibiting code reasoning with compound rewards actually improved performance.
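
A sketch of these intervention rewards as simple string checks (illustrative; the paper's exact matching rules may differ):

```python
def python_reward(response: str) -> int:
    """Reward 1 simply for mentioning 'python', used to *induce* code reasoning."""
    return int("python" in response.lower())

def format_no_python_reward(response: str) -> int:
    """Compound reward used to *inhibit* code reasoning: the response must
    contain a \\boxed{...} answer and must not mention Python at all."""
    has_format = "\\boxed{" in response
    return int(has_format and "python" not in response.lower())
```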

The paper also explores how random rewards can provide a training signal despite being information-free. They analyze GRPO's clipping mechanism, showing that it introduces a bias that encourages the policy to concentrate probability mass on behaviors that were already high-probability under the old policy (π_old). This "clipping bias" provides a non-zero expected gradient even with random advantages. For Qwen2.5-Math, this means high-probability behaviors like code reasoning get reinforced, leading to performance gains. Ablating the clipping term empirically shows that random rewards no longer provide consistent improvements without this bias.
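
A toy numerical sketch (an illustration, not the paper's derivation) of how clipping breaks the symmetry of a zero-mean random advantage:

```python
import numpy as np

def clipped_surrogate(ratio: np.ndarray, adv: np.ndarray, eps: float = 0.2) -> np.ndarray:
    """Per-token PPO/GRPO-style clipped objective:
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)

# With zero-mean random advantages (an information-free reward), the clipped
# objective's mean is noticeably nonzero, while the unclipped objective's mean
# is ~0. This mirrors the paper's ablation, in which random rewards stop
# providing consistent gains once the clipping term is removed.
rng = np.random.default_rng(0)
ratios = rng.uniform(0.5, 1.5, size=200_000)   # hypothetical pi_new / pi_old ratios
advs = rng.choice([-1.0, 1.0], size=200_000)   # random, zero-mean advantages
print("clipped  :", clipped_surrogate(ratios, advs).mean())
print("unclipped:", (ratios * advs).mean())
```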

In conclusion, the paper demonstrates that RLVR on certain models, like Qwen2.5-Math, can surprisingly improve performance even with weak or spurious rewards by amplifying pre-existing, useful reasoning strategies (like code reasoning). This effect is not universal and depends heavily on the base model's priors. The findings suggest that future RLVR research should be rigorously evaluated on diverse models and consider how optimization biases might interact with pre-trained capabilities.
