Reinforcement Learning for Reasoning in Large Language Models with One Training Example (2504.20571v2)

Published 29 Apr 2025 in cs.LG, cs.AI, and cs.CL

Abstract: We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the mathematical reasoning capabilities of LLMs. Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6%, and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7%. This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which includes the aforementioned example. Furthermore, RLVR with only two examples even slightly exceeds these results (MATH500: 74.8%, average: 36.6%). Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples (when employed as a single training example). In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by incorporating entropy loss with an appropriate coefficient) in 1-shot RLVR training. We also further discuss related observations about format correction, label robustness and prompt modification. These findings can inspire future work on RLVR efficiency and encourage a re-examination of recent progress and the underlying mechanisms in RLVR. Our code, model, and data are open source at https://github.com/ypwang61/One-Shot-RLVR.

Summary

  • The paper demonstrates that one-shot RLVR boosts the math reasoning of Qwen2.5-Math-1.5B on the MATH500 benchmark from 36.0% to 73.6%.
  • The method uses a duplicated single example in an RL framework with policy gradient, KL, and entropy losses to refine model performance.
  • Results reveal strong cross-domain generalization and increased self-reflection in outputs, underscoring notable data efficiency.

This paper (2504.20571) presents a surprising finding: Reinforcement Learning with Verifiable Reward (RLVR) can significantly enhance the mathematical reasoning capabilities of LLMs using only one training example. This "1-shot RLVR" setup, applied to the Qwen2.5-Math-1.5B model, dramatically improved its performance on the challenging MATH500 benchmark from 36.0% to 73.6% and its average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7%. These results are comparable to or even slightly better than training with a subset of thousands of examples. The paper demonstrates that this phenomenon holds across different models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO, PPO), and a variety of mathematical examples.

Implementation Details:

The core approach utilizes Reinforcement Learning with Verifiable Reward (RLVR), where the reward is a binary signal (0 or 1) indicating whether the final answer of the model's generated solution is correct. The RL algorithms used are GRPO and PPO. The training process involves sampling multiple responses (e.g., 8) for a given prompt using the current policy, evaluating their correctness to obtain rewards, and then using these rewards to update the model parameters.

For the 1-shot (or few-shot) setting, the chosen single example (or few examples) is duplicated repeatedly to fill the training batch size (e.g., 128). This allows the standard RL training pipeline, which expects batches of data, to be used.
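
To make this concrete, below is a minimal sketch of the 1-shot rollout loop described above. The helper names (`build_one_shot_batch`, `collect_rollouts`, the boxed-answer parser) and the `generate_fn` interface are illustrative assumptions, not the paper's actual code.

```python
import re

def build_one_shot_batch(example, batch_size=128):
    """Duplicate the single (prompt, answer) pair to fill the training batch."""
    return [example] * batch_size

def extract_final_answer(response):
    """Pull the last \\boxed{...} answer from a generated solution (simplified parser)."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1].strip() if matches else None

def verifiable_reward(response, ground_truth):
    """Binary verifiable reward: 1 if the final answer matches the label, else 0."""
    return 1.0 if extract_final_answer(response) == ground_truth else 0.0

def collect_rollouts(generate_fn, example, batch_size=128, group_size=8):
    """Sample `group_size` responses per prompt copy and score each with the binary reward."""
    rollouts = []
    for prompt, answer in build_one_shot_batch(example, batch_size):
        responses = generate_fn(prompt, n=group_size)  # policy sampling, e.g. temperature 0.6
        rewards = [verifiable_reward(r, answer) for r in responses]
        rollouts.append((prompt, responses, rewards))
    return rollouts
```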

The GRPO loss function used consists of three components:

  1. Policy Gradient Loss: Encourages responses with higher rewards. It uses group-normalized advantages, where responses with above-average rewards in a batch are reinforced, and those below average are penalized. The advantage $A_i$ for response $o_i$ is calculated as:

    $A_i = \frac{r_i - \operatorname{mean}\bigl(\{r_1, r_2, \dots, r_G\}\bigr)}{\operatorname{std}\bigl(\{r_1, r_2, \dots, r_G\}\bigr)}$

    where $r_i$ is the binary reward for $o_i$ and $G$ is the number of sampled responses (group size). The loss includes importance sampling and clipping terms similar to PPO.

    $\mathcal{L}_{\text{PG-GRPO}}(\theta) \propto -\sum \min\Big(\frac{\pi_\theta(o \mid q)}{\pi_{\theta_{\text{old}}}(o \mid q)}\,A,\ \operatorname{clip}\Big(\frac{\pi_\theta(o \mid q)}{\pi_{\theta_{\text{old}}}(o \mid q)},\, 1-\varepsilon,\, 1+\varepsilon\Big)A\Big)$

  2. KL Divergence Loss: Regularizes the policy update by penalizing large deviations from a reference model (often the initial base model), helping maintain general language quality.

    $\mathcal{L}_{\text{KL}}(\theta, \theta_{\text{ref}}) = \mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\theta_{\text{ref}}}\right)$

  3. Entropy Loss: With a negative coefficient, it encourages the model to produce more diverse reasoning paths during training rollouts.

    $\mathcal{L}_{\text{Entropy}}(\theta) \propto -\mathbb{E}\Big[\sum \text{Entropy}\bigl(\pi_\theta(\text{token} \mid \text{history})\bigr)\Big]$

The total GRPO loss is a weighted sum of these components:

$\mathcal{L}_{\text{GRPO}}(\theta) = \mathcal{L}_{\text{PG-GRPO}}(\theta) + \beta\,\mathcal{L}_{\text{KL}}(\theta, \theta_{\text{ref}}) + \alpha\,\mathcal{L}_{\text{Entropy}}(\theta)$

Typical hyperparameters mentioned include $\beta = 0.001$, $\alpha = -0.001$, a training rollout temperature of 0.6, a batch/mini-batch size of 128, and a learning rate of 1e-6.
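
A compact, sequence-level sketch of this combined loss is given below. It follows the formulas above but simplifies aggressively (no token-level masking or length normalization); the KL term uses the low-variance estimator common in GRPO implementations, and the entropy term is approximated from the sampled responses' log-probabilities. All names are illustrative assumptions, not taken from the paper's codebase.

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages A_i = (r_i - mean) / std over the G sampled responses.
    When every reward in the group is identical, the advantages collapse to zero."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

def grpo_loss(logp_new, logp_old, logp_ref, rewards,
              clip_eps=0.2, beta=0.001, alpha=-0.001):
    """Simplified GRPO objective. `logp_*` hold the summed log-probability of each
    sampled response under the current, rollout-time, and reference policies (shape [G])."""
    adv = grpo_advantages(rewards)

    # Clipped, importance-weighted policy-gradient surrogate (PPO-style).
    ratio = torch.exp(logp_new - logp_old)
    pg_loss = -torch.min(ratio * adv,
                         torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv).mean()

    # Monte Carlo estimate of KL(pi_theta || pi_ref) via the non-negative k3 estimator.
    log_ratio = logp_ref - logp_new
    kl_loss = (torch.exp(log_ratio) - log_ratio - 1).mean()

    # Entropy loss as defined above: proportional to the negative of an entropy estimate,
    # approximated here by the mean log-probability of the sampled responses.
    entropy_loss = logp_new.mean()

    return pg_loss + beta * kl_loss + alpha * entropy_loss
```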

Data selection for the single examples was explored using a simple historical variance score based on the variance of training accuracy across epochs on the full dataset. Examples with high historical variance were prioritized, though the paper shows that many examples, even with lower variance, can be effective.
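
As a rough illustration of this variance-based ranking (the exact scoring in the released code may differ), one might compute something like the following:

```python
import numpy as np

def historical_variance_ranking(accuracy_history):
    """Rank examples by the variance of their per-epoch training accuracy.

    `accuracy_history` has shape (num_examples, num_epochs): each entry is the average
    rollout accuracy of one example during one epoch of full-dataset RLVR training.
    Examples with high historical variance are prioritized as 1-shot candidates."""
    variances = np.var(accuracy_history, axis=1)
    return variances, np.argsort(-variances)  # highest-variance examples first

# Toy usage: 4 examples tracked over 5 epochs.
history = np.array([
    [0.0, 0.2, 0.5, 0.9, 1.0],  # accuracy climbs steadily -> high variance
    [1.0, 1.0, 1.0, 1.0, 1.0],  # always solved            -> zero variance
    [0.0, 0.0, 0.0, 0.0, 0.0],  # never solved              -> zero variance
    [0.4, 0.6, 0.5, 0.7, 0.6],  # mildly unstable           -> moderate variance
])
variances, ranking = historical_variance_ranking(history)
print(ranking)  # [0 3 1 2] (ties between the zero-variance rows broken by index)
```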

Key Findings and Phenomena:

  1. Effectiveness & Data Efficiency: The central finding is the remarkable data efficiency. Using 1 or 2 examples achieved performance comparable to training on 1.2k or even 7.5k examples (Figure 1, 2). This was consistent across different base models and RL algorithms (Table 3).
  2. Nature of Effective Examples: The most effective single examples were not necessarily the hardest; the base model could already solve them with a non-trivial probability (Section 3.1). Training seems to activate or stabilize existing capabilities rather than teaching new knowledge.
  3. Post-saturation Generalization: A notable phenomenon observed in 1-shot RLVR is that the model's test performance continues to improve for many training steps even after its training accuracy on the single example has saturated at 100% (Figure 3). Eventually the model overfits the training example, with its outputs on that example degrading into multilingual gibberish mixed with correct fragments, while its responses on test prompts remain coherent and accurate (Figure 4).
  4. Cross-Domain Generalization: Training on a single example from one mathematical domain (e.g., Algebra or Geometry) significantly improved the model's performance across all mathematical domains, not just that of the training example. Counterintuitively, the improvements were not necessarily strongest in the training example's own domain (Table 2, Figure 5 in Appendix). 1-shot training on math examples also improved performance on non-mathematical tasks like ARC (Table 1).
  5. Increased Self-Reflection: Training with a single example led to an increased frequency of self-reflective terms like "recheck," "rethink," and "recalculate" in the model's outputs on test data, especially in later training stages corresponding to increased response length and entropy (Figure 6); a simple keyword-counting sketch follows this list.
  6. Policy Gradient is Key, Entropy Enhances: Ablation studies showed that the policy gradient loss is the primary driver of the performance improvement, distinguishing this phenomenon from "grokking" which is often attributed to regularization methods like weight decay. Adding entropy loss further enhanced performance and post-saturation generalization (Table 4, Figure 7).
  7. Entropy-Loss-Only Training & Label Robustness: Surprisingly, optimizing only the entropy loss (without any reward signal) on a single example could still yield non-trivial performance improvements (Table 4, 5). The paper also investigated label correctness, finding that minor label errors (like 12.8 vs 12.7) didn't hurt much, but overfitting to a significantly incorrect yet guessable label (like 4) led to worse performance than a completely unguessable wrong label (Table 4).
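
For item 5, the self-reflection frequency can be measured along these lines (a hypothetical sketch; the paper's exact counting procedure may differ):

```python
def count_reflection_terms(responses, terms=("recheck", "rethink", "recalculate")):
    """Count occurrences of self-reflective keywords across a set of model outputs."""
    counts = {term: 0 for term in terms}
    for response in responses:
        text = response.lower()
        for term in terms:
            counts[term] += text.count(term)
    return counts

# Toy usage on two generated solutions.
outputs = [
    "Let me recheck the computation ... so the answer is 12.",
    "Wait, I should rethink this step and recalculate the sum.",
]
print(count_reflection_terms(outputs))
# {'recheck': 1, 'rethink': 1, 'recalculate': 1}
```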

Practical Implications and Future Work:

  • Data Efficiency: The findings suggest that high data volume may not be strictly necessary for significant reasoning improvements via RLVR. This is crucial for reducing data collection costs and computational resources.
  • Data Selection: While the simple variance score worked, the observation that many examples can be effective highlights the need for more sophisticated data selection methods specifically for RLVR, potentially focusing on examples that effectively activate the model's latent reasoning abilities.
  • Understanding Mechanisms: The phenomena of post-saturation generalization and the effectiveness of policy gradient loss (and even entropy loss alone) warrant further theoretical investigation. The authors hypothesize that policy loss acts as an "implicit regularization" during exploration encouraged by entropy loss.
  • Exploration Strategies: The importance of entropy loss suggests that better methods for encouraging diverse and effective exploration during RLVR could lead to further performance gains.
  • Other Domains: The effectiveness of 1-shot RLVR should be explored in other domains like code generation.
  • Label Robustness: The findings regarding incorrect labels suggest that RLVR might be somewhat robust to minor errors, but overfitting to wrong answers is detrimental, pointing to a need for better understanding and potential mitigation strategies for noisy labels in RLVR.

The paper's code, model, and data are open source at https://github.com/ypwang61/One-Shot-RLVR.
