- The paper demonstrates that simple rejection-based methods like RAFT++ rival complex RL algorithms on mathematical reasoning tasks.
- It reveals that RAFT++ converges faster initially, though methods like GRPO sustain performance better during later training stages.
- The study emphasizes that effective sample filtering and balanced exploration are crucial for optimizing LLM fine-tuning via RL.
This paper (2504.11343) investigates reinforcement learning (RL) algorithms for fine-tuning LLMs on complex reasoning tasks, specifically focusing on mathematical reasoning. It contrasts the complexity of methods like Proximal Policy Optimization (PPO) with simpler approaches and analyzes the components of successful RL methods like GRPO, which was used to train models like DeepSeek-R1.
The authors revisit three key algorithms:
- Reward-Ranked Fine-Tuning (RAFT): This method, also known as rejection sampling fine-tuning, is arguably the simplest. It samples multiple responses per prompt, evaluates them with a binary reward function (e.g., correct vs. incorrect), and then fine-tunes the LLM to maximize the log-likelihood of only the high-reward (positive) samples. The process is iterative: in each round, the model is fine-tuned on data generated by the current policy.
- Vanilla Reinforce: A standard policy gradient algorithm that updates the model parameters $\theta$ using the gradient of the expected reward $\mathbb{E}_{a\sim\pi_\theta(\cdot\mid x)}[r(x,a)]$. For autoregressive LLMs, this is typically applied at the token level, using importance sampling and clipping (similar to PPO's policy loss) to handle off-policy data collected from an older policy $\pi_{\theta_{\text{old}}}$. The loss function for the token-level approach is given by:
$$\mathcal{L}_{\text{Reinforce}}(\theta) = \frac{1}{|\mathcal{D}|} \sum_{(x,a)\in\mathcal{D}} \frac{1}{|a|} \sum_{t=1}^{|a|} \Big[ \min\big(s_t(\theta),\ \mathrm{clip}(s_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\big) \cdot r(x,a) \Big]$$
where $s_t(\theta) = \frac{\pi_\theta(a_t \mid x, a_{1:t-1})}{\pi_{\theta_{\text{old}}}(a_t \mid x, a_{1:t-1})}$ is the importance sampling ratio and $r(x,a)$ is the reward for the entire response $a$.
- GRPO: This method is a variant of Reinforce that samples $n > 1$ responses per prompt. Instead of using the raw reward $r(x,a)$, it uses an advantage $A_t(x, a_i)$ for the $t$-th token of the $i$-th response:
$$A_t(x, a_i) = \frac{r_i - \mathrm{mean}(r_1, \dots, r_n)}{\mathrm{std}(r_1, \dots, r_n)}$$
This uses the mean and standard deviation of rewards across the responses sampled for a single prompt as a form of baseline to reduce variance. The loss is similar to the token-level Reinforce loss, but replaces $r(x,a)$ with this advantage (a short PyTorch sketch of the clipped loss and the advantage follows this list).
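To make these pieces concrete, below is a minimal PyTorch sketch of the clipped token-level loss shared by the Reinforce and GRPO variants, plus GRPO's group-normalized advantage. It is a sketch under the shape conventions stated in the docstrings, not the authors' implementation; all function and tensor names (`clipped_token_loss`, `grpo_advantage`, `logp_new`, `logp_old`) are illustrative.

```python
import torch

def clipped_token_loss(logp_new, logp_old, advantage, mask, eps=0.2):
    """Token-level policy loss with importance sampling and clipping.

    logp_new, logp_old: (batch, seq) log-probs of the sampled tokens under the
        current policy pi_theta and the behavior policy pi_theta_old.
    advantage: (batch,) per-response scalar: the raw reward r(x, a) for
        Reinforce, or the group-normalized reward for GRPO.
    mask: (batch, seq) 1.0 on response tokens, 0.0 on padding.
    """
    ratio = torch.exp(logp_new - logp_old)                 # s_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    adv = advantage.unsqueeze(-1)                          # broadcast over tokens
    # For non-negative rewards this equals min(s_t, clip(s_t)) * r as in the
    # formula above; with signed advantages it is the usual PPO-style form.
    per_token = torch.min(ratio * adv, clipped * adv)
    per_sample = (per_token * mask).sum(-1) / mask.sum(-1).clamp(min=1)  # 1/|a| sum_t
    return -per_sample.mean()                              # 1/|D| sum, negated to minimize

def grpo_advantage(rewards, eps=1e-6):
    """rewards: (n,) rewards of the n responses sampled for one prompt."""
    # eps guards against a zero std when all n rewards are equal
    # (an assumed safeguard, common in practice).
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```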
The authors introduce RAFT++ as a simple extension of RAFT that incorporates importance sampling and clipping, drawing inspiration from policy gradient methods but still training only on positive samples. Its loss function is:
$$\mathcal{L}_{\text{RAFT++}}(\theta) = \frac{1}{|\mathcal{D}|} \sum_{\substack{(x,a)\in\mathcal{D},\\ r(x,a)=1}} \frac{1}{|a|} \sum_{t=1}^{|a|} \Big[ \min\big(s_t(\theta),\ \mathrm{clip}(s_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\big) \cdot 1 \Big]$$
where the sum runs only over the positively rewarded samples ($r(x,a)=1$).
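For contrast with plain RAFT, here is a sketch of both objectives under the same conventions as the snippet above; it is a sketch assuming binary rewards, not the authors' code. The only differences in RAFT++ are the rejection step (keeping samples with $r(x,a)=1$) and the constant advantage of 1; the `eps_low`/`eps_high` split is an assumed way to expose the asymmetric "clip higher" option discussed later.

```python
import torch

def raft_loss(logp_new, mask):
    """Plain RAFT: maximize log-likelihood of positively rewarded samples only
    (i.e., standard SFT on the retained trajectories)."""
    per_sample = (logp_new * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    return -per_sample.mean()

def raft_pp_loss(logp_new, logp_old, rewards, mask, eps_low=0.2, eps_high=0.2):
    """RAFT++: keep only samples with r(x, a) == 1, then apply the clipped
    importance-sampling ratio with the advantage fixed at 1.
    Setting eps_high > eps_low gives the asymmetric 'clip higher' variant."""
    keep = rewards == 1                                    # rejection sampling step
    if keep.sum() == 0:                                    # no positive samples in batch
        return logp_new.sum() * 0.0
    ratio = torch.exp(logp_new[keep] - logp_old[keep])     # s_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = torch.min(ratio, clipped)                  # advantage is identically 1
    m = mask[keep]
    per_sample = (per_token * m).sum(-1) / m.sum(-1).clamp(min=1)
    return -per_sample.mean()
```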
Experiments were conducted on mathematical reasoning benchmarks (Math500, Minerva Math, Olympiad Bench) using Qwen2.5-Math-7B-base and LLaMA-3.2-3B-instruct models. The evaluation metric was average@16 accuracy (average accuracy over 16 generated responses per prompt with temperature 1.0).
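For reference, the metric could be computed with a small helper like the following (a hypothetical function, not the paper's evaluation script): it averages per-prompt accuracy over the 16 sampled responses and then over prompts.

```python
def average_at_k(correctness, k=16):
    """correctness: one list per prompt containing k binary scores (1 = the
    verifier accepted the response), for k responses sampled at temperature 1.0.
    Returns average@k: the mean over prompts of the per-prompt accuracy."""
    assert all(len(scores) == k for scores in correctness)
    per_prompt = [sum(scores) / k for scores in correctness]
    return sum(per_prompt) / len(per_prompt)
```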
Key Findings and Practical Implications:
- RAFT is a strong baseline: Surprisingly, RAFT and its enhanced version RAFT++ achieve performance competitive with more complex RL methods like GRPO and PPO on mathematical reasoning tasks. RAFT++, by adding importance sampling and clipping, performs particularly well, approaching GRPO's accuracy (e.g., 52.5\% for RAFT++ vs 53.9\% for GRPO on Qwen2.5). This suggests that for tasks with verifiable rewards, a simple rejection sampling approach is a robust and interpretable baseline.
- Faster early convergence of RAFT++: RAFT++ shows faster initial convergence compared to GRPO. However, it is eventually surpassed by GRPO in later training stages.
- Policy entropy and exploration: Analysis indicates that training only on positive samples (RAFT++) leads to a rapid decrease in policy entropy, limiting exploration and potentially causing the performance plateau observed later in training. GRPO, by incorporating negative samples, maintains higher entropy and continues to improve. The "clip higher" technique (asymmetric clipping) from previous work, when applied to RAFT++, helps stabilize entropy and improves later-stage performance, further supporting the link between exploration and sustained learning.
- The crucial role of sample filtering in GRPO: Ablation studies on Reinforce variants reveal that GRPO's superior performance over vanilla Reinforce is primarily due to its implicit filtering of prompts where all sampled responses are incorrect. Training on such prompts significantly harms performance. Removing prompts with only correct responses has less impact.
- Reward normalization is less important: The ablation studies show that reward normalization techniques used in GRPO (subtracting mean, dividing by standard deviation) provide minimal additional gain compared to simply filtering harmful samples. This suggests that the sample selection mechanism is more critical than the specific normalization scheme for GRPO's success on this task.
- Reinforce-Rej: A minimalist alternative: Motivated by the importance of filtering harmful samples, the authors propose Reinforce-Rej. This variant of Reinforce explicitly filters out prompts where all sampled responses are either entirely correct or entirely incorrect. This approach yields comparable final performance to GRPO while demonstrating improved KL efficiency and stability, offering a lightweight yet effective alternative to more complex RL algorithms.
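A minimal sketch of the prompt-level filter behind Reinforce-Rej, assuming binary rewards (function and variable names are illustrative):

```python
def reinforce_rej_filter(prompt_groups):
    """prompt_groups: list of (prompt, responses, rewards) tuples, where rewards
    holds the n binary scores of the n responses sampled for that prompt.
    Keeps only prompts whose responses mix correct and incorrect answers."""
    kept = []
    for prompt, responses, rewards in prompt_groups:
        if 0 < sum(rewards) < len(rewards):    # drop all-wrong and all-right prompts
            kept.append((prompt, responses, rewards))
    return kept
```

Note that for an all-correct or all-incorrect prompt, GRPO's normalized advantage is zero for every response (the numerator $r_i - \mathrm{mean}(r_1,\dots,r_n)$ vanishes), which is why its filtering is described as implicit; Reinforce-Rej simply makes the rejection explicit before the gradient step.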
Implementation Considerations:
- The methods rely on the ability to sample multiple responses per prompt and evaluate them with a reward function (a verifier for math tasks).
- Importance sampling and clipping are beneficial for stabilizing off-policy training, even for relatively simple methods like RAFT (leading to RAFT++). The clipping threshold ϵ is a hyperparameter to tune.
- The choice of samples included in the training batch significantly impacts performance and training dynamics. Filtering out samples or prompts with low-quality (especially entirely incorrect) responses appears critical.
- Balancing exploration and exploitation is important; methods that collapse policy entropy too quickly may plateau. Techniques like broader clipping ranges can help maintain exploration (a small entropy-monitoring sketch follows this list).
- Computational requirements involve sampling n responses per prompt and computing rewards for all of them. For GRPO and Reinforce variants, gradients are computed over tokens, while RAFT performs standard fine-tuning on selected trajectories.
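Because the plateau discussed above is diagnosed through policy entropy, a small monitoring helper such as the following can be logged during training to catch the collapse early (an illustrative sketch, not the authors' tooling):

```python
import torch

def token_entropy(logits, mask):
    """Average per-token entropy of the policy over response tokens; a rapid
    drop toward zero signals the kind of exploration collapse described above.
    logits: (batch, seq, vocab); mask: (batch, seq) with 1.0 on response tokens."""
    logp = torch.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(-1)          # (batch, seq)
    return (entropy * mask).sum() / mask.sum().clamp(min=1)
```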
In summary, the paper demonstrates that simple, rejection-based methods like RAFT are surprisingly effective. A key insight is that selectively filtering low-quality samples, particularly entirely incorrect responses, is a major driver of performance gains in reward-based LLM post-training algorithms like GRPO, outweighing the impact of reward normalization. The proposed Reinforce-Rej algorithm, which filters both entirely correct and incorrect prompts, provides a lightweight and efficient alternative that leverages this insight. The findings underscore the importance of principled sample selection in reward-based fine-tuning for LLMs. The code for the project is available at \url{https://github.com/RLHFlow/Minimal-RL}.