- The paper demonstrates that simple rejection-based methods like RAFT++ rival complex RL algorithms on mathematical reasoning tasks.
- It reveals that RAFT++ converges faster initially, though methods like GRPO sustain performance better during later training stages.
- The study emphasizes that effective sample filtering and balanced exploration are crucial for optimizing LLM fine-tuning via RL.
This paper (2504.11343) investigates reinforcement learning (RL) algorithms for fine-tuning LLMs on complex reasoning tasks, specifically focusing on mathematical reasoning. It contrasts the complexity of methods like Proximal Policy Optimization (PPO) with simpler approaches and analyzes the components of successful RL methods like GRPO, which was used to train models like DeepSeek-R1.
The authors revisit three key algorithms:
- Reward-Ranked Fine-Tuning (RAFT): This method, also known as rejection sampling fine-tuning, is arguably the simplest. It samples multiple responses per prompt, evaluates them with a binary reward function (e.g., correct vs. incorrect), and then fine-tunes the LLM to maximize the log-likelihood of only the high-reward (positive) samples. The process is iterative: in each round, the model is fine-tuned on data generated by the current policy.
- Vanilla Reinforce: A standard policy gradient algorithm that updates the model parameters $\theta$ using the gradient of the expected reward $\mathbb{E}_{a\sim\pi_\theta(\cdot\mid x)}[r(x,a)]$. For autoregressive LLMs, this is typically applied at the token level, using importance sampling and clipping (similar to PPO's policy loss) to handle off-policy data collected from an older policy $\pi_{\theta_{\text{old}}}$. The loss function for the token-level approach is given by:
$$\mathcal{L}_{\text{Reinforce}}(\theta) = \frac{1}{|\mathcal{D}|} \sum_{(x,a)\in\mathcal{D}} \frac{1}{|a|} \sum_{t=1}^{|a|} \Big[ \min\big(s_t(\theta),\ \mathrm{clip}(s_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\big) \cdot r(x,a) \Big]$$
where $s_t(\theta) = \frac{\pi_\theta(a_t \mid x, a_{1:t-1})}{\pi_{\theta_{\text{old}}}(a_t \mid x, a_{1:t-1})}$ is the importance sampling ratio and $r(x,a)$ is the reward for the entire response $a$.
- GRPO: This method is a variant of Reinforce that samples $n > 1$ responses per prompt. Instead of using the raw reward $r(x,a)$, it uses an advantage $A_t(x, a_i)$ for the $t$-th token of the $i$-th response:
$$A_t(x, a_i) = \frac{r_i - \mathrm{mean}(r_1, \dots, r_n)}{\mathrm{std}(r_1, \dots, r_n)}$$
This uses the mean and standard deviation of rewards across the responses sampled for a single prompt as a form of baseline to reduce variance. The loss is similar to the token-level Reinforce loss, but replaces $r(x,a)$ with this advantage (a short PyTorch sketch of the clipped loss and the advantage follows this list).
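To make these pieces concrete, below is a minimal PyTorch sketch of the clipped token-level loss shared by the Reinforce and GRPO variants, plus GRPO's group-normalized advantage. It is a sketch under the shape conventions stated in the docstrings, not the authors' implementation; all function and tensor names (`clipped_token_loss`, `grpo_advantage`, `logp_new`, `logp_old`) are illustrative.

```python
import torch

def clipped_token_loss(logp_new, logp_old, advantage, mask, eps=0.2):
    """Token-level policy loss with importance sampling and clipping.

    logp_new, logp_old: (batch, seq) log-probs of the sampled tokens under the
        current policy pi_theta and the behavior policy pi_theta_old.
    advantage: (batch,) per-response scalar: the raw reward r(x, a) for
        Reinforce, or the group-normalized reward for GRPO.
    mask: (batch, seq) 1.0 on response tokens, 0.0 on padding.
    """
    ratio = torch.exp(logp_new - logp_old)                 # s_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    adv = advantage.unsqueeze(-1)                          # broadcast over tokens
    # For non-negative rewards this equals min(s_t, clip(s_t)) * r as in the
    # formula above; with signed advantages it is the usual PPO-style form.
    per_token = torch.min(ratio * adv, clipped * adv)
    per_sample = (per_token * mask).sum(-1) / mask.sum(-1).clamp(min=1)  # 1/|a| sum_t
    return -per_sample.mean()                              # 1/|D| sum, negated to minimize

def grpo_advantage(rewards, eps=1e-6):
    """rewards: (n,) rewards of the n responses sampled for one prompt."""
    # eps guards against a zero std when all n rewards are equal
    # (an assumed safeguard, common in practice).
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```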
The authors introduce RAFT++ as a simple extension of RAFT that incorporates importance sampling and clipping, drawing inspiration from policy gradient methods but still training only on positive samples. Its loss function is:
$$\mathcal{L}_{\text{RAFT++}}(\theta) = \frac{1}{|\mathcal{D}|} \sum_{\substack{(x,a)\in\mathcal{D},\\ r(x,a)=1}} \frac{1}{|a|} \sum_{t=1}^{|a|} \Big[ \min\big(s_t(\theta),\ \mathrm{clip}(s_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\big) \cdot 1 \Big]$$
where the sum runs only over the positively rewarded samples ($r(x,a)=1$).
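For contrast with plain RAFT, here is a sketch of both objectives under the same conventions as the snippet above; it is a sketch assuming binary rewards, not the authors' code. The only differences in RAFT++ are the rejection step (keeping samples with $r(x,a)=1$) and the constant advantage of 1; the `eps_low`/`eps_high` split is an assumed way to expose the asymmetric "clip higher" option discussed later.

```python
import torch

def raft_loss(logp_new, mask):
    """Plain RAFT: maximize log-likelihood of positively rewarded samples only
    (i.e., standard SFT on the retained trajectories)."""
    per_sample = (logp_new * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    return -per_sample.mean()

def raft_pp_loss(logp_new, logp_old, rewards, mask, eps_low=0.2, eps_high=0.2):
    """RAFT++: keep only samples with r(x, a) == 1, then apply the clipped
    importance-sampling ratio with the advantage fixed at 1.
    Setting eps_high > eps_low gives the asymmetric 'clip higher' variant."""
    keep = rewards == 1                                    # rejection sampling step
    if keep.sum() == 0:                                    # no positive samples in batch
        return logp_new.sum() * 0.0
    ratio = torch.exp(logp_new[keep] - logp_old[keep])     # s_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = torch.min(ratio, clipped)                  # advantage is identically 1
    m = mask[keep]
    per_sample = (per_token * m).sum(-1) / m.sum(-1).clamp(min=1)
    return -per_sample.mean()
```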
Experiments were conducted on mathematical reasoning benchmarks (Math500, Minerva Math, Olympiad Bench) using Qwen2.5-Math-7B-base and LLaMA-3.2-3B-instruct models. The evaluation metric was average@16 accuracy (average accuracy over 16 generated responses per prompt with temperature 1.0).
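For reference, the metric could be computed with a small helper like the following (a hypothetical function, not the paper's evaluation script): it averages per-prompt accuracy over the 16 sampled responses and then over prompts.

```python
def average_at_k(correctness, k=16):
    """correctness: one list per prompt containing k binary scores (1 = the
    verifier accepted the response), for k responses sampled at temperature 1.0.
    Returns average@k: the mean over prompts of the per-prompt accuracy."""
    assert all(len(scores) == k for scores in correctness)
    per_prompt = [sum(scores) / k for scores in correctness]
    return sum(per_prompt) / len(per_prompt)
```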
Key Findings and Practical Implications:
- RAFT is a strong baseline: Surprisingly, RAFT and its enhanced version RAFT++ achieve performance competitive with more complex RL methods like GRPO and PPO on mathematical reasoning tasks. RAFT++, by adding importance sampling and clipping, performs particularly well, approaching GRPO's accuracy (e.g., 52.5\% for RAFT++ vs 53.9\% for GRPO on Qwen2.5). This suggests that for tasks with verifiable rewards, a simple rejection sampling approach is a robust and interpretable baseline.
- Faster early convergence of RAFT++: RAFT++ shows faster initial convergence compared to GRPO. However, it is eventually surpassed by GRPO in later training stages.
- Policy entropy and exploration: Analysis indicates that training only on positive samples (RAFT++) leads to a rapid decrease in policy entropy, limiting exploration and potentially causing the performance plateau observed later in training. GRPO, by incorporating negative samples, maintains higher entropy and continues to improve. The "clip higher" technique (asymmetric clipping) from previous work, when applied to RAFT++, helps stabilize entropy and improves later-stage performance, further supporting the link between exploration and sustained learning.
- The crucial role of sample filtering in GRPO: Ablation studies on Reinforce variants reveal that GRPO's superior performance over vanilla Reinforce is primarily due to its implicit filtering of prompts where all sampled responses are incorrect. Training on such prompts significantly harms performance. Removing prompts with only correct responses has less impact.
- Reward normalization is less important: The ablation studies show that reward normalization techniques used in GRPO (subtracting mean, dividing by standard deviation) provide minimal additional gain compared to simply filtering harmful samples. This suggests that the sample selection mechanism is more critical than the specific normalization scheme for GRPO's success on this task.
- Reinforce-Rej: A minimalist alternative: Motivated by the importance of filtering harmful samples, the authors propose Reinforce-Rej. This variant of Reinforce explicitly filters out prompts where all sampled responses are either entirely correct or entirely incorrect. This approach yields comparable final performance to GRPO while demonstrating improved KL efficiency and stability, offering a lightweight yet effective alternative to more complex RL algorithms.
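A minimal sketch of the prompt-level filter behind Reinforce-Rej, assuming binary rewards (function and variable names are illustrative):

```python
def reinforce_rej_filter(prompt_groups):
    """prompt_groups: list of (prompt, responses, rewards) tuples, where rewards
    holds the n binary scores of the n responses sampled for that prompt.
    Keeps only prompts whose responses mix correct and incorrect answers."""
    kept = []
    for prompt, responses, rewards in prompt_groups:
        if 0 < sum(rewards) < len(rewards):    # drop all-wrong and all-right prompts
            kept.append((prompt, responses, rewards))
    return kept
```

Note that for an all-correct or all-incorrect prompt, GRPO's normalized advantage is zero for every response (the numerator $r_i - \mathrm{mean}(r_1,\dots,r_n)$ vanishes), which is why its filtering is described as implicit; Reinforce-Rej simply makes the rejection explicit before the gradient step.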
Implementation Considerations:
- The methods rely on the ability to sample multiple responses per prompt and evaluate them with a reward function (a verifier for math tasks).
- Importance sampling and clipping are beneficial for stabilizing off-policy training, even for relatively simple methods like RAFT (leading to RAFT++). The clipping threshold ϵ is a hyperparameter to tune.
- The choice of samples included in the training batch significantly impacts performance and training dynamics. Filtering out samples or prompts with low-quality (especially entirely incorrect) responses appears critical.
- Balancing exploration and exploitation is important; methods that collapse policy entropy too quickly may plateau. Techniques like broader clipping ranges can help maintain exploration (a small entropy-monitoring sketch follows this list).
- Computational requirements involve sampling n responses per prompt and computing rewards for all of them. For GRPO and Reinforce variants, gradients are computed over tokens, while RAFT performs standard fine-tuning on selected trajectories.
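Because the plateau discussed above is diagnosed through policy entropy, a small monitoring helper such as the following can be logged during training to catch the collapse early (an illustrative sketch, not the authors' tooling):

```python
import torch

def token_entropy(logits, mask):
    """Average per-token entropy of the policy over response tokens; a rapid
    drop toward zero signals the kind of exploration collapse described above.
    logits: (batch, seq, vocab); mask: (batch, seq) with 1.0 on response tokens."""
    logp = torch.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(-1)          # (batch, seq)
    return (entropy * mask).sum() / mask.sum().clamp(min=1)
```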
In summary, the paper demonstrates that simple, rejection-based methods like RAFT are surprisingly effective. A key insight is that selectively filtering low-quality samples, particularly entirely incorrect responses, is a major driver of performance gains in reward-based LLM post-training algorithms like GRPO, outweighing the impact of reward normalization. The proposed Reinforce-Rej algorithm, which filters both entirely correct and incorrect prompts, provides a lightweight and efficient alternative that leverages this insight. The findings underscore the importance of principled sample selection in reward-based fine-tuning for LLMs. The code for the project is available at \url{https://github.com/RLHFlow/Minimal-RL}.