Reinforce-Rej: Minimalist RL for LLM Fine-Tuning

Updated 7 June 2026

Reinforce-Rej is a minimalist reinforcement learning approach that reduces gradient variance by filtering at the prompt level: a prompt is discarded when every sampled response receives the same reward.
Removing these all-correct and all-incorrect prompts improves KL efficiency and keeps policy entropy balanced.
Empirical results show that Reinforce-Rej reaches competitive accuracy and converges reliably, with stable gradient updates, during LLM fine-tuning.

Reinforce-Rej is a minimalist extension to policy-gradient reinforcement learning algorithms, specifically oriented toward LLM post-training with binary reward signals. Designed to reduce variance and improve the stability of standard REINFORCE updates, Reinforce-Rej employs a prompt-level filter that discards entire prompts from the training batch if all generated responses are either correct or incorrect. This filtering mechanism achieves greater KL efficiency, improves reward learning, and maintains stable policy entropy, making it a pragmatic alternative to more complex RL methods for reward-based LLM fine-tuning (Xiong et al., 15 Apr 2025).

1. Motivation and Background

Classical policy-gradient RL algorithms such as REINFORCE optimize the objective

$J(\theta) = \mathbb{E}_{x \sim d_0} \left[ \mathbb{E}_{a \sim \pi_\theta(\cdot | x)} r(x, a) \right]$

using the update

$\nabla_\theta J(\theta) = \mathbb{E}_{(x,a) \sim \pi_\theta} [ r(x,a) \nabla_\theta \log \pi_\theta(a|x) ].$

This estimator suffers from high variance, particularly when negative samples (with $r=-1$) dominate the responses to a prompt, resulting in noisy or even harmful gradients. Techniques like PPO and GRPO add reward normalization or clipping within each prompt but still incorporate every sample, which can lead to instability and excessive KL divergence from the initial policy if not well controlled (Xiong et al., 15 Apr 2025).
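To make the REINFORCE estimator concrete, the following is a minimal sketch for a toy single-prompt setting in which the policy is a softmax over a handful of discrete actions. The function names and the bandit setup are illustrative assumptions, not part of the cited work; the sketch only compares the exact gradient with its Monte Carlo estimate.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, rewards):
    """Exact gradient of J = sum_a pi(a) * r(a) with respect to the logits."""
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p_a, r_a) in enumerate(zip(pi, rewards)):
        for k in range(len(logits)):
            # d log pi(a) / d logit_k = 1[a == k] - pi(k)
            score = (1.0 if a == k else 0.0) - pi[k]
            grad[k] += p_a * r_a * score
    return grad


def sampled_gradient(logits, rewards, num_samples, seed=0):
    """Monte Carlo REINFORCE estimate: mean of r(a) * grad log pi(a)."""
    rng = random.Random(seed)
    pi = softmax(logits)
    actions = range(len(logits))
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        a = rng.choices(actions, weights=pi)[0]
        for k in range(len(logits)):
            score = (1.0 if a == k else 0.0) - pi[k]
            grad[k] += rewards[a] * score / num_samples
    return grad
```

With few samples the estimate from `sampled_gradient` fluctuates around `exact_gradient`, which is the variance problem that motivates filtering.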

RAFT (Rejection Sampling Fine-Tuning) addresses this by retaining only positively rewarded samples per prompt and performing standard maximum-likelihood training. While RAFT achieves competitive performance and faster early convergence, it suffers from entropy collapse, causing limited exploration and subpar final accuracy.

Analysis of GRPO ablations revealed that GRPO's key strength is not in reward normalization but in de facto filtering: it implicitly discards prompts where all responses are incorrect, eliminating uninformative or destabilizing gradients. Reinforce-Rej generalizes this insight to a formal policy-gradient approach.
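The implicit filtering can be seen in a short sketch of group-normalized advantages. It assumes the common formulation in which each reward is centered by the per-prompt mean and divided by the per-prompt standard deviation; the function name, the population standard deviation, and the `eps` constant are illustrative choices rather than details taken from the cited paper.

```python
import statistics


def group_normalized_advantages(rewards, eps=1e-6):
    """Center rewards by the per-prompt mean and scale by the per-prompt std.

    eps guards against division by zero when all rewards are identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# All-wrong prompt: every advantage is zero, so the prompt contributes
# no gradient, which acts like discarding it.
all_wrong = group_normalized_advantages([-1, -1, -1, -1])

# Mixed prompt: the single correct response gets a positive advantage,
# the incorrect ones get negative advantages.
mixed = group_normalized_advantages([1, -1, -1, -1])
```

Under this formulation an all-correct prompt is zeroed out in the same way, since centering removes any constant reward.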

2. Formal Algorithmic Definition

The Reinforce-Rej procedure operates as follows:

Given a batch of $M$ prompts, $n$ responses are sampled for each prompt $x_i$ under the current policy, and each response receives a binary reward $r_{i,j} \in \{-1, +1\}$.

For each prompt:
- If all $r_{i,j} = -1$ (all-wrong) or all $r_{i,j} = +1$ (all-correct), the entire prompt is discarded from the on-policy batch.
- Otherwise, all $(x_i, a_{i,j}, r_{i,j})$ triples are retained for the policy gradient update.

The loss used is a clipped, token-level policy gradient:

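$\mathcal{L}^{\text{Reinforce}}(\theta) = \frac{1}{|\mathcal{D}|} \sum_{(x,a) \in \mathcal{D}} \frac{1}{|a|} \sum_{t=1}^{|a|} \min\left( s_t(\theta)\, r(x,a),\ \mathrm{clip}\left(s_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) r(x,a) \right)$

where $s_t(\theta) = \frac{\pi_\theta(a_t \mid x, a_{1:t-1})}{\pi_{\theta_{\text{old}}}(a_t \mid x, a_{1:t-1})}$ is the token-level importance ratio, $\mathcal{D}$ is the filtered batch, and $\epsilon$ is the clip range.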
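A clipped, token-level objective of this kind can be sketched for a single response as follows. The sketch assumes the standard PPO-style clipped surrogate, with the sequence-level reward used as the weight on every token; the function name, argument names, and default clip range are illustrative assumptions.

```python
import math


def clipped_token_objective(new_logprobs, old_logprobs, reward, clip_eps=0.2):
    """Average clipped surrogate over the tokens of one response.

    new_logprobs / old_logprobs: per-token log-probabilities under the
    current policy and under the policy that generated the sample.
    reward: scalar sequence-level reward (+1 or -1 in the binary setting).
    Returns a value to be maximized.
    """
    if not new_logprobs or len(new_logprobs) != len(old_logprobs):
        raise ValueError("log-probability lists must be non-empty and aligned")
    total = 0.0
    for lp_new, lp_old in zip(new_logprobs, old_logprobs):
        ratio = math.exp(lp_new - lp_old)  # token-level importance ratio
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * reward, clipped * reward)
    return total / len(new_logprobs)
```

For a positive reward the clip caps how much a token's probability increase is credited; for a negative reward it caps how much a probability decrease is credited, which bounds the size of each update.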

The algorithm is summarized in the following pseudocode:

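Input: prompt distribution $d_0$, policy $\pi_\theta$, prompts per iteration $M$, responses per prompt $n$.
For each training iteration:
- Sample $M$ prompts $x_1, \dots, x_M \sim d_0$.
- For each prompt $x_i$, sample $n$ responses $a_{i,1}, \dots, a_{i,n}$ from the current policy and compute the rewards $r_{i,j} \in \{-1, +1\}$.
- Discard every prompt whose $n$ rewards are all $-1$ or all $+1$.
- Update $\theta$ on the retained $(x_i, a_{i,j}, r_{i,j})$ triples using the clipped, token-level loss.
(Xiong et al., 15 Apr 2025)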
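The prompt-level filter from the procedure above can be sketched in a few lines. The data layout (a list of `(prompt, responses, rewards)` tuples) and the function and flag names are assumptions made for illustration; the two flags also make it possible to reproduce "remove all wrong" and "remove all correct" filtering separately.

```python
def filter_prompts(batch, drop_all_wrong=True, drop_all_correct=True):
    """Return the (prompt, response, reward) triples that survive filtering.

    batch: list of (prompt, responses, rewards) tuples, where rewards is a
    list of +1 / -1 values aligned with responses.
    """
    kept = []
    for prompt, responses, rewards in batch:
        all_wrong = all(r == -1 for r in rewards)
        all_correct = all(r == 1 for r in rewards)
        if (drop_all_wrong and all_wrong) or (drop_all_correct and all_correct):
            continue  # unanimous outcome: discard the whole prompt
        kept.extend((prompt, a, r) for a, r in zip(responses, rewards))
    return kept


batch = [
    ("p1", ["a", "b"], [-1, -1]),  # all-wrong: dropped
    ("p2", ["c", "d"], [1, 1]),    # all-correct: dropped
    ("p3", ["e", "f"], [1, -1]),   # mixed: kept
]
kept = filter_prompts(batch)
```

Only the two triples from `p3` remain in `kept`, and these are what the policy-gradient update would see.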

3. Theoretical Properties

Variance Reduction and KL Efficiency

By discarding prompts with unanimous (all-correct or all-wrong) reward outcomes, Reinforce-Rej removes the highest-variance, least-informative samples from the policy-gradient estimator. This leads to empirically reduced variance in gradient estimates and slower growth of KL divergence between the current and initial policies compared to PPO or GRPO (Xiong et al., 15 Apr 2025).

Stability and Convergence

Reinforce-Rej avoids the instability associated with unfiltered REINFORCE, particularly entropy collapse and KL "blow-up", by ensuring that both positive and negative reward signals remain present while discarding batch extremes. Training converges reliably in approximately 200 online iterations with stable entropy and KL behavior. No formal finite-sample convergence bound is reported; the stability claims rest on empirical results.

Comparative Summary Table

| Algorithm | Filtering Rule | Reward Normalization | KL Control | Value Network |
| --- | --- | --- | --- | --- |
| REINFORCE | None | No | Poor | No |
| PPO/GRPO | None (but reward normalization within prompt) | Yes | Moderate | Optional |
| RAFT | Keep only positives | N/A (MLE) | Good | No |
| Reinforce-Rej | Discard all-correct/all-wrong prompts | No | Very good | No |

4. Empirical Results and Comparative Performance

Experiments conducted on mathematical reasoning datasets (e.g., Numina-Math, Math500) using models such as Qwen2.5-Math-7B and LLaMA-3.2-3B demonstrate that Reinforce-Rej achieves:

- Accuracy within ≲1 percentage point of GRPO, outperforming PPO.
- Stronger KL efficiency: $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{init}})$ increases more slowly than under PPO or unfiltered REINFORCE.
- Balanced exploration and exploitation: policy entropy remains ≳1.5 nats, compared to ≲0.5 nats for RAFT++ without filtering (Xiong et al., 15 Apr 2025).
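For readers unfamiliar with the units, the entropy and KL quantities above can be computed in nats for a single categorical distribution as in the sketch below. The function names are illustrative, and how these per-token values are aggregated over positions and samples is not specified here and is left as an assumption.

```python
import math


def entropy_nats(probs):
    """Shannon entropy of a categorical distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl_nats(p, q):
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


uniform4 = [0.25, 0.25, 0.25, 0.25]
h = entropy_nats(uniform4)                 # equals log(4)
kl_same = kl_nats(uniform4, uniform4)      # zero for identical distributions
kl_shift = kl_nats([0.5, 0.5], [0.9, 0.1])
```

As a rough reading aid, an entropy of 1.5 nats corresponds to a perplexity of about 4.5 (`math.exp(1.5)`), whereas 0.5 nats corresponds to about 1.6, so the reported gap reflects a much more concentrated policy in the latter case.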

Main findings from ablation studies:

- "Remove all wrong" filtering yields the largest reward and most stable KL/entropy curves.
- "Remove all correct" filtering provides minimal additional benefit but, when combined with "all wrong," achieves stable convergence and matches the best-performing methods.
- Standard deviation normalization confers negligible gain when filtering is used.

5. Relation to Other Methods and Practical Recommendations

Reinforce-Rej is distinguished by its simplicity and its filtering mechanism, which requires sampling multiple ($n \geq 2$) responses per prompt to detect all-wrong or all-correct cases. Compared to value-function-based methods (PPO), normalization-based policy gradients (GRPO), and pure rejection sampling (RAFT), Reinforce-Rej operates without a value network and with minimal additional hyperparameters.

Other recent approaches such as REDI (Reinforcement Distillation), DPO, and SimPO introduce explicit reference models, preference margins, or sigmoid-based objectives for combining positive and negative examples, providing different trade-offs between stability, aggressiveness, and peak performance (Xu et al., 30 May 2025). Reinforce-Rej, by contrast, forgoes reference models and learns directly from the "filtered" on-policy batch.

Practical recommendations from empirical studies include:

- AdamW optimizer with lr=1×10⁻⁶, prompts per iteration M=1024, responses per prompt n=4, mini-batch size 512, and clip range $\epsilon$ between 0.1 and 0.2.
- Suitable for tasks with binary rewards and where unfiltered RL leads to KL instability or noisy learning.
- Less effective when extreme outcome cases are rare (i.e., when rewards per prompt are not sharply bimodal).
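The recommended settings above can be collected in a small configuration object, sketched below. The class and field names are hypothetical, and the clip value of 0.2 is simply one point in the reported 0.1 to 0.2 range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReinforceRejConfig:
    """Hypothetical container for the reported hyperparameters."""
    learning_rate: float = 1e-6        # AdamW learning rate
    prompts_per_iteration: int = 1024  # M
    responses_per_prompt: int = 4      # n
    mini_batch_size: int = 512
    clip_range: float = 0.2            # reported range is 0.1 to 0.2

    def responses_per_iteration(self) -> int:
        """Responses sampled per iteration, before any prompt is filtered."""
        return self.prompts_per_iteration * self.responses_per_prompt


config = ReinforceRejConfig()
sampled = config.responses_per_iteration()
```

With these values each iteration samples 4096 responses before filtering; the number that reaches the gradient update is smaller and depends on how many prompts have unanimous rewards.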

6. Limitations and Practical Considerations

Sampling multiple responses per prompt imposes moderate computational overhead. Reinforce-Rej also discards part of each batch: when rewards are sparse or the batch is small, this loss of data can slow convergence. Because the method is minimalist, it lacks extensive reward normalization or critic-based stabilization, and performance may plateau if the prompt reward structure is poorly aligned with the filtering rules. A plausible implication is that further gains may require more nuanced treatment of negative samples rather than indiscriminate filtering (Xiong et al., 15 Apr 2025).
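The amount of data lost to filtering can be estimated with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that the $n$ responses to a prompt are independent and each is correct with the same probability $p$; under that assumption a prompt survives with probability $1 - p^n - (1-p)^n$. The function name is hypothetical.

```python
def expected_keep_fraction(p, n):
    """Probability that a prompt has mixed rewards among n responses.

    Assumes independent responses, each correct with probability p.
    """
    return 1.0 - p ** n - (1.0 - p) ** n


balanced = expected_keep_fraction(0.5, 4)  # 1 - 2 * (1/16) = 0.875
easy = expected_keep_fraction(0.9, 4)      # 1 - 0.6561 - 0.0001 = 0.3438
hard = expected_keep_fraction(0.1, 4)      # same as the easy case by symmetry
```

Under these assumptions a prompt that the policy solves 90% of the time (or only 10% of the time) is kept in roughly one third of iterations with $n=4$, which illustrates why very easy or very hard prompt sets leave few samples for the update.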

7. Significance and Prospects

Reinforce-Rej provides a minimalist, robust approach to RL-based fine-tuning of LLMs in settings where binary rewards and high variance are significant concerns. Its interpretability and simple implementation, together with its empirical competitiveness, make it a recommended baseline for future work exploring RL-driven post-training, particularly for reasoning tasks. Open challenges remain in leveraging "negative" examples more productively, motivating further research into principled policy-gradient objectives that move beyond rejection sampling and filtering (Xiong et al., 15 Apr 2025).


References

Xiong et al. (2025). A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce.

Xu et al. (2025). Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning.
