Reinforce-Rej: Efficient Policy Optimization
- Reinforce-Rej is a policy-gradient algorithm that filters homogeneous samples to focus on diverse, informative reward signals during LLM fine-tuning.
- It applies token-level policy updates with a clipping operator to ensure gradual learning and improved KL efficiency.
- By rejecting prompts with uniform responses, the algorithm prevents entropy collapse and promotes balanced exploration during RL post-training.
The Reinforce-Rej algorithm is a minimalist variant of policy-gradient optimization for reward-based fine-tuning of LLMs, distinguished by its explicit sample-rejection mechanism. It operates by filtering out every prompt whose sampled candidate responses are uniformly correct or uniformly incorrect. The objective is to focus optimization on informative samples, thereby improving KL efficiency and stability during RL post-training for LLM reasoning tasks.
1. Mechanism: Sample Homogeneity Filtering
Reinforce-Rej begins by generating $n$ candidate completions $a_1, \dots, a_n$ for each prompt $x$ and assigning each a binary reward $r_i = r(x, a_i)$. Unlike conventional Reinforce or its reward-normalized extensions such as GRPO, which update the policy using all observed feedback, Reinforce-Rej examines the set of rewards for each prompt. If all rewards are positive (every response correct) or all are negative (every response incorrect), the prompt and its responses are rejected from the update set.
Retained prompts are those for which there is heterogeneity among the sampled response rewards:

$$\exists\, i, j \in \{1, \dots, n\}: \quad r_i \neq r_j.$$
This rejection strategy avoids policy updates on homogeneous samples, which have been shown to drive entropy collapse or produce high-variance and misleading gradients.
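As an illustration, here is a minimal sketch of the rejection step; the data layout, function name, and reward encoding are illustrative assumptions, not the paper's reference implementation:

```python
from typing import List, Tuple

def reject_homogeneous(batch: List[Tuple[str, List[str], List[float]]]):
    """Keep only prompts whose sampled responses received mixed rewards.

    Each batch element is (prompt, responses, rewards), where rewards are the
    binary correctness signals for the n sampled responses. Prompts whose
    rewards are all identical (all correct or all incorrect) are dropped.
    """
    retained = []
    for prompt, responses, rewards in batch:
        if len(set(rewards)) > 1:  # reward heterogeneity: at least two distinct values
            retained.append((prompt, responses, rewards))
    return retained
```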
2. Policy Gradient Formulation and Update
For retained prompt-response pairs $(x, a)$, Reinforce-Rej applies a token-level policy update. For autoregressive models, the policy ratio for token $t$ is

$$s_t(\theta) = \frac{\pi_\theta(a_t \mid x, a_{<t})}{\pi_{\theta_{\text{old}}}(a_t \mid x, a_{<t})},$$

and the clipped Reinforce loss is

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(x, a)}\!\left[\frac{1}{|a|}\sum_{t=1}^{|a|} \min\Big(s_t(\theta)\, r(x, a),\; \operatorname{clip}\big(s_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, r(x, a)\Big)\right],$$

where $\operatorname{clip}(\cdot,\, 1-\epsilon,\, 1+\epsilon)$ is a standard ratio clipping operator that bounds the policy update magnitude.
Reinforce-Rej thus performs gradient updates only on diverse, partially correct sample sets, leveraging both positive and negative reward signals (when heterogeneity is present), and avoids updates that would reinforce behavior associated with uniform feedback.
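A minimal PyTorch-style sketch of this token-level clipped loss follows, assuming the sequence-level binary reward (e.g., +1/-1) is used directly as the advantage; tensor names, shapes, and the masking convention are illustrative:

```python
import torch

def clipped_reinforce_loss(logp_new: torch.Tensor,
                           logp_old: torch.Tensor,
                           reward: torch.Tensor,
                           mask: torch.Tensor,
                           eps: float = 0.2) -> torch.Tensor:
    """Token-level clipped Reinforce loss on retained (heterogeneous) samples.

    logp_new, logp_old: [batch, seq_len] per-token log-probs under the current
        and behavior policies; mask: [batch, seq_len] marks response tokens;
    reward: [batch] sequence-level rewards, broadcast over tokens as the advantage.
    """
    ratio = torch.exp(logp_new - logp_old)               # s_t(theta)
    adv = reward.unsqueeze(-1)                           # broadcast reward over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    per_token = torch.min(unclipped, clipped)            # pessimistic (clipped) objective
    # Average over response tokens, then over the batch; negate for gradient descent.
    per_seq = (per_token * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    return -per_seq.mean()
```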
3. Comparison to RAFT and GRPO
RAFT trains exclusively on positively rewarded samples: only correct responses are retained for the update. The consequence, observed in experiments, is that policy entropy collapses prematurely; exploration is quickly depleted, causing performance to plateau.
GRPO extends Reinforce by incorporating a group-based advantage normalization:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_n)}{\operatorname{std}(r_1, \dots, r_n)}.$$
While GRPO uses both positive and negative samples, its experimental gains are largely attributed to its implicit exclusion of uniformly incorrect prompts. Ablations provided in the paper confirm that explicit filtering, as implemented in Reinforce-Rej, captures the bulk of GRPO’s performance advantage in a more parsimonious manner. Further, Reinforce-Rej also filters out samples with all correct responses, balancing exploration and avoiding excessive exploitation.
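For reference, here is a small sketch of this group-wise normalization (the epsilon term and names are illustrative); it also shows why homogeneous groups contribute zero advantage, which is the implicit exclusion noted above:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within each prompt's group of n sampled responses.

    rewards: [num_prompts, n] float tensor of binary rewards. Returns
    (r_i - mean) / (std + eps) per response. For homogeneous groups the
    numerator is zero, so those prompts receive zero advantage and are
    effectively excluded from the gradient.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```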
4. Stability, KL Efficiency, and Empirical Consequences
By discarding homogeneous prompts, Reinforce-Rej curtails updates that would induce abrupt changes in the policy distribution as measured by KL divergence. KL efficiency is thus improved, with training curves exhibiting more gradual, stable divergence from the initial policy. Stability is further enhanced by the selective rejection of harmful prompts: both those that would drive aggressive exploitation (all correct responses) and those that would incur misleading, high-variance gradients (all incorrect responses).
Entropy is preserved for longer training horizons, supporting continued exploration and reducing the risk of premature convergence observed in methods like RAFT.
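As an illustration of the quantities discussed here, the following is a minimal sketch of Monte Carlo estimates of the KL divergence from the initial policy and of the policy entropy, computed over response tokens sampled under the current policy; tensor names and shapes are illustrative assumptions:

```python
import torch

def kl_and_entropy(logp_current: torch.Tensor,
                   logp_initial: torch.Tensor,
                   mask: torch.Tensor):
    """Monte Carlo estimates over tokens sampled from the current policy.

    KL(pi_theta || pi_0) is approximated by the mean of (log pi_theta - log pi_0),
    and entropy by the mean of -log pi_theta, both averaged over response tokens
    (logp_* are [batch, seq_len]; mask marks response tokens).
    """
    n_tokens = mask.sum().clamp(min=1)
    kl = ((logp_current - logp_initial) * mask).sum() / n_tokens
    entropy = (-logp_current * mask).sum() / n_tokens
    return kl.item(), entropy.item()
```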
5. Algorithmic Simplicity and Interpretability
Reinforce-Rej is implemented as a lightweight extension to the classical Reinforce pipeline. No additional critic networks, advantage estimators, or reward normalization steps are required beyond the sample-level filtering and standard ratio clipping. The resultant policy update mechanism is easy to interpret, as model learning focuses strictly on decision-relevant samples with reward diversity.
6. Ablation Results and Theoretical Implications
Experiments and ablations in the paper indicate:
- The removal of all-wrong prompts yields significant gains, but the best performance is achieved when both all-wrong and all-correct prompts are filtered.
- Reward normalization in GRPO does not confer substantial additional benefit over explicit filtering; thus, the primary driver of policy improvement is rejection of non-informative samples.
- Early convergence in entropy and win-rate for RAFT is offset by stagnation at lower overall final performance.
This suggests future algorithms may profit by emphasizing principled sample selection criteria and adaptive rejection mechanisms over complex gradient normalization or critic-based sophistication. A plausible implication is that “smart” sample filtering is more effective in RL-based LLM alignment than indiscriminate gradient updates on all available samples.
7. Outlook and Research Directions
Reinforce-Rej’s paradigm underscores the importance of data curation in RL post-training for LLMs. Further research may pursue:
- Adaptive filtering thresholds based on reward distribution statistics.
- Incorporation of negative samples targeted to “boundary” learning cases, potentially using curriculum strategies.
- Analysis of long-term exploration-exploitation tradeoffs as a function of sample rejection criteria.
Current evidence suggests that reward-based LLM post-training benefits most from robust, interpretable sample selection, with algorithmic simplicity favored over increased complexity and reliance on auxiliary network components.
In summary, Reinforce-Rej is a minimal, sample-efficient, and interpretable variant of policy-gradient RL for LLM reasoning alignment, uniquely characterized by its dual-sample rejection strategy. This algorithm advances KL stability, maintains entropy, and delivers performance competitive with more complex alternatives such as GRPO and RAFT, guiding future RLHF algorithm design toward principled data selection and filtering (Xiong et al., 15 Apr 2025).