ScRPO: Self-correction Relative Policy Optimization
- ScRPO is a reinforcement learning framework that employs a two-stage process combining trial-and-error policy optimization with a dedicated self-correction phase.
- It integrates a Variance-Based Filter to collect high-variance errors from challenging prompts, turning mistakes into valuable learning signals.
- Empirical results show that ScRPO outperforms traditional methods like GRPO, SFT, and DPO in benchmark mathematical problem-solving tasks.
Self-correction Relative Policy Optimization (ScRPO) is a reinforcement learning framework designed to enhance LLMs in challenging mathematical problem-solving contexts by systematically leveraging self-reflection and error correction. Unlike conventional approaches that treat model mistakes as dead ends, ScRPO treats failure as informative data, channeling model-generated errors into supervised self-diagnosis to foster more robust reasoning and improved accuracy. ScRPO is characterized by a two-stage process: a trial-and-error policy optimization phase followed by a self-correction phase in which the model analyzes and corrects its own mistakes.
1. Conceptual Foundations and Motivation
ScRPO is motivated by the observation that standard policy optimization methods for LLMs, such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), extract only scalar rewards from generated outputs, applying each reward signal only once and discarding the rich explanatory content present in incorrect solutions. These methods thus underutilize failed chains of thought, neglecting potential learning opportunities from model-generated mistakes.
ScRPO extends GRPO in three principal ways:
- It collects an “error pool” containing high-variance mistakes encountered during policy gradient training.
- It introduces periodic prompting to force the model to analyze its own errors (self-reflection).
- It modifies the reward structure to target not only the correctness of revised solutions but also the informativeness of tokens in the “Analysis” segment of the self-reflection prompt.
This framework, which can be described as “trial-and-error plus self-correction” (Editor's term), exploits richer feedback while improving both data efficiency and the robustness of model reasoning.
2. Mathematical Formulation
2.1 Trial-and-Error Learning via GRPO
Let $\pi_\theta$ denote the policy of the LLM parameterized by $\theta$. For each question $q$, a group of $G$ responses $\{o_1, \dots, o_G\}$ is sampled under an older policy $\pi_{\theta_{\text{old}}}$. Each response $o_i$ receives a binary reward:
- $r_i = 1$ if the final answer is correct, $0$ otherwise.

The group-relative advantage for response $o_i$ is:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}\big(\{r_j\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{r_j\}_{j=1}^{G}\big)}.$$

The per-token GRPO objective is:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(\rho_{i,t}\hat{A}_i,\ \operatorname{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\Big)\right],$$

where $\rho_{i,t} = \pi_\theta(o_{i,t}\mid q, o_{i,<t}) \,/\, \pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})$ is the token-level importance ratio and $\beta$ is the KL regularization weight.
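The group-relative advantage and the clipped per-token surrogate can be computed directly from token log-probabilities. Below is a minimal sketch of the objective above (omitting the KL term), assuming rewards and log-probabilities have already been gathered into tensors; the function names are illustrative, not from a released implementation.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) float tensor of binary rewards for one question's G sampled responses."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logp_new: torch.Tensor,    # (G, T) token log-probs under pi_theta
              logp_old: torch.Tensor,    # (G, T) token log-probs under pi_theta_old
              token_mask: torch.Tensor,  # (G, T) 1 for real response tokens, 0 for padding
              rewards: torch.Tensor,     # (G,)  binary correctness rewards
              clip_eps: float = 0.2) -> torch.Tensor:
    adv = group_relative_advantages(rewards).unsqueeze(1)      # (G, 1), broadcast over tokens
    ratio = torch.exp(logp_new - logp_old)                     # token-level importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * adv, clipped * adv)          # PPO-style clipped objective
    per_response = (surrogate * token_mask).sum(-1) / token_mask.sum(-1).clamp(min=1)
    return -per_response.mean()  # negate: maximize the objective by minimizing the loss
```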
2.2 Variance-Based Filtering and Error Pool
ScRPO introduces a Variance-Based Filter (VBF) to select questions near the model's "capability boundary", i.e., those whose empirical accuracy over the sampled group falls within the filter bounds $(\alpha_{\text{low}}, \alpha_{\text{high}})$. For such prompts, if all generated answers in a group are incorrect, the pairs $(q, o_i)$ are added to the error pool $\mathcal{E}$.
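A minimal sketch of this filtering step, assuming a group of sampled solutions with per-sample correctness flags; the bounds `alpha_low`/`alpha_high` and the `ErrorPool` container are illustrative stand-ins for the paper's filter parameters.

```python
from dataclasses import dataclass, field

@dataclass
class ErrorPool:
    """Stores (question, wrong_solution) pairs harvested during training."""
    items: list[tuple[str, str]] = field(default_factory=list)

    def add(self, question: str, wrong_solutions: list[str]) -> None:
        self.items.extend((question, s) for s in wrong_solutions)

def variance_based_filter(question: str,
                          solutions: list[str],
                          correct: list[bool],
                          pool: ErrorPool,
                          alpha_low: float = 0.0,
                          alpha_high: float = 0.5) -> None:
    """Add all-incorrect groups from boundary-difficulty questions to the error pool."""
    acc = sum(correct) / len(correct)           # empirical accuracy over the G samples
    near_boundary = alpha_low <= acc <= alpha_high
    if near_boundary and not any(correct):      # every sampled answer was wrong
        pool.add(question, solutions)
```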
2.3 Self-correction Learning Stage
At regular intervals (every $T$ training iterations, with $T$ fixed in the reported experiments), ScRPO suspends GRPO updates and samples a mini-batch from the error pool $\mathcal{E}$. For each error pair $(q, o)$, the model receives a prompt of the form:
```text
You tried solving but failed. Reflect on what went wrong.
Question: ...
Wrong solution: ...
**Analysis:** ...
**Corrected Solution:** ...
```
The model generates $K$ reflection trajectories, each containing an "Analysis" segment and a corrected solution. Rewards are based on the correctness of the revised answer, but in the loss computation the advantage is assigned only to "Analysis" tokens via a binary mask $m_{i,t}$ (1 for Analysis tokens, 0 otherwise):

$$\mathcal{J}_{\text{SC}}(\theta) = \mathbb{E}\left[\frac{1}{K}\sum_{i=1}^{K}\frac{1}{\sum_{t} m_{i,t}}\sum_{t} m_{i,t}\,\min\big(\rho_{i,t}\hat{A}_i,\ \operatorname{clip}(\rho_{i,t},\,1-\epsilon_{\text{low}},\,1+\epsilon_{\text{high}})\,\hat{A}_i\big)\right],$$

with $\epsilon_{\text{high}} > \epsilon_{\text{low}}$ to encourage exploration. This enforces targeted improvement in the model's reflective-critique capability.
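A sketch of the masked self-correction loss, assuming the same tensor layout as the GRPO sketch above plus a per-token mask marking the "Analysis" segment; the asymmetric clipping bounds are shown as generic parameters rather than the paper's exact values.

```python
import torch

def self_correction_loss(logp_new: torch.Tensor,       # (K, T) token log-probs under pi_theta
                         logp_old: torch.Tensor,       # (K, T) token log-probs at sampling time
                         analysis_mask: torch.Tensor,  # (K, T) 1 for "Analysis" tokens, 0 elsewhere
                         rewards: torch.Tensor,        # (K,)  correctness of the revised answers
                         eps_low: float = 0.2,
                         eps_high: float = 0.3) -> torch.Tensor:
    adv = ((rewards - rewards.mean()) / (rewards.std() + 1e-6)).unsqueeze(1)
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)  # wider upper bound -> more exploration
    surrogate = torch.min(ratio * adv, clipped * adv)
    # Only "Analysis" tokens contribute to the gradient; corrected-solution tokens are masked out.
    per_trajectory = (surrogate * analysis_mask).sum(-1) / analysis_mask.sum(-1).clamp(min=1)
    return -per_trajectory.mean()
```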
3. Algorithmic Workflow
The ScRPO procedure is summarized in pseudocode as follows:
- Initialize: Start with policy $\pi_\theta$, reference policy $\pi_{\text{ref}}$, filter bounds $(\alpha_{\text{low}}, \alpha_{\text{high}})$, and self-correction period $T$; set ErrorPool $\leftarrow \emptyset$.
- Training Iterations: For each batch:
- Sample $G$ answers per question under $\pi_{\theta_{\text{old}}}$.
- Compute rewards and group-relative advantages; apply GRPO update.
- Compute the empirical accuracy; if it falls within $(\alpha_{\text{low}}, \alpha_{\text{high}})$ and all answers are incorrect, add the $(q, o_i)$ pairs to ErrorPool.
- Self-correction Phase: Every $T$ iterations, if ErrorPool is non-empty:
- Sample mini-batch from ErrorPool.
- Prompt for error reflection and sample $K$ trajectories per error.
- Compute self-correction loss; update accordingly.
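Tying the pieces together, a high-level training loop might look like the following sketch. The callables `sample_group`, `grpo_update`, and `self_correction_update` are hypothetical stand-ins for the routines described above, and the default hyperparameters are illustrative rather than the paper's settings.

```python
import random
from typing import Callable

def train_scrpo(
    questions: list[str],
    sample_group: Callable[[str, int], tuple[list[str], list[bool]]],  # G rollouts + correctness
    grpo_update: Callable[[str, list[str], list[bool]], None],         # trial-and-error step
    self_correction_update: Callable[[str, str], None],                # reflection step on (q, wrong)
    *,
    group_size: int = 8,
    period: int = 10,
    alpha_low: float = 0.0,
    alpha_high: float = 0.5,
    num_steps: int = 1000,
) -> None:
    """High-level ScRPO loop: GRPO trial-and-error plus periodic self-correction."""
    error_pool: list[tuple[str, str]] = []
    for step in range(1, num_steps + 1):
        # Trial-and-error phase (GRPO).
        for q in random.sample(questions, k=min(32, len(questions))):
            solutions, correct = sample_group(q, group_size)
            grpo_update(q, solutions, correct)
            acc = sum(correct) / group_size
            if alpha_low <= acc <= alpha_high and not any(correct):    # variance-based filter
                error_pool.extend((q, s) for s in solutions)
        # Self-correction phase, every `period` steps.
        if step % period == 0 and error_pool:
            for q, wrong in random.sample(error_pool, k=min(16, len(error_pool))):
                self_correction_update(q, wrong)
```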
4. Experimental Setup and Evaluation Protocols
The model architectures used are DeepSeek-Distill-Qwen-1.5B and DeepSeek-Distill-Qwen-7B. The training set comprises 14,564 mixed-difficulty math problems drawn from MATH and DAPO-MATH. Training employs the AdamW optimizer with a batch size of 128, $G$ sampled responses per question, a sampling temperature of 0.6, and top-$p$ of 0.95. Experiments are run on 8 NVIDIA H200 GPUs.
Benchmarks for evaluation include GSM8k, MATH-500, AIME-2024, AMC10/12, and Olympiad. The evaluation protocol uses avg@$k$ sampling, with a separate $k$ for AIME/AMC and for the remaining benchmarks, and reports accuracy.
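Under the avg@$k$ protocol, accuracy is averaged over $k$ independent samples per problem and then over problems. A minimal sketch, assuming per-sample correctness flags for each problem are already available:

```python
def avg_at_k(correct_flags_per_problem: list[list[bool]]) -> float:
    """avg@k: mean over problems of the fraction of the k samples whose final answer is correct."""
    per_problem = [sum(flags) / len(flags) for flags in correct_flags_per_problem]
    return sum(per_problem) / len(per_problem)

# Example: two problems, k = 4 samples each -> (0.75 + 0.25) / 2 = 0.5
print(avg_at_k([[True, True, False, True], [False, False, True, False]]))
```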
5. Comparative Results
Across all considered benchmarks and both model sizes, ScRPO outperforms prior fine-tuning and post-training approaches such as SFT, DPO, GRPO, and DAPO. Accuracy improvements over the vanilla models and baseline methods are summarized in the following table:
Accuracy (%) on Benchmarks (Δ is relative to the vanilla model)
| Model | AIME24 Acc/Δ | AMC Acc/Δ | Olympiad Acc/Δ | GSM8k Acc/Δ | MATH-500 Acc/Δ | Avg Acc |
|---|---|---|---|---|---|---|
| Qwen-1.5B Vanilla | 28.7/— | 63.3/— | 47.0/— | 76.0/— | 79.6/— | 58.9 |
| Qwen-1.5B SFT | 27.7/-1.0 | 57.2/-6.1 | 42.5/-4.5 | 81.4/+5.4 | 78.8/-0.8 | 57.5 |
| Qwen-1.5B DPO | 24.0/-4.7 | 59.0/-4.3 | 44.0/-3.0 | 80.2/+4.2 | 78.6/-1.0 | 57.2 |
| Qwen-1.5B GRPO | 30.0/+1.3 | 66.6/+3.3 | 49.4/+2.4 | 83.5/+7.5 | 83.0/+3.4 | 62.5 |
| Qwen-1.5B DAPO | 26.7/-2.0 | 66.3/+3.0 | 46.6/-0.4 | 80.0/+4.0 | 85.8/+6.2 | 61.1 |
| Qwen-1.5B ScRPO | 34.2/+5.5 | 68.5/+5.2 | 52.0/+5.0 | 85.0/+9.0 | 84.1/+4.5 | 64.8 |
| Qwen-7B Vanilla | 51.3/— | 78.9/— | 63.1/— | 87.0/— | 92.0/— | 74.5 |
| Qwen-7B SFT | 48.7/-2.6 | 78.6/-0.3 | 62.7/-0.4 | 87.3/+0.3 | 91.4/-0.6 | 73.7 |
| Qwen-7B DPO | 53.0/+1.7 | 77.2/-1.7 | 60.4/-2.7 | 86.1/-0.9 | 90.0/-2.0 | 73.3 |
| Qwen-7B GRPO | 54.0/+2.7 | 82.1/+3.2 | 64.0/+0.9 | 90.2/+3.2 | 91.9/-0.1 | 76.4 |
| Qwen-7B DAPO | 53.3/+2.0 | 81.7/+2.8 | 63.4/+0.3 | 90.1/+3.1 | 91.6/-0.4 | 76.0 |
| Qwen-7B ScRPO | 57.5/+6.2 | 83.5/+4.6 | 65.1/+2.0 | 90.8/+3.8 | 92.3/+0.3 | 77.8 |
ScRPO achieves an average gain of +2.3 points over GRPO on Qwen-1.5B and +1.4 points on Qwen-7B.
6. Ablation Studies
Ablation studies demonstrate the criticality of two principal components:
- Variance-Based Filter (VBF): Removing this filter results in a performance drop of 1.1 points (1.5B) and 1.2 points (7B).
- Reflection-Only Loss Mask: Removing the mask that restricts the loss to "Analysis" tokens results in a drop of 1.4 points (1.5B) and 1.7 points (7B).
This suggests that both high-variance error selection and targeted analytic supervision are essential to the effectiveness of ScRPO.
7. Limitations, Extensions, and Broader Implications
ScRPO employs a synthetic error pool and a binary correctness reward, which may not capture the spectrum of error types present in real-world settings. Potential extensions include adapting ScRPO to continuous or multi-class reward schemas, application to other domains such as code generation or commonsense reasoning, and incorporating more sophisticated error representations.
Training complexity is increased by the additional self-correction phase, and calibrating the self-correction period $T$ and the clipping thresholds $(\epsilon_{\text{low}}, \epsilon_{\text{high}})$ requires careful tuning. The approach offers a paradigm shift towards models that "learn by reflecting," suggesting reduced dependence on human-annotated datasets and promoting transparency in model error correction. A plausible implication is enhanced data efficiency and reliability of LLMs on tasks with limited external feedback.
ScRPO presents a systematic method for self-improving AI systems in challenging reasoning domains, setting a precedent for further development in leveraging machine-generated error analysis as an explicit learning signal.