ScRPO: Self-correction Relative Policy Optimization
- ScRPO is a reinforcement learning framework that employs a two-stage process combining trial-and-error policy optimization with a dedicated self-correction phase.
- It integrates a Variance-Based Filter to collect high-variance errors from challenging prompts, turning mistakes into valuable learning signals.
- Empirical results show that ScRPO outperforms traditional methods like GRPO, SFT, and DPO in benchmark mathematical problem-solving tasks.
Self-correction Relative Policy Optimization (ScRPO) is a reinforcement learning framework designed to enhance LLMs in challenging mathematical problem-solving contexts by systematically leveraging self-reflection and error correction. Unlike conventional approaches that treat model mistakes as dead ends, ScRPO treats failure as informative data, channeling model-generated errors into supervised self-diagnosis to foster more robust reasoning and improved accuracy. ScRPO is characterized by a two-stage process: a trial-and-error policy optimization phase followed by a self-correction phase in which the model analyzes and corrects its own mistakes.
1. Conceptual Foundations and Motivation
ScRPO is motivated by the observation that standard policy optimization methods for LLMs, such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), extract only scalar rewards from generated outputs, applying each reward signal only once and discarding the rich explanatory content present in incorrect solutions. These methods thus underutilize failed chains of thought, neglecting potential learning opportunities from model-generated mistakes.
ScRPO extends GRPO in three principal ways:
- It collects an “error pool” containing high-variance mistakes encountered during policy gradient training.
- It introduces periodic prompting to force the model to analyze its own errors (self-reflection).
- It modifies the reward structure to target not only the correctness of revised solutions but also the informativeness of tokens in the “Analysis” segment of the self-reflection prompt.
This framework, which can be described as “trial-and-error plus self-correction” (Editor's term), exploits richer feedback while improving both data efficiency and the robustness of model reasoning.
2. Mathematical Formulation
2.1 Trial-and-Error Learning via GRPO
Let $\pi_\theta$ denote the policy of the LLM parameterized by $\theta$. For each question $q$, a group of $G$ responses $\{o_1, \dots, o_G\}$ is sampled under an older policy $\pi_{\theta_{\text{old}}}$. Each response $o_i$ receives a binary reward:
- $r_i = 1$ if the final answer is correct, $0$ otherwise.

The group-relative advantage for response $o_i$ is:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}\big(\{r_j\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{r_j\}_{j=1}^{G}\big)}.$$

The per-token GRPO objective is:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(\rho_{i,t}\hat{A}_i,\ \operatorname{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\Big)\right],$$

where $\rho_{i,t} = \pi_\theta(o_{i,t}\mid q, o_{i,<t}) \,/\, \pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})$ is the token-level importance ratio and $\beta$ is the KL regularization weight.
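The group-relative advantage and the clipped per-token surrogate can be computed directly from token log-probabilities. Below is a minimal sketch of the objective above (omitting the KL term), assuming rewards and log-probabilities have already been gathered into tensors; the function names are illustrative, not from a released implementation.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) float tensor of binary rewards for one question's G sampled responses."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logp_new: torch.Tensor,    # (G, T) token log-probs under pi_theta
              logp_old: torch.Tensor,    # (G, T) token log-probs under pi_theta_old
              token_mask: torch.Tensor,  # (G, T) 1 for real response tokens, 0 for padding
              rewards: torch.Tensor,     # (G,)  binary correctness rewards
              clip_eps: float = 0.2) -> torch.Tensor:
    adv = group_relative_advantages(rewards).unsqueeze(1)      # (G, 1), broadcast over tokens
    ratio = torch.exp(logp_new - logp_old)                     # token-level importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * adv, clipped * adv)          # PPO-style clipped objective
    per_response = (surrogate * token_mask).sum(-1) / token_mask.sum(-1).clamp(min=1)
    return -per_response.mean()  # negate: maximize the objective by minimizing the loss
```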
2.2 Variance-Based Filtering and Error Pool
ScRPO introduces a Variance-Based Filter (VBF) to select questions near the model's "capability boundary", i.e., those whose empirical accuracy over the sampled group falls within the filter bounds $(\alpha_{\text{low}}, \alpha_{\text{high}})$. For such prompts, if all generated answers in a group are incorrect, the pairs $(q, o_i)$ are added to the error pool $\mathcal{E}$.
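A minimal sketch of this filtering step, assuming a group of sampled solutions with per-sample correctness flags; the bounds `alpha_low`/`alpha_high` and the `ErrorPool` container are illustrative stand-ins for the paper's filter parameters.

```python
from dataclasses import dataclass, field

@dataclass
class ErrorPool:
    """Stores (question, wrong_solution) pairs harvested during training."""
    items: list[tuple[str, str]] = field(default_factory=list)

    def add(self, question: str, wrong_solutions: list[str]) -> None:
        self.items.extend((question, s) for s in wrong_solutions)

def variance_based_filter(question: str,
                          solutions: list[str],
                          correct: list[bool],
                          pool: ErrorPool,
                          alpha_low: float = 0.0,
                          alpha_high: float = 0.5) -> None:
    """Add all-incorrect groups from boundary-difficulty questions to the error pool."""
    acc = sum(correct) / len(correct)           # empirical accuracy over the G samples
    near_boundary = alpha_low <= acc <= alpha_high
    if near_boundary and not any(correct):      # every sampled answer was wrong
        pool.add(question, solutions)
```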
2.3 Self-correction Learning Stage
At regular intervals (every $T$ training iterations, with $T$ fixed in the reported experiments), ScRPO suspends GRPO updates and samples a mini-batch from the error pool $\mathcal{E}$. For each error pair $(q, o)$, the model receives a prompt of the form:
```text
You tried solving but failed. Reflect on what went wrong.
Question: ...
Wrong solution: ...
**Analysis:** ...
**Corrected Solution:** ...
```
The model generates $K$ reflection trajectories, each containing an "Analysis" segment and a corrected solution. Rewards are based on the correctness of the revised answer, but in the loss computation the advantage is assigned only to "Analysis" tokens via a binary mask $m_{i,t}$ (1 for Analysis tokens, 0 otherwise):

$$\mathcal{J}_{\text{SC}}(\theta) = \mathbb{E}\left[\frac{1}{K}\sum_{i=1}^{K}\frac{1}{\sum_{t} m_{i,t}}\sum_{t} m_{i,t}\,\min\big(\rho_{i,t}\hat{A}_i,\ \operatorname{clip}(\rho_{i,t},\,1-\epsilon_{\text{low}},\,1+\epsilon_{\text{high}})\,\hat{A}_i\big)\right],$$

with $\epsilon_{\text{high}} > \epsilon_{\text{low}}$ to encourage exploration. This enforces targeted improvement in the model's reflective-critique capability.
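A sketch of the masked self-correction loss, assuming the same tensor layout as the GRPO sketch above plus a per-token mask marking the "Analysis" segment; the asymmetric clipping bounds are shown as generic parameters rather than the paper's exact values.

```python
import torch

def self_correction_loss(logp_new: torch.Tensor,       # (K, T) token log-probs under pi_theta
                         logp_old: torch.Tensor,       # (K, T) token log-probs at sampling time
                         analysis_mask: torch.Tensor,  # (K, T) 1 for "Analysis" tokens, 0 elsewhere
                         rewards: torch.Tensor,        # (K,)  correctness of the revised answers
                         eps_low: float = 0.2,
                         eps_high: float = 0.3) -> torch.Tensor:
    adv = ((rewards - rewards.mean()) / (rewards.std() + 1e-6)).unsqueeze(1)
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)  # wider upper bound -> more exploration
    surrogate = torch.min(ratio * adv, clipped * adv)
    # Only "Analysis" tokens contribute to the gradient; corrected-solution tokens are masked out.
    per_trajectory = (surrogate * analysis_mask).sum(-1) / analysis_mask.sum(-1).clamp(min=1)
    return -per_trajectory.mean()
```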
3. Algorithmic Workflow
The ScRPO procedure is summarized in pseudocode as follows:
- Initialize: Start with policy $\pi_\theta$, reference policy $\pi_{\text{ref}}$, filter bounds $(\alpha_{\text{low}}, \alpha_{\text{high}})$, and self-correction period $T$; set ErrorPool $\leftarrow \emptyset$.
- Training Iterations: For each batch:
- Sample $G$ answers per question under $\pi_{\theta_{\text{old}}}$.
- Compute rewards and group-relative advantages; apply GRPO update.
- Compute the empirical accuracy; if it falls within $(\alpha_{\text{low}}, \alpha_{\text{high}})$ and all answers are incorrect, add the $(q, o_i)$ pairs to ErrorPool.
- Self-correction Phase: Every $T$ iterations, if ErrorPool is non-empty:
- Sample mini-batch from ErrorPool.
- Prompt for error reflection and sample $K$ trajectories per error.
- Compute self-correction loss; update accordingly.
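Tying the pieces together, a high-level training loop might look like the following sketch. The callables `sample_group`, `grpo_update`, and `self_correction_update` are hypothetical stand-ins for the routines described above, and the default hyperparameters are illustrative rather than the paper's settings.

```python
import random
from typing import Callable

def train_scrpo(
    questions: list[str],
    sample_group: Callable[[str, int], tuple[list[str], list[bool]]],  # G rollouts + correctness
    grpo_update: Callable[[str, list[str], list[bool]], None],         # trial-and-error step
    self_correction_update: Callable[[str, str], None],                # reflection step on (q, wrong)
    *,
    group_size: int = 8,
    period: int = 10,
    alpha_low: float = 0.0,
    alpha_high: float = 0.5,
    num_steps: int = 1000,
) -> None:
    """High-level ScRPO loop: GRPO trial-and-error plus periodic self-correction."""
    error_pool: list[tuple[str, str]] = []
    for step in range(1, num_steps + 1):
        # Trial-and-error phase (GRPO).
        for q in random.sample(questions, k=min(32, len(questions))):
            solutions, correct = sample_group(q, group_size)
            grpo_update(q, solutions, correct)
            acc = sum(correct) / group_size
            if alpha_low <= acc <= alpha_high and not any(correct):    # variance-based filter
                error_pool.extend((q, s) for s in solutions)
        # Self-correction phase, every `period` steps.
        if step % period == 0 and error_pool:
            for q, wrong in random.sample(error_pool, k=min(16, len(error_pool))):
                self_correction_update(q, wrong)
```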
4. Experimental Setup and Evaluation Protocols
The model architectures used are DeepSeek-Distill-Qwen-1.5B and DeepSeek-Distill-Qwen-7B. The training set comprises 14,564 mixed-difficulty math problems drawn from MATH and DAPO-MATH. Training employs the AdamW optimizer with a batch size of 128, $G$ sampled responses per question, a sampling temperature of 0.6, and top-$p$ of 0.95. Experiments are run on 8 NVIDIA H200 GPUs.
Benchmarks for evaluation include GSM8k, MATH-500, AIME-2024, AMC10/12, and Olympiad. The evaluation protocol uses avg@$k$ sampling, with a separate $k$ for AIME/AMC and for the remaining benchmarks, and reports accuracy.
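Under the avg@$k$ protocol, accuracy is averaged over $k$ independent samples per problem and then over problems. A minimal sketch, assuming per-sample correctness flags for each problem are already available:

```python
def avg_at_k(correct_flags_per_problem: list[list[bool]]) -> float:
    """avg@k: mean over problems of the fraction of the k samples whose final answer is correct."""
    per_problem = [sum(flags) / len(flags) for flags in correct_flags_per_problem]
    return sum(per_problem) / len(per_problem)

# Example: two problems, k = 4 samples each -> (0.75 + 0.25) / 2 = 0.5
print(avg_at_k([[True, True, False, True], [False, False, True, False]]))
```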
5. Comparative Results
Across all considered benchmarks and both model sizes, ScRPO outperforms prior fine-tuning and post-training approaches such as SFT, DPO, GRPO, and DAPO. Accuracy improvements over the vanilla models and baseline methods are summarized in the following table:
Accuracy (%) on Benchmarks (Δ is relative to the vanilla model)
| Model | AIME24 Acc/Δ | AMC Acc/Δ | Olympiad Acc/Δ | GSM8k Acc/Δ | MATH-500 Acc/Δ | Avg Acc |
|---|---|---|---|---|---|---|
| Qwen-1.5B Vanilla | 28.7/— | 63.3/— | 47.0/— | 76.0/— | 79.6/— | 58.9 |
| Qwen-1.5B SFT | 27.7/-1.0 | 57.2/-6.1 | 42.5/-4.5 | 81.4/+5.4 | 78.8/-0.8 | 57.5 |
| Qwen-1.5B DPO | 24.0/-4.7 | 59.0/-4.3 | 44.0/-3.0 | 80.2/+4.2 | 78.6/-1.0 | 57.2 |
| Qwen-1.5B GRPO | 30.0/+1.3 | 66.6/+3.3 | 49.4/+2.4 | 83.5/+7.5 | 83.0/+3.4 | 62.5 |
| Qwen-1.5B DAPO | 26.7/-2.0 | 66.3/+3.0 | 46.6/-0.4 | 80.0/+4.0 | 85.8/+6.2 | 61.1 |
| Qwen-1.5B ScRPO | 34.2/+5.5 | 68.5/+5.2 | 52.0/+5.0 | 85.0/+9.0 | 84.1/+4.5 | 64.8 |
| Qwen-7B Vanilla | 51.3/— | 78.9/— | 63.1/— | 87.0/— | 92.0/— | 74.5 |
| Qwen-7B SFT | 48.7/-2.6 | 78.6/-0.3 | 62.7/-0.4 | 87.3/+0.3 | 91.4/-0.6 | 73.7 |
| Qwen-7B DPO | 53.0/+1.7 | 77.2/-1.7 | 60.4/-2.7 | 86.1/-0.9 | 90.0/-2.0 | 73.3 |
| Qwen-7B GRPO | 54.0/+2.7 | 82.1/+3.2 | 64.0/+0.9 | 90.2/+3.2 | 91.9/-0.1 | 76.4 |
| Qwen-7B DAPO | 53.3/+2.0 | 81.7/+2.8 | 63.4/+0.3 | 90.1/+3.1 | 91.6/-0.4 | 76.0 |
| Qwen-7B ScRPO | 57.5/+6.2 | 83.5/+4.6 | 65.1/+2.0 | 90.8/+3.8 | 92.3/+0.3 | 77.8 |
ScRPO achieves an average gain of +2.3 points over GRPO on Qwen-1.5B and +1.4 points on Qwen-7B.
6. Ablation Studies
Ablation studies demonstrate the criticality of two principal components:
- Variance-Based Filter (VBF): Removing this filter results in a performance drop of 1.1 points (1.5B) and 1.2 points (7B).
- Reflection-Only Loss Mask: Removing the mask that restricts the loss to "Analysis" tokens results in a drop of 1.4 points (1.5B) and 1.7 points (7B).
This suggests that both high-variance error selection and targeted analytic supervision are essential to the effectiveness of ScRPO.
7. Limitations, Extensions, and Broader Implications
ScRPO employs a synthetic error pool and a binary correctness reward, which may not capture the spectrum of error types present in real-world settings. Potential extensions include adapting ScRPO to continuous or multi-class reward schemas, application to other domains such as code generation or commonsense reasoning, and incorporating more sophisticated error representations.
Training complexity is increased by the additional self-correction phase, and calibrating the self-correction period $T$ and the clipping thresholds $(\epsilon_{\text{low}}, \epsilon_{\text{high}})$ requires careful tuning. The approach offers a paradigm shift towards models that "learn by reflecting," suggesting reduced dependence on human-annotated datasets and promoting transparency in model error correction. A plausible implication is enhanced data efficiency and reliability of LLMs on tasks with limited external feedback.
ScRPO presents a systematic method for self-improving AI systems in challenging reasoning domains, setting a precedent for further development in leveraging machine-generated error analysis as an explicit learning signal.