RGR-GRPO: Rubric-Driven RL for Multi-Domain Reasoning
- The paper introduces RGR-GRPO, which leverages detailed rubrics as dense rewards to overcome exploration collapse in multi-domain reinforcement learning tasks.
- It extends group-based policy optimization by integrating off-policy self-refinement to address sparse rewards and stabilize policy entropy.
- Empirical evaluations across 14 datasets in math, physics, chemistry, and general reasoning reveal significant performance gains over standard RL baselines.
RGR-GRPO (Reward and Guidance through Rubrics – Group Relative Policy Optimization) is a rubric-driven reinforcement learning (RL) framework that extends group-based policy optimization to multi-domain reasoning with large language models (LLMs). By using fine-grained, question-specific rubrics as both a dense reward source and a source of off-policy guidance, RGR-GRPO addresses the exploration and reward-sparsity limitations of standard verifiable-reward RL. The method is reported to improve both task performance and exploration dynamics across mathematics, the sciences, and general reasoning tasks (Bi et al., 15 Nov 2025).
1. Motivation and Problem Setting
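Conventional pipelines based on reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) have two core deficiencies when applied to multi-domain LLM reasoning: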
- Sparse, domain-limited rewards: Most prior RLHF and RLVR approaches use binary, sparse correctness rewards, which are only available in domains with verifiable solutions (such as math). Scientific and open-domain tasks often lack such automatic verifiers.
- Exploration collapse: Purely on-policy optimization (e.g., PPO or GRPO) tends to rapidly reduce policy entropy, leading to premature convergence and poor exploration. This narrows the solution search space and prevents the discovery of higher-quality reasoning strategies.
RGR-GRPO addresses these shortcomings by constructing per-question rubrics that both (a) provide meaningful, dense, and interpretable rewards for intermediate steps, and (b) support targeted off-policy self-refinement when on-policy rollouts fail rubric checks (Bi et al., 15 Nov 2025).
2. Formal Definition and Mathematical Structure
2.1 Group Relative Policy Optimization (GRPO) Foundation
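Let $q$ denote a training prompt. For each prompt, a group of $G$ answers $\{o_i\}_{i=1}^{G}$ is sampled from the old policy $\pi_{\theta_{\text{old}}}(\cdot \mid q)$. Each answer $o_i$ receives a scalar reward $R_i$. The group-relative advantage is computed as:

$$A_i = \frac{R_i - \operatorname{mean}\left(\{R_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{R_j\}_{j=1}^{G}\right)}$$

The clipped surrogate policy objective is:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\Big(r_{i,t}(\theta)\,A_i,\ \operatorname{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,A_i\Big)\right]$$

where $r_{i,t}(\theta) = \dfrac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}$ is the token-level importance ratio and $\epsilon$ is the clipping range.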
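The group-relative advantage and the clipped term can be illustrated with a few lines of plain Python. This is a minimal sketch rather than the authors' implementation: the function names, the `eps` stabilizer, and the default clipping range of 0.2 are assumptions chosen for illustration.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: A_i = (R_i - mean) / (std + eps).

    `eps` is an assumed numerical stabilizer for groups with identical rewards.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n  # population variance
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Clipped term for a single token: min(r * A, clip(r, 1 - e, 1 + e) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Example: four answers to one prompt, scored in [0, 1].
rewards = [1.0, 0.5, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
```

When every answer in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one way to see why sparse binary rewards can stall learning.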
2.2 Rubric-Driven Reward Specification
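Each question $q$ is associated with a rubric $\mathcal{R}_q = \{(c_k, w_k)\}_{k=1}^{K}$, where each $c_k$ is a testable criterion and $w_k$ is a positive weight. For any generated answer $o$, the $k$th binary signal is:

$$b_k(o) = \mathbb{1}\left[\,o \text{ satisfies } c_k\,\right] \in \{0, 1\}$$

The dense, normalized rubric reward is:

$$R(q, o) = \frac{\sum_{k=1}^{K} w_k\, b_k(o)}{\sum_{k=1}^{K} w_k}$$

If all factual criteria in $\mathcal{R}_q$ are satisfied, the full reward $R(q, o) = 1$ is assigned; otherwise, the weighted average across all criteria is used.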
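The reward rule described above can be sketched as a small function. The name `rubric_reward`, the parallel-list input format, and the full-credit value of 1.0 are assumptions made for illustration; the source states only that a full reward is assigned when every factual criterion passes and a weighted average is used otherwise.

```python
def rubric_reward(signals, weights, is_factual):
    """Dense rubric reward in [0, 1] for one answer.

    signals    : list of 0/1 judge outcomes, one per criterion
    weights    : list of positive criterion weights
    is_factual : list of bools, True where the criterion is a factual one
    """
    if not (len(signals) == len(weights) == len(is_factual)):
        raise ValueError("signals, weights and is_factual must align")
    factual = [b for b, f in zip(signals, is_factual) if f]
    # Assumed full-credit rule: every factual criterion is satisfied.
    if factual and all(factual):
        return 1.0
    # Otherwise: weighted fraction of satisfied criteria.
    return sum(w * b for w, b in zip(weights, signals)) / sum(weights)

# One factual criterion fails, so the weighted average (2 + 1) / 4 applies.
partial = rubric_reward([1, 0, 1], [2.0, 1.0, 1.0], [True, True, False])
```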
2.3 RGR-GRPO Mixed On-/Off-policy Update
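For each prompt $q$, $G-1$ on-policy samples $\{o_i\}_{i=1}^{G-1}$ are scored, and the sample with the highest rubric reward, $o^*$, is selected. If $o^*$ passes all rubric criteria, standard GRPO is performed with a final on-policy sample $o_G$. Otherwise, the failed criteria $\mathcal{F}(o^*)$ (the rubric items that $o^*$ does not satisfy) guide self-refinement by generating an off-policy refined answer $o^{\text{refine}}$. The mixed-policy update objective combines the clipped GRPO term on the $G-1$ on-policy samples with an off-policy term on $o^{\text{refine}}$; its explicit form is given in (Bi et al., 15 Nov 2025).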
3. Rubric Construction and Integration
Rubrification proceeds in two LLM-assisted stages:
- Reference answer creation: A strong “expert” LLM produces a canonical solution $a^{\text{ref}}$ for each prompt $q$.
- Rubric extraction: An LLM extracts 3–10 rubric items $(c_k, w_k)$ from $a^{\text{ref}}$, partitioned into Factual (final/intermediate facts) and Process (solution steps).
During training, an “LLM-as-judge” applies each rubric criterion to candidate rollouts, producing a dense reward signal for each sample.
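A rubric and its application by a judge can be represented with a small data structure. The sketch below is illustrative only: `RubricItem`, `judge_rollout`, and `failed_criteria` are hypothetical names, and `keyword_judge` is a trivial substring check standing in for the LLM judge, whose prompt and interface the source does not describe.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class RubricItem:
    criterion: str   # testable statement about the answer
    weight: float    # positive weight
    kind: str        # "factual" or "process"

def judge_rollout(answer: str, rubric: List[RubricItem],
                  judge: Callable[[str, str], bool]) -> List[int]:
    """Apply the judge to every criterion; return one 0/1 signal per item."""
    return [1 if judge(answer, item.criterion) else 0 for item in rubric]

def failed_criteria(rubric: List[RubricItem], signals: List[int]) -> List[RubricItem]:
    """Items the answer did not satisfy (usable as refinement guidance)."""
    return [item for item, b in zip(rubric, signals) if b == 0]

def keyword_judge(answer: str, criterion: str) -> bool:
    """Toy stand-in for an LLM judge: case-insensitive substring match."""
    return criterion.lower() in answer.lower()

rubric = [
    RubricItem("x = 4", 2.0, "factual"),
    RubricItem("subtract 3 from both sides", 1.0, "process"),
]
good = "Subtract 3 from both sides to get 2x = 8, so x = 4."
bad = "Dividing first gives x = 5."
good_signals = judge_rollout(good, rubric, keyword_judge)
bad_signals = judge_rollout(bad, rubric, keyword_judge)
```

Keeping the judge behind a plain callable makes it straightforward to swap the toy check for a real model call without touching the reward code.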
4. Training Algorithm and Implementation Details
Main training protocol:
- Batch size: typically 96 prompts per update.
- Sampling: For each prompt, $G = 8$ rollouts (7 on-policy, 1 off-policy for failed rubrics).
- Rewards: Rubric scores as above.
- On-policy and mixed on/off-policy updates as described in Section 2.
- Optimizer: Adam with learning rate $\eta$.
- Evaluation: Greedy decoding, exact match (EM) and Pass@$k$ metrics.
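The two evaluation metrics named in the protocol can be sketched as follows. The source does not say which Pass@k estimator or which answer normalization it uses, so both choices here are assumptions: `pass_at_k` is the widely used combinatorial estimator computed from `n` samples of which `c` are correct, and `exact_match` compares answers after trimming whitespace and lowercasing.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples containing c correct ones, is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def exact_match(prediction: str, reference: str) -> bool:
    """Assumed normalization: strip surrounding whitespace, ignore case."""
    return prediction.strip().lower() == reference.strip().lower()

# 4 samples, 1 correct: half of the size-2 subsets include the correct one.
example = pass_at_k(n=4, c=1, k=2)
```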
Algorithmic pseudocode (simplified):
- For each prompt $q$, sample $G-1$ trajectories, compute rubric rewards, select the highest-reward sample $o^*$.
- If $o^*$ passes rubric, sample $o_G$ on-policy and proceed with standard GRPO.
- Else, generate $o^{\text{refine}}$ via rubric-guided refinement, compute reward, and update via mixed-policy objective as above.
- Update model parameters via gradient ascent on $\mathcal{J}(\theta)$.
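The rollout-assembly part of these steps can be sketched in Python. Everything below is a hedged illustration: `build_group` and the three callables (`sample`, `score`, `refine`) are hypothetical interfaces, the toy stand-ins replace real model and judge calls, and the gradient update itself is omitted because the source gives the mixed-policy objective only by reference.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (not from the source):
#   sample(prompt)                 -> answer drawn from the current policy
#   score(prompt, answer)          -> (rubric_reward, failed_criteria)
#   refine(prompt, answer, failed) -> answer revised using the failed criteria
Sample = Callable[[str], str]
Score = Callable[[str, str], Tuple[float, List[str]]]
Refine = Callable[[str, str, List[str]], str]

def build_group(prompt: str, sample: Sample, score: Score, refine: Refine,
                group_size: int = 8):
    """Return (answers, rewards, off_policy_flags) for one prompt."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    # Steps 1: G-1 on-policy rollouts, scored against the rubric.
    answers = [sample(prompt) for _ in range(group_size - 1)]
    scored = [score(prompt, a) for a in answers]
    best = max(range(len(answers)), key=lambda i: scored[i][0])
    failed = scored[best][1]
    if not failed:
        # Step 2: best rollout passes every criterion, stay on-policy.
        final, off_policy = sample(prompt), False
    else:
        # Step 3: refine the best rollout using its failed criteria.
        final, off_policy = refine(prompt, answers[best], failed), True
    answers.append(final)
    scored.append(score(prompt, final))
    rewards = [reward for reward, _ in scored]
    flags = [False] * (group_size - 1) + [off_policy]
    return answers, rewards, flags

# Toy stand-ins so the sketch runs without a model.
def toy_sample(prompt: str) -> str:
    return "draft"

def toy_score(prompt: str, answer: str) -> Tuple[float, List[str]]:
    if "final answer" in answer:
        return 1.0, []
    return 0.5, ["state the final answer"]

def toy_refine(prompt: str, answer: str, failed: List[str]) -> str:
    return answer + " | final answer: 4"

answers, rewards, flags = build_group("2 + 2 = ?", toy_sample, toy_score, toy_refine)
```

The returned flags mark which sample would need off-policy treatment in the update, so a downstream loss can handle the refined answer separately from the on-policy rollouts.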
5. Empirical Evaluation and Performance
RGR-GRPO has been comprehensively benchmarked across 14 multi-domain reasoning datasets spanning mathematics, physics, chemistry, and general tasks (Bi et al., 15 Nov 2025). Results, summarized in the following table, demonstrate robust outperformance over both verifiable-outcome RL baselines and alternative rubric-based or critique-based schemes.
| Method | Math (↑) | Physics (↑) | Chemistry (↑) | General (↑) | Overall (↑) |
|---|---|---|---|---|---|
| Outcome-GRPO | 61.2 | 47.0 | 47.0 | 52.3 | 52.3 |
| Rubric-GRPO | 65.3 | 47.9 | 47.9 | 53.7 | 53.7 |
| Critique-GRPO | 61.2 | 45.7 | 45.7 | 50.6 | 50.6 |
| LUFFY | 60.5 | 46.4 | 46.4 | 50.5 | 50.5 |
| RGR-GRPO | 67.9 | 48.8 | 48.8 | 55.8 | 55.8 |
RGR-GRPO delivers average improvements of +7.0% (math), +5.4% (physics), +8.4% (chemistry), and +6.6% (general tasks) over the Outcome-GRPO baseline.
6. Dynamics, Ablations, and Theoretical Properties
Exploration and Entropy
RGR-GRPO maintains stable entropy levels in [0.2, 0.4] throughout training, avoiding the entropy collapse characteristic of pure on-policy methods (which fall to ≈0.1). This signifies sustained and diversified exploration.
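For reference, policy entropy is commonly tracked as the Shannon entropy of the next-token distribution averaged over generated positions. The sketch below assumes that definition, measured in nats; the source does not state the units or the averaging scheme behind the reported values, and the function names are illustrative.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_policy_entropy(step_distributions):
    """Average per-token entropy over all generated positions."""
    return sum(token_entropy(p) for p in step_distributions) / len(step_distributions)

# A nearly deterministic step next to a more exploratory one.
peaked = [0.98, 0.01, 0.01]
spread = [0.5, 0.3, 0.2]
average = mean_policy_entropy([peaked, spread])
```

A policy whose steps all resemble `peaked` has entropy close to zero and rarely samples alternative reasoning paths, which is the collapse behaviour the paragraph above describes.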
Ablation Findings
- Removing Exploration Assessment (EA) destabilizes training and degrades test accuracy by −1.0 points.
- Omitting the off-policy shaping term reduces performance by −1.3 points.
- Restricting the rubric to factual criteria alone causes a −0.6 point drop.
Each ablated component therefore contributes measurably to the full multi-domain improvement.
Robustness and Generality
The framework maintains superior pass@$k$ extrapolation (i.e., high-quality diverse outputs as $k$ grows). Gains persist across all 14 tested datasets, including out-of-distribution scientific and general reasoning domains.
7. Context and Comparative Significance
RGR-GRPO generalizes group-based policy optimization to settings that lack reliable verifiers, combining dense, interpretable, domain-adaptive reward shaping with hybrid on-/off-policy exploration. In the reported experiments it outperforms baselines such as Rubric-GRPO (rubric reward used with on-policy-only GRPO), critique augmentation, and alternative hybrid RL schemes. The method does not rely on reward model calibration or noise correction as in noise-corrected GRPO variants (Mansouri et al., 21 Oct 2025); instead, it introduces algorithmic machinery tailored to using rubrics as both reward and guidance. A plausible implication is that, as LLM benchmarks demand more cross-domain and higher-order reasoning, rubric-guided frameworks such as RGR-GRPO may become a common choice for exploration-aware training (Bi et al., 15 Nov 2025).