RGR-GRPO: Rubric-Driven RL for Multi-Domain Reasoning

Updated 14 June 2026

The paper introduces RGR-GRPO, which leverages detailed rubrics as dense rewards to overcome exploration collapse in multi-domain reinforcement learning tasks.
It extends group-based policy optimization by integrating off-policy self-refinement to address sparse rewards and stabilize policy entropy.
Empirical evaluations across 14 datasets in math, physics, chemistry, and general reasoning reveal significant performance gains over standard RL baselines.

RGR-GRPO (Reward and Guidance through Rubrics – Group Relative Policy Optimization) is a rubric-driven reinforcement learning (RL) framework that extends group-based policy optimization to multi-domain reasoning with large language models (LLMs). By using fine-grained, question-specific rubrics as both dense reward sources and off-policy guidance, RGR-GRPO addresses the exploration and reward-sparsity limitations of standard verifiable-reward RL. The paper reports consistently better performance and exploration dynamics across mathematics, the sciences, and general reasoning tasks (Bi et al., 15 Nov 2025).

1. Motivation and Problem Setting

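Conventional pipelines for LLM reasoning based on reinforcement learning from human feedback (RLHF) or reinforcement learning with verifiable rewards (RLVR) have two core deficiencies in multi-domain settings: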

Sparse, domain-limited rewards: Most prior RLHF and RLVR approaches use binary, sparse correctness rewards, which are only available in domains with verifiable solutions (such as math). Scientific and open-domain tasks often lack such automatic verifiers.
Exploration collapse: Purely on-policy optimization (e.g., PPO or GRPO) tends to rapidly reduce policy entropy, leading to premature convergence and poor exploration. This narrows the solution search space and prevents the discovery of higher-quality reasoning strategies.

RGR-GRPO addresses these shortcomings by constructing per-question rubrics that both (a) provide meaningful, dense, and interpretable rewards for intermediate steps, and (b) support targeted off-policy self-refinement when on-policy rollouts fail rubric checks (Bi et al., 15 Nov 2025).

2. Formal Definition and Mathematical Structure

2.1 Group Relative Policy Optimization (GRPO) Foundation

Let $q \sim \mathcal{D}$ denote a training prompt. For each prompt, a group of $G$ answers $\{o_i\}_{i=1}^G$ is sampled from $\pi_{\theta_{\text{old}}}$. Each answer receives a scalar reward $r_i$. The group-relative advantage is computed as:

$$A_i = \frac{r_i - \frac{1}{G}\sum_{j=1}^G r_j}{\sqrt{\frac{1}{G}\sum_{j=1}^G \left(r_j - \frac{1}{G}\sum_{k=1}^G r_k\right)^2}}$$

The clipped surrogate policy objective is:

$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{q, o_i}\left[ \sum_{i=1}^G \min\left(r_i(\theta)A_i,\ \operatorname{clip}(r_i(\theta), 1-\epsilon, 1+\epsilon)A_i\right) - \beta D_{\mathrm{KL}}(\pi_{\theta}\,\|\,\pi_{\theta_{\text{old}}}) \right]$$

where $r_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}$ is the importance ratio (not to be confused with the scalar reward $r_i$).
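As a minimal sketch of the two formulas above, the following pure-Python snippet computes group-relative advantages and the clipped per-sample surrogate term. The function names, the `eps_std` guard against zero variance, and the default `clip_eps=0.2` are illustrative assumptions, not values taken from the paper.

```python
import math

def group_relative_advantages(rewards, eps_std=1e-8):
    """A_i = (r_i - mean) / std over one group (population std, as in the formula)."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = math.sqrt(var)
    # eps_std is an assumption: it avoids division by zero when all rewards tie.
    return [(r - mean) / (std + eps_std) for r in rewards]

def clipped_term(ratio, advantage, clip_eps=0.2):
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for one sample."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# One group of G = 4 answers with rewards in [0, 1].
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

When every answer in a group receives the same reward, all advantages are zero and the group contributes no policy gradient, which is one reason dense rewards that separate answers within a group are useful.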

2.2 Rubric-Driven Reward Specification

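Each question $q$ is associated with a rubric $C = \{(d_k, w_k)\}_{k=1}^K$, where each $d_k$ is a testable criterion and $w_k$ is a positive weight. For any generated answer $o$, the $k$th binary signal is:

$$s_k(o) = \begin{cases} 1 & \text{if } o \text{ satisfies } d_k \\ 0 & \text{otherwise} \end{cases}$$

The dense, normalized rubric reward is the weighted average of these signals:

$$R(o) = \frac{\sum_{k=1}^K w_k\, s_k(o)}{\sum_{k=1}^K w_k}$$

If all factual criteria in $C$ are satisfied, the full reward $R(o) = 1$ is assigned; otherwise, the weighted average across all criteria is used.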
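A minimal sketch of the rubric reward described above: a weighted average of binary criterion outcomes, with the full score assigned when every factual criterion passes. The `Criterion` class, the `kind` field, the example criteria, and the full-score value of 1.0 are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str   # d_k: a testable statement about the answer
    weight: float      # w_k: positive weight
    kind: str          # "factual" or "process" (this labeling is an assumption)

def rubric_reward(rubric, passed):
    """Return a reward in [0, 1] from per-criterion pass/fail flags."""
    if len(rubric) != len(passed):
        raise ValueError("one flag per criterion is required")
    factual = [p for c, p in zip(rubric, passed) if c.kind == "factual"]
    # Assumption: full score when every factual criterion is satisfied.
    if factual and all(factual):
        return 1.0
    total = sum(c.weight for c in rubric)
    return sum(c.weight for c, p in zip(rubric, passed) if p) / total

rubric = [
    Criterion("final answer equals 42", 2.0, "factual"),
    Criterion("states the conservation law used", 1.0, "process"),
    Criterion("intermediate velocity is 3 m/s", 1.0, "factual"),
]
print(rubric_reward(rubric, [True, False, False]))  # 0.5
print(rubric_reward(rubric, [True, False, True]))   # 1.0
```

The first call fails one factual criterion, so the weighted average 2 / 4 applies; the second satisfies both factual criteria and receives the full score even though the process criterion failed.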

2.3 RGR-GRPO Mixed On-/Off-policy Update

For each prompt $q$, $G-1$ on-policy samples $\{o_i\}_{i=1}^{G-1}$ are scored against the rubric, and the highest-scoring sample is selected. If the selected sample passes all rubric criteria, standard GRPO is performed with a final on-policy sample. Otherwise, the failed criteria guide self-refinement, which generates an off-policy refined answer that completes the group. The mixed-policy update objective then combines the on-policy terms with a shaped term for the off-policy sample; its explicit form and the two accompanying definitions did not survive extraction in this copy.

3. Rubric Construction and Integration

Rubrification proceeds in two LLM-assisted stages:

Reference answer creation: A strong “expert” LLM produces a canonical reference solution for each prompt $q$.
Rubric extraction: An LLM extracts 3–10 rubric items $(d_k, w_k)$ from the reference solution, partitioned into Factual (final/intermediate facts) and Process (solution steps).

During training, an “LLM-as-judge” applies each rubric criterion to candidate rollouts, producing a dense reward signal for each sample.
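The judging step can be pictured as a function applied once per criterion. The sketch below is only an interface illustration: the `Judge` signature is hypothetical, and `keyword_judge` is a toy stand-in for what would be an LLM call in practice.

```python
from typing import Callable, List, Tuple

# A judge maps (criterion_text, answer_text) -> bool. In practice this would
# be an LLM call; the signature here is a hypothetical interface.
Judge = Callable[[str, str], bool]

def apply_rubric(criteria: List[str], answer: str, judge: Judge) -> Tuple[List[bool], List[str]]:
    """Return per-criterion pass flags and the list of failed criteria."""
    flags = [judge(c, answer) for c in criteria]
    failed = [c for c, ok in zip(criteria, flags) if not ok]
    return flags, failed

def keyword_judge(criterion: str, answer: str) -> bool:
    """Toy stand-in for an LLM judge: passes if the criterion text appears verbatim."""
    return criterion.lower() in answer.lower()

criteria = ["x = 4", "substitute into the second equation"]
flags, failed = apply_rubric(criteria, "We get x = 4 directly.", keyword_judge)
print(flags)   # [True, False]
print(failed)  # ['substitute into the second equation']
```

Returning the failed criteria alongside the flags is a design choice that lets the same judging pass feed both the reward and any later rubric-guided refinement.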

4. Training Algorithm and Implementation Details

Main training protocol:

Batch size: typically 96 prompts per update.
Sampling: For each prompt, $G = 8$ rollouts (7 on-policy, 1 off-policy for failed rubrics).
Rewards: Rubric scores as above.
On-policy and mixed on/off-policy updates as described in Section 2.
Optimizer: Adam with learning rate [value lost in extraction].
Evaluation: Greedy decoding, exact match (EM) and Pass@$k$ metrics.
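For the evaluation metrics listed above, a common way to compute Pass@k from n sampled generations is the combinatorial estimator 1 − C(n−c, k) / C(n, k), where c is the number of correct generations. Whether the paper uses exactly this estimator, and the whitespace-only normalization in `exact_match`, are assumptions of this sketch.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generations of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: every draw contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

def exact_match(prediction: str, reference: str) -> bool:
    """Exact match after trimming surrounding whitespace (normalization is an assumption)."""
    return prediction.strip() == reference.strip()

print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
print(exact_match(" 42\n", "42"))
```

With k = 1 the estimator reduces to the fraction of correct generations, c / n.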

Algorithmic pseudocode (simplified):

For each prompt $q$, sample $G-1$ trajectories, compute their rubric rewards, and select the highest-scoring trajectory.
If the selected trajectory passes the rubric, sample one more trajectory on-policy and proceed with standard GRPO.
Otherwise, generate a refined trajectory via rubric-guided refinement, compute its reward, and update via the mixed-policy objective of Section 2.3.
Update the model parameters by gradient ascent on the resulting objective.
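The steps above can be sketched as a group-construction routine. This is a control-flow illustration only: `sample`, `score`, and `refine` are hypothetical callables standing in for the policy rollout, the rubric judge, and rubric-guided refinement, and the choice to refine the highest-scoring rollout is an assumption. The policy update itself is not shown.

```python
from typing import Callable, List, Tuple

def build_group(
    prompt: str,
    group_size: int,
    sample: Callable[[str], str],                          # hypothetical on-policy rollout
    score: Callable[[str, str], Tuple[float, List[str]]],  # hypothetical: (reward, failed criteria)
    refine: Callable[[str, str, List[str]], str],          # hypothetical rubric-guided refinement
):
    """Return (answers, rewards, off_policy_flags) for one prompt."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    answers = [sample(prompt) for _ in range(group_size - 1)]
    scored = [score(prompt, a) for a in answers]
    rewards = [r for r, _ in scored]
    best = max(range(len(answers)), key=lambda i: rewards[i])
    failed = scored[best][1]
    if not failed:
        # Best rollout passes every criterion: finish the group on-policy.
        last, is_off_policy = sample(prompt), False
    else:
        # Otherwise the failed criteria guide a refined, off-policy answer.
        last, is_off_policy = refine(prompt, answers[best], failed), True
    answers.append(last)
    rewards.append(score(prompt, last)[0])
    flags = [False] * (group_size - 1) + [is_off_policy]
    return answers, rewards, flags

# Toy stand-ins so the sketch runs end to end.
def toy_sample(prompt):
    return "draft"

def toy_score(prompt, answer):
    return (1.0, []) if answer == "refined" else (0.5, ["final answer missing"])

def toy_refine(prompt, best_answer, failed):
    return "refined"

answers, rewards, flags = build_group("q", 4, toy_sample, toy_score, toy_refine)
print(answers, rewards, flags)
```

The returned flags mark which sample would be routed through the off-policy term of the mixed objective; the remaining samples use the standard clipped on-policy term.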

5. Empirical Evaluation and Performance

RGR-GRPO has been comprehensively benchmarked across 14 multi-domain reasoning datasets spanning mathematics, physics, chemistry, and general tasks (Bi et al., 15 Nov 2025). Results, summarized in the following table, demonstrate robust outperformance over both verifiable-outcome RL baselines and alternative rubric-based or critique-based schemes.

Method	Math (↑)	Physics (↑)	Chemistry (↑)	General (↑)	Overall (↑)
Outcome-GRPO	61.2	47.0	47.0	52.3	52.3
Rubric-GRPO	65.3	47.9	47.9	53.7	53.7
Critique-GRPO	61.2	45.7	45.7	50.6	50.6
LUFFY	60.5	46.4	46.4	50.5	50.5
RGR-GRPO	67.9	48.8	48.8	55.8	55.8

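The paper reports average improvements of +7.0% (math), +5.4% (physics), +8.4% (chemistry), and +6.6% (general tasks) over the Outcome-GRPO baseline (Bi et al., 15 Nov 2025). These figures do not coincide with the column differences in the table above (for Math, 67.9 − 61.2 = 6.7).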

6. Dynamics, Ablations, and Theoretical Properties

Exploration and Entropy

RGR-GRPO maintains stable entropy levels in [0.2, 0.4] throughout training, avoiding the entropy collapse characteristic of pure on-policy methods (which fall to ≈0.1). This signifies sustained and diversified exploration.
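Policy entropy of this kind is typically tracked as the mean Shannon entropy of the next-token distribution over generated positions. The sketch below shows that computation from raw logits; the use of nats and a simple per-token average are assumptions, since the source does not state how the reported values were measured.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one token position."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_policy_entropy(logits_per_token):
    """Average token entropy over the positions of one generated sequence."""
    return sum(token_entropy(l) for l in logits_per_token) / len(logits_per_token)

peaked = [10.0, 0.0, 0.0, 0.0]   # near-deterministic distribution
flat = [0.0, 0.0, 0.0, 0.0]      # uniform distribution over 4 tokens
print(token_entropy(peaked), token_entropy(flat))
print(mean_policy_entropy([peaked, flat]))
```

A collapsing policy shows up as this average drifting toward zero, because most positions become as peaked as the first example.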

Ablation Findings

Removing Exploration Assessment (EA) destabilizes training and lowers test accuracy by 1.0 points.
Omitting the off-policy shaping term reduces performance by 1.3 points.
Restricting the rubric to factual criteria alone causes a 0.6-point drop.

All rubric-driven components are critical for realizing the full multi-domain improvements.

Robustness and Generality

The framework maintains superior pass@$k$ extrapolation (i.e., high-quality, diverse outputs as $k$ grows). Gains persist across all 14 tested datasets, including out-of-distribution scientific and general reasoning domains.

7. Context and Comparative Significance

RGR-GRPO generalizes group-based policy optimization to settings lacking reliable verifiers, combining dense, interpretable, and domain-adaptive reward shaping with hybrid on-/off-policy exploration. In the reported experiments it improves over baselines such as Rubric-GRPO (rubric reward used with on-policy-only GRPO), critique augmentation, and alternative hybrid RL schemes. The method does not rely on reward model calibration or noise correction as in noise-corrected GRPO variants (Mansouri et al., 21 Oct 2025), but introduces algorithmic machinery specifically tailored to using rubrics as both reward and guidance. A plausible implication is that as LLM benchmarks grow in their demand for cross-domain and higher-order reasoning, rubric-guided frameworks such as RGR-GRPO may become common practice for exploration-aware training (Bi et al., 15 Nov 2025).


References

Bi et al. (2025). Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning.

Mansouri et al. (2025). Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients.
