RGR-GRPO: Rubric-Driven RL for Multi-Domain Reasoning

Updated 14 June 2026
  • The paper introduces RGR-GRPO, which leverages detailed rubrics as dense rewards to overcome exploration collapse in multi-domain reinforcement learning tasks.
  • It extends group-based policy optimization by integrating off-policy self-refinement to address sparse rewards and stabilize policy entropy.
  • Empirical evaluations across 14 datasets in math, physics, chemistry, and general reasoning reveal significant performance gains over standard RL baselines.

RGR-GRPO (Reward and Guidance through Rubrics – Group Relative Policy Optimization) is a rubric-driven reinforcement learning (RL) framework that extends group-based policy optimization to multi-domain reasoning with large language models (LLMs). By using fine-grained, question-specific rubrics as both dense reward sources and off-policy guidance, RGR-GRPO addresses the exploration and reward-sparsity limitations of standard verifiable-reward RL. The method is reported to improve both accuracy and exploration dynamics across mathematics, science, and general reasoning tasks (Bi et al., 15 Nov 2025).

1. Motivation and Problem Setting

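Conventional pipelines for LLM reasoning based on reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) have two core deficiencies in multi-domain settings: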

  • Sparse, domain-limited rewards: Most prior RLHF and RLVR approaches use binary, sparse correctness rewards, which are only available in domains with verifiable solutions (such as math). Scientific and open-domain tasks often lack such automatic verifiers.
  • Exploration collapse: Purely on-policy optimization (e.g., PPO or GRPO) tends to rapidly reduce policy entropy, leading to premature convergence and poor exploration. This narrows the solution search space and prevents the discovery of higher-quality reasoning strategies.

RGR-GRPO addresses these shortcomings by constructing per-question rubrics that both (a) provide meaningful, dense, and interpretable rewards for intermediate steps, and (b) support targeted off-policy self-refinement when on-policy rollouts fail rubric checks (Bi et al., 15 Nov 2025).

2. Formal Definition and Mathematical Structure

2.1 Group Relative Policy Optimization (GRPO) Foundation

Let $q \sim \mathcal{D}$ denote a training prompt. For each prompt, a group of $G$ answers $\{o_i\}_{i=1}^G$ is sampled from $\pi_{\theta_{\text{old}}}$. Each answer receives a scalar reward $r_i$. The group-relative advantage is computed as:

$$A_i = \frac{r_i - \frac1G\sum_{j=1}^G r_j}{\sqrt{\frac1G\sum_{j=1}^G \left(r_j-\frac1G\sum_{k=1}^G r_k\right)^2}}$$

The clipped surrogate policy objective is:

$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{q, o_i}\left[ \sum_{i=1}^G \min\left(r_i(\theta)A_i, \operatorname{clip}(r_i(\theta),1-\epsilon,1+\epsilon)A_i\right) - \beta D_{KL}(\pi_{\theta}\|\pi_{\text{old}}) \right]$$

where $r_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}$ is the importance ratio between the current and old policies, distinct from the scalar reward $r_i$.
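
The two GRPO quantities above can be checked numerically with a short sketch. It uses plain Python floats rather than tensors; the stabilizer `eps` and the default `clip_eps=0.2` are illustrative assumptions, not values taken from the source.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r_i - mean) / std.

    Uses the population standard deviation (1/G), matching the formula above.
    `eps` is an assumed stabilizer for groups where all rewards are equal.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_term(ratio, advantage, clip_eps=0.2):
    """Per-sample surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)

if __name__ == "__main__":
    advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
    print(advantages)
    print(clipped_term(1.5, advantages[0]))
```

Because advantages are standardized within the group, a group in which every answer receives the same reward contributes zero advantage for every sample, which is one way to see why sparse binary rewards give little learning signal.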

2.2 Rubric-Driven Reward Specification

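Each question $q$ is associated with a rubric $C = \{(d_k, w_k)\}_{k=1}^K$, where each $d_k$ is a testable criterion and $w_k$ is a positive weight. For any generated answer $o$, the $k$th binary signal is:

$$s_k(o) = \begin{cases} 1 & \text{if } o \text{ satisfies } d_k \\ 0 & \text{otherwise} \end{cases}$$

The dense, normalized rubric reward is the weighted average of these signals:

$$r(o) = \frac{\sum_{k=1}^K w_k\, s_k(o)}{\sum_{k=1}^K w_k}$$

If all factual criteria in $C$ are satisfied, the full reward is assigned; otherwise, the weighted average across all criteria is used.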
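
A minimal sketch of this reward rule follows, assuming the binary judge signals are already available. The `Criterion` class, the `kind` field, and the value `full_reward=1.0` are hypothetical choices made for illustration; the source states only that a full reward is assigned when all factual criteria pass and that the weighted average is used otherwise.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str   # testable statement d_k
    weight: float      # positive weight w_k
    kind: str          # "factual" or "process" (assumed labels)

def rubric_reward(rubric, signals, full_reward=1.0):
    """Return the rubric reward for one answer.

    `signals[k]` is 1 if the answer satisfies rubric[k], else 0.
    `full_reward` is an assumed value for the all-factual-criteria-pass case.
    """
    if len(rubric) != len(signals):
        raise ValueError("one signal is required per criterion")
    factual = [s for c, s in zip(rubric, signals) if c.kind == "factual"]
    if factual and all(s == 1 for s in factual):
        return full_reward
    total = sum(c.weight for c in rubric)
    return sum(c.weight * s for c, s in zip(rubric, signals)) / total

if __name__ == "__main__":
    rubric = [
        Criterion("final answer is 10 N", 2.0, "factual"),
        Criterion("uses F = ma", 1.0, "factual"),
        Criterion("states assumptions", 1.0, "process"),
    ]
    print(rubric_reward(rubric, [1, 0, 1]))  # one factual criterion fails
    print(rubric_reward(rubric, [1, 1, 0]))  # all factual criteria pass
```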

2.3 RGR-GRPO Mixed On-/Off-policy Update

For each prompt $q$, $G-1$ on-policy samples $\{o_i\}_{i=1}^{G-1}$ are scored against the rubric, and the highest-scoring sample $o_{\text{best}}$ is selected. If $o_{\text{best}}$ passes all rubric criteria, standard GRPO is performed with a final on-policy sample $o_G$. Otherwise, the failed criteria $C_{\text{fail}} \subseteq C$ guide self-refinement by generating an off-policy refined answer $o_{\text{refine}}$ that completes the group. The policy is then updated with a mixed-policy objective over the on-policy samples and the refined sample (Bi et al., 15 Nov 2025).
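
The refinement trigger can be illustrated with a small sketch: collect the criteria that the selected draft failed and fold them into a revision request. Both function names and the prompt wording are hypothetical; the source does not specify how the refinement prompt is phrased.

```python
def failed_criteria(criteria, signals):
    """Return the criteria whose binary signal is 0 for the selected draft."""
    return [c for c, s in zip(criteria, signals) if s == 0]

def build_refinement_prompt(question, draft, failed):
    """Assemble a revision request listing the unmet criteria.

    The template below is an assumed format, shown only for illustration.
    """
    lines = [
        f"Question: {question}",
        f"Previous answer: {draft}",
        "Revise the answer so that it satisfies each of the following:",
    ]
    lines += [f"- {c}" for c in failed]
    return "\n".join(lines)

if __name__ == "__main__":
    criteria = ["uses F = ma", "reports the result in newtons"]
    failed = failed_criteria(criteria, [1, 0])
    print(build_refinement_prompt("Find the force.", "F = 10", failed))
```

An empty list from `failed_criteria` corresponds to the case where the draft passes the rubric and no refinement is needed.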

3. Rubric Construction and Integration

Rubrification proceeds in two LLM-assisted stages:

  1. Reference answer creation: A strong “expert” LLM produces a canonical reference solution for each prompt $q$.
  2. Rubric extraction: An LLM extracts 3–10 rubric items $(d_k, w_k)$ from the reference solution, partitioned into Factual (final/intermediate facts) and Process (solution steps).

During training, an “LLM-as-judge” applies each rubric criterion to candidate rollouts, producing a dense reward signal for each sample.
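
As a rough illustration of the judging step, the sketch below applies each criterion to a candidate answer through a pluggable `judge` callable and returns one binary signal per criterion. The `keyword_judge` function is a toy stand-in used only to make the example runnable; in the described setup the judge would be an LLM call, whose interface is not specified in the source.

```python
from typing import Callable, List

# A judge takes (answer, criterion) and decides whether the criterion is met.
Judge = Callable[[str, str], bool]

def apply_rubric(answer: str, criteria: List[str], judge: Judge) -> List[int]:
    """Return one 0/1 signal per criterion for a single candidate answer."""
    return [1 if judge(answer, criterion) else 0 for criterion in criteria]

def keyword_judge(answer: str, criterion: str) -> bool:
    """Toy judge (assumption): the criterion text must appear in the answer."""
    return criterion.lower() in answer.lower()

if __name__ == "__main__":
    answer = "The force is F = ma = 10 N"
    criteria = ["F = ma", "10 N", "units of joules"]
    print(apply_rubric(answer, criteria, keyword_judge))
```

Keeping the judge behind a callable makes it straightforward to swap the toy check for a model-backed one without touching the scoring code.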

4. Training Algorithm and Implementation Details

Main training protocol:

  • Batch size: typically 96 prompts per update.
  • Sampling: For each prompt, $G = 8$ rollouts (7 on-policy, 1 off-policy for failed rubrics).
  • Rewards: Rubric scores as above.
  • On-policy and mixed on/off-policy updates as described in Section 2.
  • Optimizer: Adam (learning rate as reported in Bi et al., 15 Nov 2025).
  • Evaluation: Greedy decoding, exact match (EM) and Pass@$k$ metrics.
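
The protocol above could be captured in a small configuration object, sketched here. The class and field names are hypothetical, and the learning rate is deliberately left out because its value is not available in this text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Assumed container for the settings listed above (names are illustrative)."""
    prompts_per_update: int = 96
    on_policy_rollouts: int = 7
    off_policy_rollouts: int = 1   # used only when the rubric check fails
    optimizer: str = "adam"
    greedy_eval: bool = True

    @property
    def group_size(self) -> int:
        """Total rollouts per prompt."""
        return self.on_policy_rollouts + self.off_policy_rollouts

    @property
    def rollouts_per_update(self) -> int:
        """Total rollouts scored by the judge in one update."""
        return self.prompts_per_update * self.group_size

if __name__ == "__main__":
    cfg = TrainConfig()
    print(cfg.group_size, cfg.rollouts_per_update)
```

With these settings each update involves 96 × 8 = 768 rollouts, each of which must be scored against its rubric, so judge throughput is a practical consideration.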

Algorithmic pseudocode (simplified):

  1. For each prompt $q$, sample $G-1$ trajectories, compute rubric rewards, and select the highest-scoring trajectory $o_{\text{best}}$.
  2. If $o_{\text{best}}$ passes the rubric, sample one more trajectory $o_G$ on-policy and proceed with standard GRPO.
  3. Else, generate a refined trajectory $o_{\text{refine}}$ via rubric-guided refinement, compute its reward, and update via the mixed-policy objective of Section 2.3.
  4. Update model parameters via gradient ascent on the resulting objective.
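
The rollout-collection part of these steps can be sketched as a single function with the model, judge, and refiner passed in as callables. The signatures of `sample`, `score`, and `refine` are assumptions made for this illustration, and the parameter update itself is omitted because the mixed-policy objective is not reproduced here.

```python
def collect_group(prompt, sample, score, refine, group_size=8):
    """Build one group of rollouts for a prompt.

    Assumed callables:
      sample(prompt) -> answer
      score(prompt, answer) -> (reward, list_of_failed_criteria)
      refine(prompt, answer, failed) -> refined answer
    Returns (answers, rewards, off_policy_flags).
    """
    # Step 1: sample G-1 on-policy answers and score them.
    answers = [sample(prompt) for _ in range(group_size - 1)]
    scored = [score(prompt, a) for a in answers]
    best = max(range(len(answers)), key=lambda i: scored[i][0])
    _, failed = scored[best]

    if not failed:
        # Step 2: the best answer passes, so the last slot stays on-policy.
        extra, off_policy = sample(prompt), False
    else:
        # Step 3: refine the best answer using its failed criteria.
        extra, off_policy = refine(prompt, answers[best], failed), True

    answers.append(extra)
    rewards = [r for r, _ in scored] + [score(prompt, extra)[0]]
    flags = [False] * (group_size - 1) + [off_policy]
    return answers, rewards, flags
```

The returned flags let a downstream update treat the refined sample differently from the on-policy ones, which is where a mixed-policy objective would come in.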

5. Empirical Evaluation and Performance

RGR-GRPO has been comprehensively benchmarked across 14 multi-domain reasoning datasets spanning mathematics, physics, chemistry, and general tasks (Bi et al., 15 Nov 2025). Results, summarized in the following table, demonstrate robust outperformance over both verifiable-outcome RL baselines and alternative rubric-based or critique-based schemes.

Method          Math (↑)   Physics (↑)   Chemistry (↑)   General (↑)   Overall (↑)
Outcome-GRPO    61.2       47.0          47.0            52.3          52.3
Rubric-GRPO     65.3       47.9          47.9            53.7          53.7
Critique-GRPO   61.2       45.7          45.7            50.6          50.6
LUFFY           60.5       46.4          46.4            50.5          50.5
RGR-GRPO        67.9       48.8          48.8            55.8          55.8

RGR-GRPO delivers average improvements of +7.0% (math), +5.4% (physics), +8.4% (chemistry), and +6.6% (general tasks) over the Outcome-GRPO baseline.

6. Dynamics, Ablations, and Theoretical Properties

Exploration and Entropy

RGR-GRPO maintains stable entropy levels in [0.2, 0.4] throughout training, avoiding the entropy collapse characteristic of pure on-policy methods (which fall to ≈0.1). This signifies sustained and diversified exploration.
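
For reference, policy entropy of this kind is commonly computed as the Shannon entropy of the next-token distribution, averaged over generation steps. The sketch below assumes natural-log units (nats) and per-step averaging; the source does not state which convention underlies the reported values.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_policy_entropy(step_distributions):
    """Average entropy over a sequence of per-step token distributions."""
    return sum(entropy(p) for p in step_distributions) / len(step_distributions)

if __name__ == "__main__":
    peaked = [0.97, 0.01, 0.01, 0.01]    # near-deterministic policy
    spread = [0.25, 0.25, 0.25, 0.25]    # maximally uncertain over 4 tokens
    print(entropy(peaked), entropy(spread))
    print(mean_policy_entropy([peaked, spread]))
```

A policy whose distributions all resemble `peaked` has entropy close to zero, which is the collapse regime the paragraph above describes.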

Ablation Findings

  • Removing Exploration Assessment (EA) destabilizes training and lowers test accuracy by 1.0 points.
  • Omitting the off-policy shaping term reduces performance by 1.3 points.
  • Restricting the rubric to factual criteria alone causes a 0.6 point drop.

All rubric-driven components are critical for realizing the full multi-domain improvements.

Robustness and Generality

The framework maintains superior pass@$k$ extrapolation (i.e., high-quality diverse outputs as $k$ grows). Gains persist across all 14 tested datasets, including out-of-distribution scientific and general reasoning domains.
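
Pass@k is often estimated with the unbiased combinatorial estimator popularized in code-generation evaluation: given n samples of which c are correct, pass@k = 1 − C(n−c, k) / C(n, k). The sketch below implements that estimator; whether the paper uses exactly this estimator is an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn without replacement
    from n generated samples (c of them correct) is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

if __name__ == "__main__":
    for k in (1, 5, 10):
        print(k, pass_at_k(10, 3, k))
```

For a fixed set of samples the estimate is non-decreasing in k, so comparing methods at larger k probes how much correct but lower-probability behavior the policy still retains.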

7. Context and Comparative Significance

RGR-GRPO generalizes group-based policy optimization to settings lacking reliable verifiers, combining dense, interpretable, and domain-adaptive reward shaping with hybrid on-/off-policy exploration. In the reported experiments it improves over baselines such as Rubric-GRPO (rubric reward used with on-policy-only GRPO), critique augmentation, and alternative hybrid RL schemes. The method does not rely on reward model calibration or noise correction as in noise-corrected GRPO variants (Mansouri et al., 21 Oct 2025), but introduces algorithmic machinery specifically tailored to using rubrics as both reward and guidance. A plausible implication is that as LLM benchmarks grow in their demand for cross-domain and higher-order reasoning, rubric-guided frameworks such as RGR-GRPO may become a common choice for exploration-aware training (Bi et al., 15 Nov 2025).

References (2)