MD-GRPO: Group-Relative Policy Optimization

Updated 4 July 2026
  • MD-GRPO is a framework that normalizes rewards within candidate groups to tailor policy-gradient updates for various domain-specific tasks.
  • It enables dynamic candidate evaluation by comparing rewards in-context, as seen in applications like molecular design, medical report generation, and masked diffusion models.
  • The approach reduces gradient variance through techniques such as z-score normalization, centered rewards, and PPO-style clipping, adapting to heterogeneous reward scales.

MD-GRPO is a label applied in recent arXiv literature to several domain-specific uses of Group Relative Policy Optimization (GRPO): goal-directed molecular design in GRXForm, clinically aligned medical report generation in MRG-R1, and trajectory-level optimization of masked diffusion models in Co-GRPO. Across these settings, a policy generates a group of candidates for the same conditioning instance, rewards are compared within that group, and the resulting relative advantages are used for policy-gradient updates; the precise normalization rule, surrogate objective, and regularization differ by domain (Javaid et al., 12 Feb 2026, Wang et al., 18 Dec 2025, Zhou et al., 25 Dec 2025).

1. Core formulation

The common GRPO mechanism is a within-context normalization of rewards. In the molecular-design formulation, one samples $B$ starting scaffolds $\{S_i\}_{i=1}^B$, generates $G$ completions $\{G_{i,1},\dots,G_{i,G}\}$ for each scaffold, computes rewards $r_{i,j}=R(G_{i,j})$, and forms a group mean

$$\mu_i = \frac{1}{G}\sum_{j=1}^G r_{i,j},$$

with an optional group standard deviation

$$\sigma_i = \sqrt{\frac{1}{G}\sum_{j=1}^G (r_{i,j}-\mu_i)^2}.$$

The general group-normalized reward is

$$\hat r_{i,j}=\frac{r_{i,j}-\mu_i}{\sigma_i},$$

whereas the Dr. GRPO variant used in GRXForm drops the $\sigma_i$ denominator and uses centered rewards

$$A_{i,j}=r_{i,j}-\mu_i.$$

The gradient estimator then becomes

{Si}i=1B\{S_i\}_{i=1}^B0
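The two normalization rules defined in this section (z-score and centered rewards) can be illustrated with a minimal sketch. The function names, the toy reward values, and the small `eps` guard against division by zero are assumptions made for illustration; they are not taken from GRXForm or any other cited paper.

```python
import math


def group_stats(rewards):
    """Return the mean and population standard deviation of one group."""
    g = len(rewards)
    mu = sum(rewards) / g
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / g)
    return mu, sigma


def centered_advantages(rewards):
    """Centered rewards A_ij = r_ij - mu_i (no division by sigma_i)."""
    mu, _ = group_stats(rewards)
    return [r - mu for r in rewards]


def zscore_advantages(rewards, eps=1e-8):
    """Group-normalized rewards (r_ij - mu_i) / sigma_i.

    eps is an assumption added here to avoid dividing by zero.
    """
    mu, sigma = group_stats(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy rewards: rewards[i][j] is completion j of scaffold i.
# The second group is the first one scaled by 8.
rewards = [[0.25, 0.5, 0.75, 0.5], [2.0, 4.0, 6.0, 4.0]]
centered = [centered_advantages(group) for group in rewards]
zscored = [zscore_advantages(group) for group in rewards]

print(centered)
print(zscored)
```

In this toy case the z-scored values of the two groups coincide (up to `eps`), because z-scoring removes each group's reward scale, whereas the centered values keep it.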

In MRG-R1, the GRPO objective is a PPO-like clipped surrogate with a KL penalty toward a frozen reference policy: {Si}i=1B\{S_i\}_{i=1}^B1 where

{Si}i=1B\{S_i\}_{i=1}^B2

In Co-GRPO for masked diffusion models, the trajectory-level clipped surrogate is extended to a joint policy over denoising actions and schedule actions: {Si}i=1B\{S_i\}_{i=1}^B3
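A generic PPO-style clipped surrogate with a KL penalty, of the kind described above for MRG-R1, might be sketched as follows. This is not the exact objective of either paper: it works on sequence-level log-probabilities, the values of `clip_eps` and `beta` are placeholders, and the non-negative per-sample KL estimator is an assumed choice.

```python
import math


def clipped_term(logp_new, logp_old, advantage, clip_eps):
    """PPO-style clipped term for one sampled candidate."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_estimate(logp_new, logp_ref):
    """Non-negative per-sample estimate of KL(new || ref); an assumed choice."""
    d = logp_ref - logp_new
    return math.exp(d) - d - 1.0


def grpo_objective(logp_new, logp_old, logp_ref, advantages,
                   clip_eps=0.2, beta=0.01):
    """Average of clipped terms minus beta times the KL estimate.

    clip_eps and beta are hypothetical values, not reported settings.
    """
    terms = [
        clipped_term(n, o, a, clip_eps) - beta * kl_estimate(n, r)
        for n, o, r, a in zip(logp_new, logp_old, logp_ref, advantages)
    ]
    return sum(terms) / len(terms)


# Toy group of three candidates with group-relative advantages.
value = grpo_objective(
    logp_new=[-1.0, -2.0, -3.0],
    logp_old=[-1.2, -2.0, -2.5],
    logp_ref=[-1.1, -2.1, -2.9],
    advantages=[1.0, 0.0, -1.0],
)
print(value)
```

The clip bounds how much a single update can profit from moving the probability ratio away from 1, and the KL term penalizes drift from the frozen reference policy.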

These formulations share a single structural idea: candidates are compared against other candidates generated for the same scaffold, study, or prompt rather than against a single batch-global baseline. This suggests a common emphasis on conditioning-specific credit assignment under heterogeneous reward scales (Javaid et al., 12 Feb 2026, Wang et al., 18 Dec 2025, Zhou et al., 25 Dec 2025).

2. Goal-directed molecular design

In "Amortized Molecular Optimization via Group Relative Policy Optimization" (Javaid et al., 12 Feb 2026), goal-directed molecular optimization under structural constraints is stated as learning, for any given starting molecular subgraph {Si}i=1B\{S_i\}_{i=1}^B4, a policy

{Si}i=1B\{S_i\}_{i=1}^B5

that sequentially elaborates {Si}i=1B\{S_i\}_{i=1}^B6 into a full molecule {Si}i=1B\{S_i\}_{i=1}^B7 in order to maximize an oracle reward

{Si}i=1B\{S_i\}_{i=1}^B8

The paper contrasts this with instance-optimizers such as genetic algorithms and discrete diffusion with fragment-remasking, which treat each {Si}i=1B\{S_i\}_{i=1}^B9 pair as a fresh combinatorial search problem and require thousands of expensive oracle calls per input.

GRXForm parametrizes GG0 as a decoder-only Graph Transformer in a step-wise MDP. The state GG1 is a partial molecular graph with node set GG2 and edge set GG3. Its hierarchical action space comprises: operation selection among “Stop,” “Add Atom,” or “Modify Existing Atom”; target selection of an existing atom GG4; and bond specification with bond order GG5. Valence masking GG6 masks invalid actions that would exceed an atom’s valence. Input embeddings combine atom type, current degree, and dynamic action-state flags; a virtual super-node connects to all atoms; multi-head self-attention uses ReZero normalization with bond-order attention biases GG7; and separate MLP heads emit logits for each action level. The model is pre-trained on ChEMBL by supervised teacher-forcing on ground-truth atom-by-atom trajectories.
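Valence masking of the kind described above can be illustrated by setting the logits of invalid actions to negative infinity before the softmax, so that they receive zero probability. The sketch below is a simplification for a single bond-order choice; the function names and the bond-order set `(1, 2, 3)` are assumptions, not details taken from GRXForm.

```python
import math


def bond_order_mask(remaining_valence, bond_orders=(1, 2, 3)):
    """True where a bond of that order fits in the atom's remaining valence.

    The bond-order set is an assumption for illustration.
    """
    return [order <= remaining_valence for order in bond_orders]


def masked_softmax(logits, valid):
    """Softmax over logits with invalid actions forced to probability 0."""
    masked = [l if ok else -math.inf for l, ok in zip(logits, valid)]
    top = max(masked)
    if top == -math.inf:
        raise ValueError("no valid action available")
    exps = [math.exp(x - top) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# An atom with two free valence slots cannot take a triple bond,
# even though the (toy) logit for the triple bond is the largest.
probs = masked_softmax([0.0, 0.0, 5.0], bond_order_mask(2))
print(probs)
```

Masking before normalization keeps the distribution properly normalized over the valid actions only, so sampled trajectories never violate the valence constraint.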

Fine-tuning uses batch of scaffolds GG8, group size GG9 completions per scaffold, beam width {Gi,1,,Gi,G}\{G_{i,1},\dots,G_{i,G}\}0, learning rate {Gi,1,,Gi,G}\{G_{i,1},\dots,G_{i,G}\}1 with Adam, weight decay {Gi,1,,Gi,G}\{G_{i,1},\dots,G_{i,G}\}2, gradient clipping norm {Gi,1,,Gi,G}\{G_{i,1},\dots,G_{i,G}\}3, maximum fine-tuning epochs {Gi,1,,Gi,G}\{G_{i,1},\dots,G_{i,G}\}4, no entropy regularization in the scaffold-conditioned setting {Gi,1,,Gi,G}\{G_{i,1},\dots,G_{i,G}\}5, and a maximum of 50 atoms per molecule. The optimization loop samples scaffolds, generates completions via stochastic beam search, computes rewards, centers them by scaffold-specific means, accumulates the policy gradient, and updates {Gi,1,,Gi,G}\{G_{i,1},\dots,G_{i,G}\}6.

Empirically, the paper reports three settings. In kinase scaffold decoration with 500 held-out Murcko scaffolds cluster-split by Tanimoto {Gi,1,,Gi,G}\{G_{i,1},\dots,G_{i,G}\}7, using a 4-component MPO reward over GSK3{Gi,1,,Gi,G}\{G_{i,1},\dots,G_{i,G}\}8, JNK3, QED, and SA', GRXForm-GRPO reaches an objective score of {Gi,1,,Gi,G}\{G_{i,1},\dots,G_{i,G}\}9 and a strict success rate of ri,j=R(Gi,j)r_{i,j}=R(G_{i,j})0, compared with ri,j=R(Gi,j)r_{i,j}=R(G_{i,j})1 and ri,j=R(Gi,j)r_{i,j}=R(G_{i,j})2 for GRXForm-REINFORCE, ri,j=R(Gi,j)r_{i,j}=R(G_{i,j})3 and ri,j=R(Gi,j)r_{i,j}=R(G_{i,j})4 for GRXForm-DeNovo, and scores ri,j=R(Gi,j)r_{i,j}=R(G_{i,j})5 with ri,j=R(Gi,j)r_{i,j}=R(G_{i,j})6 success for Mol GA and GenMol. In prodrug transfer, fine-tuning on 4 parent drugs and testing on 5 unseen drugs, mean scores are 8.65 for GRXForm-REINFORCE and 10.69 for GRXForm-GRPO. On the PMO benchmark with 10 k oracle calls, GRXForm uses standard REINFORCE rather than grouping because all starts are empty, and reports aggregate sum AUC ri,j=R(Gi,j)r_{i,j}=R(G_{i,j})7, second overall.

The paper attributes the gain to heterogeneous task difficulty across starting scaffolds: some scaffolds admit easy high-reward elaborations, while others are chemically constrained. Centering rewards by scaffold-specific group means makes completions compete only among themselves, and Figure 1 is reported to show a mean advantage signal near zero with low variance under GRPO, in contrast to large swings under a global baseline.
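The contrast between a global baseline and per-scaffold group means can be made concrete with a toy example. The reward values and scaffold names below are made up for illustration and do not come from the paper.

```python
def mean(xs):
    return sum(xs) / len(xs)


# Toy rewards for two scaffolds of different difficulty (made-up numbers).
rewards = {
    "easy_scaffold": [0.8, 0.9, 1.0],
    "hard_scaffold": [0.0, 0.1, 0.2],
}

# Global baseline: one mean over every completion in the batch.
global_baseline = mean([r for group in rewards.values() for r in group])
global_adv = {
    name: [r - global_baseline for r in group]
    for name, group in rewards.items()
}

# Group-relative baseline: one mean per scaffold.
group_adv = {
    name: [r - mean(group) for r in group]
    for name, group in rewards.items()
}

print(global_adv)
print(group_adv)
```

Under the global baseline every completion of the hard scaffold gets a negative advantage, including its best one, so the signal mostly encodes scaffold difficulty. Under per-scaffold centering the best completion of each scaffold gets a positive advantage and each group's advantages sum to approximately zero.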

3. Clinically aligned medical report generation

In "MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation" (Wang et al., 18 Dec 2025), MD-GRPO is the application of GRPO to medical report generation with a Med-LVLM policy ri,j=R(Gi,j)r_{i,j}=R(G_{i,j})8 that, given an input study ri,j=R(Gi,j)r_{i,j}=R(G_{i,j})9, autoregressively generates a report μi=1Gj=1Gri,j,\mu_i = \frac{1}{G}\sum_{j=1}^G r_{i,j},0. For each study, the method samples a group of μi=1Gj=1Gri,j,\mu_i = \frac{1}{G}\sum_{j=1}^G r_{i,j},1 candidate reports under the old policy, computes a report-level reward, normalizes rewards within the group, and applies a clipped policy-gradient update under a KL penalty toward a frozen reference policy.

The principal clinical reward is Margin-based Cosine Similarity (MCCS), derived from CheXbert’s 14-label chest-X-ray extraction. Multi-class labels are mapped into signed scalars

μi=1Gj=1Gri,j,\mu_i = \frac{1}{G}\sum_{j=1}^G r_{i,j},2

the “No Finding” dimension is discarded, and 13-dimensional signed label vectors μi=1Gj=1Gri,j,\mu_i = \frac{1}{G}\sum_{j=1}^G r_{i,j},3 are compared using

μi=1Gj=1Gri,j,\mu_i = \frac{1}{G}\sum_{j=1}^G r_{i,j},4

A margin μi=1Gj=1Gri,j,\mu_i = \frac{1}{G}\sum_{j=1}^G r_{i,j},5 produces

μi=1Gj=1Gri,j,\mu_i = \frac{1}{G}\sum_{j=1}^G r_{i,j},6

The total reward is

μi=1Gj=1Gri,j,\mu_i = \frac{1}{G}\sum_{j=1}^G r_{i,j},7

with μi=1Gj=1Gri,j,\mu_i = \frac{1}{G}\sum_{j=1}^G r_{i,j},8 and μi=1Gj=1Gri,j,\mu_i = \frac{1}{G}\sum_{j=1}^G r_{i,j},9. The format term lightly rewards compliance with the prescribed "> … → <report>…</report>" structure.
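The cosine-similarity core of this reward can be sketched on signed label vectors. The sketch covers only that step: the margin transform and the reward weights are not reproduced, and the particular label-to-sign mapping in `SIGN`, the zero-vector convention, and the three-observation toy example are assumptions for illustration rather than the paper's definitions.

```python
import math

# Hypothetical mapping from extracted label classes to signed scalars.
SIGN = {"positive": 1.0, "negative": -1.0, "uncertain": 0.0, "blank": 0.0}


def signed_vector(labels, observations):
    """Signed label vector with the "No Finding" dimension discarded."""
    return [SIGN[labels[o]] for o in observations if o != "No Finding"]


def cosine(u, v):
    """Cosine similarity; returns 0.0 for a zero vector (assumed convention)."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


# Toy example with three observations instead of the full label set.
observations = ["Cardiomegaly", "Edema", "Pneumothorax", "No Finding"]
reference = {"Cardiomegaly": "positive", "Edema": "negative",
             "Pneumothorax": "blank", "No Finding": "negative"}
flipped = dict(reference, Cardiomegaly="negative")

ref_vec = signed_vector(reference, observations)
same_score = cosine(ref_vec, signed_vector(reference, observations))
flip_score = cosine(ref_vec, signed_vector(flipped, observations))
print(same_score, flip_score)
```

Because the vectors are signed, a report that asserts the wrong polarity for a finding is scored lower than one that agrees with the reference, which is the polarity-consistency behaviour the reward is meant to capture.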

The base model is HuatuoGPT-Vision-7B-Qwen2.5VL, combining a ViT backbone with object-attention heads and a Qwen2.5 decoder. Fine-tuning uses LoRA adapters with rank σi=1Gj=1G(ri,jμi)2.\sigma_i = \sqrt{\frac{1}{G}\sum_{j=1}^G (r_{i,j}-\mu_i)^2}.0, σi=1Gj=1G(ri,jμi)2.\sigma_i = \sqrt{\frac{1}{G}\sum_{j=1}^G (r_{i,j}-\mu_i)^2}.1, and dropout σi=1Gj=1G(ri,jμi)2.\sigma_i = \sqrt{\frac{1}{G}\sum_{j=1}^G (r_{i,j}-\mu_i)^2}.2 on attention and MLP projections; the bulk of parameters remain frozen, and in SRL the LM head’s parameters are frozen to avoid catastrophic drift. Training uses 8-bit AdamW, learning rate σi=1Gj=1G(ri,jμi)2.\sigma_i = \sqrt{\frac{1}{G}\sum_{j=1}^G (r_{i,j}-\mu_i)^2}.3, σi=1Gj=1G(ri,jμi)2.\sigma_i = \sqrt{\frac{1}{G}\sum_{j=1}^G (r_{i,j}-\mu_i)^2}.4, σi=1Gj=1G(ri,jμi)2.\sigma_i = \sqrt{\frac{1}{G}\sum_{j=1}^G (r_{i,j}-\mu_i)^2}.5, weight decay σi=1Gj=1G(ri,jμi)2.\sigma_i = \sqrt{\frac{1}{G}\sum_{j=1}^G (r_{i,j}-\mu_i)^2}.6, cosine decay with 10% warm-up, gradient clipping max_norm σi=1Gj=1G(ri,jμi)2.\sigma_i = \sqrt{\frac{1}{G}\sum_{j=1}^G (r_{i,j}-\mu_i)^2}.7, effective batch size σi=1Gj=1G(ri,jμi)2.\sigma_i = \sqrt{\frac{1}{G}\sum_{j=1}^G (r_{i,j}-\mu_i)^2}.8, clipping threshold σi=1Gj=1G(ri,jμi)2.\sigma_i = \sqrt{\frac{1}{G}\sum_{j=1}^G (r_{i,j}-\mu_i)^2}.9, KL penalty r^i,j=ri,jμiσi,\hat r_{i,j}=\frac{r_{i,j}-\mu_i}{\sigma_i},0, 1 epoch of supervised warm-up, and 1 epoch of GRPO.

The primary metric is CheXbert-based clinical efficacy over the 14 standard chest X-ray observations. MRG-R1 reports CE-F1 r^i,j=ri,jμiσi,\hat r_{i,j}=\frac{r_{i,j}-\mu_i}{\sigma_i},1 on IU X-Ray and r^i,j=ri,jμiσi,\hat r_{i,j}=\frac{r_{i,j}-\mu_i}{\sigma_i},2 on MIMIC-CXR, compared with r^i,j=ri,jμiσi,\hat r_{i,j}=\frac{r_{i,j}-\mu_i}{\sigma_i},3 and r^i,j=ri,jμiσi,\hat r_{i,j}=\frac{r_{i,j}-\mu_i}{\sigma_i},4 for R2GenCMN and r^i,j=ri,jμiσi,\hat r_{i,j}=\frac{r_{i,j}-\mu_i}{\sigma_i},5 and r^i,j=ri,jμiσi,\hat r_{i,j}=\frac{r_{i,j}-\mu_i}{\sigma_i},6 for CheXagent. Ablations report CE-F1 r^i,j=ri,jμiσi,\hat r_{i,j}=\frac{r_{i,j}-\mu_i}{\sigma_i},7 for NLG rewards, r^i,j=ri,jμiσi,\hat r_{i,j}=\frac{r_{i,j}-\mu_i}{\sigma_i},8 for CE-F1 only, r^i,j=ri,jμiσi,\hat r_{i,j}=\frac{r_{i,j}-\mu_i}{\sigma_i},9 for MCCS only, and σi\sigma_i0 for MCCS plus format. The discussion attributes the improvement to direct optimization of clinical content agreement, polarity consistency, and semantic completeness rather than token overlap.

4. Masked diffusion and joint schedule optimization

In "Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model" (Zhou et al., 25 Dec 2025), MD-GRPO refers to a GRPO-based formulation of masked diffusion model generation as a finite-horizon MDP σi\sigma_i1 with horizon σi\sigma_i2. The state is

σi\sigma_i3

where σi\sigma_i4 is the current discrete token canvas and σi\sigma_i5 is the text prompt. The action in Co-GRPO is

σi\sigma_i6

where σi\sigma_i7 collects sampling temperature σi\sigma_i8, guidance scale σi\sigma_i9, re-mask temperature Ai,j=ri,jμi.A_{i,j}=r_{i,j}-\mu_i.0, and re-mask ratio Ai,j=ri,jμi.A_{i,j}=r_{i,j}-\mu_i.1. Transitions are deterministic given the chosen next canvas and prompt, and the reward is zero at intermediate steps and issued only at the final step: Ai,j=ri,jμi.A_{i,j}=r_{i,j}-\mu_i.2

The joint policy is

Ai,j=ri,jμi.A_{i,j}=r_{i,j}-\mu_i.3

so that both the denoiser parameters Ai,j=ri,jμi.A_{i,j}=r_{i,j}-\mu_i.4 and schedule parameters Ai,j=ri,jμi.A_{i,j}=r_{i,j}-\mu_i.5 are optimized under a shared scalar reward. In practice, Ai,j=ri,jμi.A_{i,j}=r_{i,j}-\mu_i.6 is modeled as a small Gaussian whose mean Ai,j=ri,jμi.A_{i,j}=r_{i,j}-\mu_i.7 predicts each continuous schedule component. The shared reward can be written as a human-preference model score minus a small cost for overly aggressive schedules,

Ai,j=ri,jμi.A_{i,j}=r_{i,j}-\mu_i.8

A central technical property is that the method avoids back-propagation through the Ai,j=ri,jμi.A_{i,j}=r_{i,j}-\mu_i.9-step generation process. By using the likelihood-ratio trick, it only requires evaluation of {Si}i=1B\{S_i\}_{i=1}^B00 and collection of the terminal reward. The paper presents this as a memory and compute saving relative to differentiating through all intermediate activations of the denoising network.
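The likelihood-ratio idea can be shown on a deliberately small case: a single scalar schedule parameter drawn from a Gaussian policy, with a black-box terminal reward that is only evaluated, never differentiated. This covers only a schedule-like component, not the joint policy of Co-GRPO, and the reward function, parameter values, and function names are all assumptions for illustration.

```python
import random


def score_function_grad(actions, rewards, mu, sigma):
    """Estimate d/d(mu) of the expected reward for a ~ Normal(mu, sigma^2).

    Uses grad_mu log N(a; mu, sigma) = (a - mu) / sigma^2 together with
    group-centered rewards as advantages.
    """
    baseline = sum(rewards) / len(rewards)
    terms = [
        (r - baseline) * (a - mu) / sigma ** 2
        for a, r in zip(actions, rewards)
    ]
    return sum(terms) / len(terms)


def terminal_reward(action):
    """Black-box stand-in for a reward model; it is never differentiated."""
    return -(action - 2.0) ** 2


rng = random.Random(0)
mu, sigma, group_size = 0.0, 0.5, 64

# Sample a group of schedule actions, score each one only at the end.
actions = [rng.gauss(mu, sigma) for _ in range(group_size)]
rewards = [terminal_reward(a) for a in actions]
grad_mu = score_function_grad(actions, rewards, mu, sigma)
print(grad_mu)
```

The estimator needs only the sampled actions, their log-probability gradients, and the terminal rewards, which is why no activations from the generation process have to be stored for back-propagation.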

The empirical evaluation fine-tunes a 1B-parameter Meissonic MDM with an approximately 9M-parameter scheduling network for 48 steps. Reported scores are: ImageReward, baseline {Si}i=1B\{S_i\}_{i=1}^B01 and Co-GRPO {Si}i=1B\{S_i\}_{i=1}^B02; HPSv2, baseline {Si}i=1B\{S_i\}_{i=1}^B03 and Co-GRPO {Si}i=1B\{S_i\}_{i=1}^B04; GenEval, baseline {Si}i=1B\{S_i\}_{i=1}^B05 and Co-GRPO {Si}i=1B\{S_i\}_{i=1}^B06; and DPG-Bench, baseline {Si}i=1B\{S_i\}_{i=1}^B07 and Co-GRPO {Si}i=1B\{S_i\}_{i=1}^B08. The paper emphasizes that the schedule network adds less than 1% extra parameters.

5. Variance control and stabilization

The supplied papers motivate MD-GRPO largely through variance control, but they do so in different regimes. In molecular design, the stated failure mode is high variance from the heterogeneous difficulty of distinct starting structures: a global baseline causes gradients to be dominated by “easy” tasks, whereas per-scaffold group means ensure that even if all rewards in a group are low, the best completion still receives positive advantage. In MRG-R1, group-relative advantages are presented as lower-variance than vanilla REINFORCE or actor-critic and as eliminating the need for a learned value network. In Co-GRPO, the group-relative objective is combined with PPO-style clipping and KL regularization while jointly optimizing both model and schedule parameters (Javaid et al., 12 Feb 2026, Wang et al., 18 Dec 2025, Zhou et al., 25 Dec 2025).

A closely related but distinct development is "MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following" (Salmani-Zarchi et al., 4 Jun 2026). That paper identifies three pathologies of z-score group normalization under discrete, low-dispersion rewards: low-variance amplification, mean-centering blindness, and zero-variance collapse. Its remedies are multi-temperature sampling, dual-anchor advantages, prospect-theoretic shaping, and asymmetric KL regularization. The mixed advantage is

{Si}i=1B\{S_i\}_{i=1}^B09

where {Si}i=1B\{S_i\}_{i=1}^B10 is the group-relative anchor, {Si}i=1B\{S_i\}_{i=1}^B11 is a goal-aware anchor, and both can be shaped by a bounded loss-averse transform. Reported gains include up to {Si}i=1B\{S_i\}_{i=1}^B12 percentage points in Hard Success Rate on FollowBench, IFEval, and a curated multi-constraint dataset, with general-capability benchmarks remaining unchanged within {Si}i=1B\{S_i\}_{i=1}^B13 percentage points.

The juxtaposition is significant because it shows that group-relative normalization is not uniformly stable across reward regimes. In heterogeneous continuous-reward settings it is used to reduce variance; in discrete low-dispersion settings, additional mechanisms may be required to avoid vanishing or distorted learning signals.
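Two of the pathologies named above, zero-variance collapse and low-variance amplification, can be reproduced with plain z-score normalization on toy reward groups. The reward values are made up, and returning zeros for a group with no spread is a convention assumed here; the sketch does not implement any of the MDP-GRPO remedies.

```python
import math


def zscore_advantages(rewards):
    """Z-score within one group; returns zeros when the group has no spread."""
    mu = sum(rewards) / len(rewards)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    if sigma == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Zero-variance collapse: identical rewards carry no learning signal.
collapsed = zscore_advantages([1.0, 1.0, 1.0, 1.0])

# Low-variance amplification: a 0.01 reward gap and a 1.0 reward gap
# produce essentially the same normalized advantage.
tiny_gap = zscore_advantages([0.90, 0.90, 0.90, 0.91])
large_gap = zscore_advantages([0.0, 0.0, 0.0, 1.0])

print(collapsed)
print(tiny_gap[-1], large_gap[-1])
```

Since z-scoring is scale-invariant, a negligible difference between candidates is weighted as heavily as a decisive one, which is benign for well-spread continuous rewards but can distort updates when rewards are discrete and nearly tied.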

6. Terminological scope and comparative profile

The term “MD-GRPO” is not used in a single uniform sense across the supplied literature. In GRXForm it denotes GRPO for amortized molecular optimization; in MRG-R1 it denotes GRPO for medical report generation; and in Co-GRPO it is used for masked diffusion models. By contrast, MDP-GRPO is a separate stabilized variant for multi-constraint instruction following rather than another use of the same label (Javaid et al., 12 Feb 2026, Wang et al., 18 Dec 2025, Zhou et al., 25 Dec 2025, Salmani-Zarchi et al., 4 Jun 2026).

The algorithmic differences are substantial. GRXForm uses a decoder-only Graph Transformer with chemically valid action masking and, in the reported scaffold-conditioned setting, centered rewards {Si}i=1B\{S_i\}_{i=1}^B14 without entropy regularization. MRG-R1 uses a Med-LVLM fine-tuned with LoRA, a clipped surrogate, KL regularization toward a frozen reference policy, and an MCCS-based clinical reward augmented by a lightweight format reward. Co-GRPO treats masked diffusion inference itself as an MDP and jointly optimizes denoiser parameters and schedule parameters. A plausible implication is that “MD-GRPO” functions more as a family resemblance around group-relative policy optimization than as a single canonical algorithmic specification.

| Setting | Mechanism | Reported outcome |
| --- | --- | --- |
| GRXForm molecular optimization | Per-scaffold centered rewards in scaffold-conditioned fine-tuning | Obj. Score {Si}i=1B\{S_i\}_{i=1}^B15, Success Rate {Si}i=1B\{S_i\}_{i=1}^B16 |
| MRG-R1 medical report generation | Clipped GRPO with MCCS reward and format reward | CE-F1 {Si}i=1B\{S_i\}_{i=1}^B17 on IU X-Ray, {Si}i=1B\{S_i\}_{i=1}^B18 on MIMIC-CXR |
| Co-GRPO masked diffusion | Joint {Si}i=1B\{S_i\}_{i=1}^B19 optimization of model and schedule | ImageReward {Si}i=1B\{S_i\}_{i=1}^B20, HPSv2 {Si}i=1B\{S_i\}_{i=1}^B21, GenEval {Si}i=1B\{S_i\}_{i=1}^B22, DPG-Bench {Si}i=1B\{S_i\}_{i=1}^B23 |

One recurrent misconception is that GRPO implies a fixed normalization rule. The supplied literature shows otherwise: some formulations use z-score normalization, some use centered rewards only, and some add PPO-style clipping and KL penalties. Another is that grouping is always required. In the PMO de-novo benchmark, GRXForm uses standard REINFORCE rather than grouping because all starts are empty. The unifying point is therefore not a single implementation detail, but the use of within-context relative rewards to shape policy updates under trajectory-level objectives.
