MD-GRPO: Group-Relative Policy Optimization
- MD-GRPO is a framework that normalizes rewards within candidate groups to tailor policy-gradient updates for various domain-specific tasks.
- It enables dynamic candidate evaluation by comparing rewards in-context, as seen in applications like molecular design, medical report generation, and masked diffusion models.
- The approach reduces gradient variance through techniques such as z-score normalization, centered rewards, and PPO-style clipping, adapting to heterogeneous reward scales.
MD-GRPO is a label applied in recent arXiv literature to several domain-specific uses of Group Relative Policy Optimization (GRPO): goal-directed molecular design in GRXForm, clinically aligned medical report generation in MRG-R1, and trajectory-level optimization of masked diffusion models in Co-GRPO. Across these settings, a policy generates a group of candidates for the same conditioning instance, rewards are compared within that group, and the resulting relative advantages are used for policy-gradient updates; the precise normalization rule, surrogate objective, and regularization differ by domain (Javaid et al., 12 Feb 2026, Wang et al., 18 Dec 2025, Zhou et al., 25 Dec 2025).
1. Core formulation
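The common GRPO mechanism is a within-context normalization of rewards. In the molecular-design formulation, one samples $N$ starting scaffolds $s_1, \dots, s_N$, generates $G$ completions $m_{i,1}, \dots, m_{i,G}$ for each scaffold $s_i$, computes rewards $r_{i,j} = R(m_{i,j})$, and forms a group mean
$$\mu_i = \frac{1}{G} \sum_{j=1}^{G} r_{i,j},$$
with an optional group standard deviation
$$\sigma_i = \sqrt{\frac{1}{G} \sum_{j=1}^{G} \left(r_{i,j} - \mu_i\right)^2}.$$
The general group-normalized reward is
$$\hat{A}_{i,j} = \frac{r_{i,j} - \mu_i}{\sigma_i},$$
whereas the Dr. GRPO variant used in GRXForm drops the denominator and uses centered rewards
$$A_{i,j} = r_{i,j} - \mu_i.$$
The gradient estimator then becomes
$$\nabla_\theta J(\theta) \approx \frac{1}{NG} \sum_{i=1}^{N} \sum_{j=1}^{G} A_{i,j}\, \nabla_\theta \log \pi_\theta(m_{i,j} \mid s_i).$$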
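The two normalization rules described above (z-score normalization and centered-only rewards) can be sketched in a few lines. This is a minimal illustration, not code from any of the cited papers: the function name `group_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the example rewards are all assumptions made for the sketch.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, normalize=True, eps=1e-8):
    """Return one advantage per completion for a single group of rewards.

    normalize=True  -> z-score within the group (centered, then divided by std)
    normalize=False -> centered rewards only (the denominator is dropped)
    eps is an assumed stabilizer that avoids division by zero.
    """
    mu = fmean(rewards)
    centered = [r - mu for r in rewards]
    if not normalize:
        return centered
    sigma = pstdev(rewards)  # population std over the group (assumption)
    return [c / (sigma + eps) for c in centered]


# Hypothetical rewards for four completions of the same scaffold or prompt.
rewards = [0.2, 0.4, 0.9, 0.5]
centered = group_advantages(rewards, normalize=False)
zscored = group_advantages(rewards, normalize=True)
```

Both variants rank the completions identically; they differ only in scale, which is why the choice matters mainly when reward spreads vary strongly between groups.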
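In MRG-R1, the GRPO objective is a PPO-like clipped surrogate with a KL penalty toward a frozen reference policy:
$$J(\theta) = \mathbb{E}\left[\frac{1}{G} \sum_{i=1}^{G} \min\!\big(\rho_i A_i,\ \mathrm{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i\big)\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),$$
where
$$\rho_i = \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}$$
is the likelihood ratio of candidate $y_i$ for input $x$, $A_i$ is its group-normalized advantage, $\epsilon$ is the clipping threshold, and $\beta$ is the KL coefficient.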
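A clipped surrogate with a KL penalty of the kind described above can be sketched at the sequence level as follows. This is a hedged illustration under stated assumptions, not the MRG-R1 implementation: the function names, the default values `clip_eps=0.2` and `beta=0.01`, the sequence-level (rather than token-level) ratio, and the choice of the nonnegative "k3" KL estimator are all assumptions of this sketch.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO-style clipped objective over a group of sampled candidates."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # likelihood ratio new / old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)  # pessimistic of the two
    return total / len(advantages)


def k3_kl(logp_new, logp_ref):
    """Nonnegative per-sample KL estimate toward the reference (assumed choice)."""
    total = 0.0
    for lp_new, lp_ref in zip(logp_new, logp_ref):
        log_r = lp_ref - lp_new
        total += math.exp(log_r) - log_r - 1.0
    return total / len(logp_new)


def grpo_objective(logp_new, logp_old, logp_ref, advantages,
                   clip_eps=0.2, beta=0.01):
    """Clipped surrogate minus a KL penalty toward a frozen reference policy."""
    surrogate = clipped_surrogate(logp_new, logp_old, advantages, clip_eps)
    return surrogate - beta * k3_kl(logp_new, logp_ref)
```

With a positive advantage the clip caps how much a large ratio can be rewarded; with a negative advantage the unclipped term is kept, so an over-weighted bad sample is still penalized in full.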
In Co-GRPO for masked diffusion models, the trajectory-level clipped surrogate is extended to a joint policy over denoising actions and schedule actions: 3
These formulations share a single structural idea: candidates are compared against other candidates generated for the same scaffold, study, or prompt rather than against a single batch-global baseline. This suggests a common emphasis on conditioning-specific credit assignment under heterogeneous reward scales (Javaid et al., 12 Feb 2026, Wang et al., 18 Dec 2025, Zhou et al., 25 Dec 2025).
2. Goal-directed molecular design
In "Amortized Molecular Optimization via Group Relative Policy Optimization" (Javaid et al., 12 Feb 2026), goal-directed molecular optimization under structural constraints is stated as learning, for any given starting molecular subgraph $s$, a policy
$$\pi_\theta(m \mid s)$$
that sequentially elaborates $s$ into a full molecule $m$ in order to maximize an oracle reward
$$R(m).$$
The paper contrasts this with instance-optimizers such as genetic algorithms and discrete diffusion with fragment-remasking, which treat each 9 pair as a fresh combinatorial search problem and require thousands of expensive oracle calls per input.
GRXForm parametrizes 0 as a decoder-only Graph Transformer in a step-wise MDP. The state 1 is a partial molecular graph with node set 2 and edge set 3. Its hierarchical action space comprises: operation selection among “Stop,” “Add Atom,” or “Modify Existing Atom”; target selection of an existing atom 4; and bond specification with bond order 5. Valence masking 6 masks invalid actions that would exceed an atom’s valence. Input embeddings combine atom type, current degree, and dynamic action-state flags; a virtual super-node connects to all atoms; multi-head self-attention uses ReZero normalization with bond-order attention biases 7; and separate MLP heads emit logits for each action level. The model is pre-trained on ChEMBL by supervised teacher-forcing on ground-truth atom-by-atom trajectories.
Fine-tuning uses a batch of scaffolds 8, group size 9 completions per scaffold, beam width 0, learning rate 1 with Adam, weight decay 2, gradient clipping norm 3, maximum fine-tuning epochs 4, no entropy regularization in the scaffold-conditioned setting 5, and a maximum of 50 atoms per molecule. The optimization loop samples scaffolds, generates completions via stochastic beam search, computes rewards, centers them by scaffold-specific means, accumulates the policy gradient, and updates the policy parameters.
Empirically, the paper reports three settings. In kinase scaffold decoration with 500 held-out Murcko scaffolds cluster-split by Tanimoto 7, using a 4-component MPO reward over GSK3β, JNK3, QED, and SA', GRXForm-GRPO reaches an objective score of 9 and a strict success rate of 0, compared with 1 and 2 for GRXForm-REINFORCE, 3 and 4 for GRXForm-DeNovo, and scores 5 with 6 success for Mol GA and GenMol. In prodrug transfer, fine-tuning on 4 parent drugs and testing on 5 unseen drugs, mean scores are 8.65 for GRXForm-REINFORCE and 10.69 for GRXForm-GRPO. On the PMO benchmark with 10k oracle calls, GRXForm uses standard REINFORCE rather than grouping because all starts are empty, and reports aggregate sum AUC 7, second overall.
The paper attributes the gain to heterogeneous task difficulty across starting scaffolds: some scaffolds admit easy high-reward elaborations, while others are chemically constrained. Centering rewards by scaffold-specific group means makes completions compete only among themselves, and Figure 1 is reported to show a mean advantage signal near zero with low variance under GRPO, in contrast to large swings under a global baseline.
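The contrast between a global baseline and per-scaffold centering can be made concrete with a small numeric sketch. The scaffold names and reward values below are hypothetical and chosen only to mimic one "easy" and one chemically constrained scaffold; they are not taken from the paper.

```python
from statistics import fmean

# Hypothetical rewards for four completions per scaffold.
groups = {
    "easy_scaffold": [0.80, 0.90, 0.85, 0.95],
    "hard_scaffold": [0.05, 0.10, 0.20, 0.15],
}

# Global baseline: a single mean over every completion in the batch.
all_rewards = [r for rs in groups.values() for r in rs]
global_baseline = fmean(all_rewards)
global_adv = {
    name: [r - global_baseline for r in rs] for name, rs in groups.items()
}

# Group-relative baseline: each scaffold is centered by its own mean.
group_adv = {
    name: [r - fmean(rs) for r in rs] for name, rs in groups.items()
}

for name in groups:
    print(name, "global:", [round(a, 3) for a in global_adv[name]])
    print(name, "group: ", [round(a, 3) for a in group_adv[name]])
```

Under the global baseline every completion of the hard scaffold is pushed down and every completion of the easy scaffold is pushed up, regardless of relative quality. Under per-scaffold centering the best completion of the hard scaffold receives a positive advantage, so completions compete only within their own group.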
3. Clinically aligned medical report generation
In "MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation" (Wang et al., 18 Dec 2025), MD-GRPO is the application of GRPO to medical report generation with a Med-LVLM policy 8 that, given an input study 9, autoregressively generates a report 0. For each study, the method samples a group of 1 candidate reports under the old policy, computes a report-level reward, normalizes rewards within the group, and applies a clipped policy-gradient update under a KL penalty toward a frozen reference policy.
The principal clinical reward is Margin-based Cosine Similarity (MCCS), derived from CheXbert’s 14-label chest-X-ray extraction. Multi-class labels are mapped into signed scalars
2
the “No Finding” dimension is discarded, and 13-dimensional signed label vectors 3 are compared using
4
A margin 5 produces
6
The total reward is
7
with 8 and 9. The format term lightly rewards compliance with the prescribed "> … → <report>…</report>" structure.
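The cosine comparison of signed label vectors at the core of the clinical reward can be sketched as below. This covers only the similarity step; the margin transform and the weighting with the format term are deliberately omitted. The label-to-sign mapping `SIGN`, the function names, the zero-vector convention, and the three-label toy vectors (instead of 13 dimensions) are assumptions of this sketch, not values from MRG-R1.

```python
import math

# Hypothetical mapping from extracted label classes to signed scalars.
SIGN = {"positive": 1.0, "negative": -1.0, "uncertain": 0.0, "blank": 0.0}


def signed_vector(labels):
    """Map a list of label classes to a signed vector."""
    return [SIGN[label] for label in labels]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for reports with no signed findings
    return dot / (norm_u * norm_v)


reference = signed_vector(["positive", "negative", "blank"])
agreeing = signed_vector(["positive", "negative", "blank"])
flipped = signed_vector(["negative", "positive", "blank"])

same_score = cosine_similarity(reference, agreeing)
flip_score = cosine_similarity(reference, flipped)
```

Because the vectors are signed, a report that mentions the right findings with the wrong polarity scores lower than one that omits them, which is the polarity-consistency behavior a token-overlap metric does not capture.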
The base model is HuatuoGPT-Vision-7B-Qwen2.5VL, combining a ViT backbone with object-attention heads and a Qwen2.5 decoder. Fine-tuning uses LoRA adapters with rank 0, 1, and dropout 2 on attention and MLP projections; the bulk of parameters remain frozen, and in SRL the LM head’s parameters are frozen to avoid catastrophic drift. Training uses 8-bit AdamW, learning rate 3, 4, 5, weight decay 6, cosine decay with 10% warm-up, gradient clipping max_norm 7, effective batch size 8, clipping threshold 9, KL penalty 0, 1 epoch of supervised warm-up, and 1 epoch of GRPO.
The primary metric is CheXbert-based clinical efficacy over the 14 standard chest X-ray observations. MRG-R1 reports CE-F1 1 on IU X-Ray and 2 on MIMIC-CXR, compared with 3 and 4 for R2GenCMN and 5 and 6 for CheXagent. Ablations report CE-F1 7 for NLG rewards, 8 for CE-F1 only, 9 for MCCS only, and 0 for MCCS plus format. The discussion attributes the improvement to direct optimization of clinical content agreement, polarity consistency, and semantic completeness rather than token overlap.
4. Masked diffusion and joint schedule optimization
In "Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model" (Zhou et al., 25 Dec 2025), MD-GRPO refers to a GRPO-based formulation of masked diffusion model generation as a finite-horizon MDP $(\mathcal{S}, \mathcal{A}, P, r)$ with horizon $T$ and steps $t = 0, \dots, T-1$. The state is
$$s_t = (x_t, c),$$
where $x_t$ is the current discrete token canvas and $c$ is the text prompt. The action in Co-GRPO is
$$a_t = (x_{t+1}, \lambda_t),$$
where $\lambda_t$ collects sampling temperature $\tau_t$, guidance scale $w_t$, re-mask temperature $\tau^{\mathrm{rm}}_t$, and re-mask ratio $\gamma_t$. Transitions are deterministic given the chosen next canvas and prompt, and the reward is zero at intermediate steps and issued only at the final step:
$$r_t = \begin{cases} 0, & t < T-1, \\ R, & t = T-1. \end{cases}$$
The joint policy is
$$\pi_{\theta,\phi}(a_t \mid s_t) = \pi_\phi(\lambda_t \mid s_t)\, \pi_\theta(x_{t+1} \mid s_t, \lambda_t),$$
so that both the denoiser parameters $\theta$ and schedule parameters $\phi$ are optimized under a shared scalar reward. In practice, $\pi_\phi$ is modeled as a small Gaussian whose mean $\mu_\phi(s_t)$ predicts each continuous schedule component. The shared reward can be written as a human-preference model score minus a small cost for overly aggressive schedules,
$$R = R_{\mathrm{pref}}(x_T, c) - \eta\, C(\lambda_{0:T-1}).$$
A central technical property is that the method avoids back-propagation through the $T$-step generation process. By using the likelihood-ratio trick, it only requires evaluation of $\nabla_{\theta,\phi} \log \pi_{\theta,\phi}(a_t \mid s_t)$ and collection of the terminal reward. The paper presents this as a memory and compute saving relative to differentiating through all intermediate activations of the denoising network.
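The likelihood-ratio trick mentioned above can be illustrated on a one-dimensional Gaussian schedule policy, where the score with respect to the mean has the closed form (sample − mean) / variance. The sketch below is a toy under stated assumptions and not Co-GRPO itself: the black-box reward `terminal_reward`, its target value 2.0, the fixed `sigma`, the sample count, and all function names are hypothetical.

```python
import random
from statistics import fmean


def terminal_reward(lam):
    """Hypothetical black-box reward; only its value is used, never its gradient."""
    return -(lam - 2.0) ** 2


def score_function_grad_mu(samples, rewards, mu, sigma):
    """Estimate d/d(mu) of the expected reward for a Gaussian policy.

    Uses grad_mu log N(lam; mu, sigma^2) = (lam - mu) / sigma^2 and the
    group mean reward as a baseline (centered, group-relative advantages).
    """
    baseline = fmean(rewards)
    terms = [
        (r - baseline) * (lam - mu) / sigma ** 2
        for lam, r in zip(samples, rewards)
    ]
    return fmean(terms)


random.seed(0)
mu, sigma = 0.0, 0.5
samples = [random.gauss(mu, sigma) for _ in range(2000)]
rewards = [terminal_reward(lam) for lam in samples]
grad = score_function_grad_mu(samples, rewards, mu, sigma)
```

For this toy reward the exact gradient at `mu = 0` is 4, and the estimate lands near that value using reward evaluations alone. Nothing is differentiated through the process that produced the reward, which is the property the text attributes to the method.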
The empirical evaluation fine-tunes a 1B-parameter Meissonic MDM with an approximately 9M-parameter scheduling network for 48 steps. Reported scores are: ImageReward, baseline 01 and Co-GRPO 02; HPSv2, baseline 03 and Co-GRPO 04; GenEval, baseline 05 and Co-GRPO 06; and DPG-Bench, baseline 07 and Co-GRPO 08. The paper emphasizes that the schedule network adds less than 1% extra parameters.
5. Variance reduction, instability, and related stabilized variants
The supplied papers motivate MD-GRPO largely through variance control, but they do so in different regimes. In molecular design, the stated failure mode is high variance from the heterogeneous difficulty of distinct starting structures: a global baseline causes gradients to be dominated by “easy” tasks, whereas per-scaffold group means ensure that even if all rewards in a group are low, the best completion still receives positive advantage. In MRG-R1, group-relative advantages are presented as lower-variance than vanilla REINFORCE or actor-critic and as eliminating the need for a learned value network. In Co-GRPO, the group-relative objective is combined with PPO-style clipping and KL regularization while jointly optimizing both model and schedule parameters (Javaid et al., 12 Feb 2026, Wang et al., 18 Dec 2025, Zhou et al., 25 Dec 2025).
A closely related but distinct development is "MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following" (Salmani-Zarchi et al., 4 Jun 2026). That paper identifies three pathologies of z-score group normalization under discrete, low-dispersion rewards: low-variance amplification, mean-centering blindness, and zero-variance collapse. Its remedies are multi-temperature sampling, dual-anchor advantages, prospect-theoretic shaping, and asymmetric KL regularization. The mixed advantage is
09
where 10 is the group-relative anchor, 11 is a goal-aware anchor, and both can be shaped by a bounded loss-averse transform. Reported gains include up to 12 percentage points in Hard Success Rate on FollowBench, IFEval, and a curated multi-constraint dataset, with general-capability benchmarks remaining unchanged within 13 percentage points.
The juxtaposition is significant because it shows that group-relative normalization is not uniformly stable across reward regimes. In heterogeneous continuous-reward settings it is used to reduce variance; in discrete low-dispersion settings, additional mechanisms may be required to avoid vanishing or distorted learning signals.
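The three pathologies named above can be reproduced with plain z-score normalization on small reward groups. The sketch below reflects one reading of those terms and is not taken from the MDP-GRPO paper: the function name, the `eps` stabilizer, and the example rewards are assumptions.

```python
from statistics import fmean, pstdev


def zscore_advantages(rewards, eps=1e-8):
    """Plain within-group z-score normalization (eps is an assumed stabilizer)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Zero-variance collapse: identical rewards yield no learning signal at all.
all_fail = zscore_advantages([0.0, 0.0, 0.0, 0.0])

# Mean-centering blindness: a group where every sample succeeds looks
# exactly like a group where every sample fails.
all_pass = zscore_advantages([1.0, 1.0, 1.0, 1.0])

# Low-variance amplification: a 0.01 reward gap is scaled up to the same
# advantage as a full 0-versus-1 gap.
near_tie = zscore_advantages([0.50, 0.50, 0.50, 0.51])
wide_gap = zscore_advantages([0.0, 0.0, 0.0, 1.0])
```

In this reading, `all_fail` and `all_pass` are both all zeros, while `near_tie` and `wide_gap` produce nearly identical advantages despite a hundredfold difference in the underlying reward gap.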
6. Terminological scope and comparative profile
The term “MD-GRPO” is not used in a single uniform sense across the supplied literature. In GRXForm it denotes GRPO for amortized molecular optimization; in MRG-R1 it denotes GRPO for medical report generation; and in Co-GRPO it is used for masked diffusion models. By contrast, MDP-GRPO is a separate stabilized variant for multi-constraint instruction following rather than another use of the same label (Javaid et al., 12 Feb 2026, Wang et al., 18 Dec 2025, Zhou et al., 25 Dec 2025, Salmani-Zarchi et al., 4 Jun 2026).
The algorithmic differences are substantial. GRXForm uses a decoder-only Graph Transformer with chemically valid action masking and, in the reported scaffold-conditioned setting, centered rewards 14 without entropy regularization. MRG-R1 uses a Med-LVLM fine-tuned with LoRA, a clipped surrogate, KL regularization toward a frozen reference policy, and an MCCS-based clinical reward augmented by a lightweight format reward. Co-GRPO treats masked diffusion inference itself as an MDP and jointly optimizes denoiser parameters and schedule parameters. A plausible implication is that “MD-GRPO” functions more as a family resemblance around group-relative policy optimization than as a single canonical algorithmic specification.
| Setting | Mechanism | Reported outcome |
|---|---|---|
| GRXForm molecular optimization | Per-scaffold centered rewards in scaffold-conditioned fine-tuning | Obj. Score 15, Success Rate 16 |
| MRG-R1 medical report generation | Clipped GRPO with MCCS reward and format reward | CE-F1 17 on IU X-Ray, 18 on MIMIC-CXR |
| Co-GRPO masked diffusion | Joint 19 optimization of model and schedule | ImageReward 20, HPSv2 21, GenEval 22, DPG-Bench 23 |
One recurrent misconception is that GRPO implies a fixed normalization rule. The supplied literature shows otherwise: some formulations use z-score normalization, some use centered rewards only, and some add PPO-style clipping and KL penalties. Another is that grouping is always required. In the PMO de-novo benchmark, GRXForm uses standard REINFORCE rather than grouping because all starts are empty. The unifying point is therefore not a single implementation detail, but the use of within-context relative rewards to shape policy updates under trajectory-level objectives.