
Dual-GRPO: Dual Group Relative Policy Optimization

Updated 4 July 2026
  • Dual-GRPO is a design pattern that extends group-relative policy optimization (GRPO) with a second, coupled optimization structure.
  • It targets reward signals that are weak, noisy, or coarse, using mechanisms such as dual reward guidance, dual anchors, and dual controllers.
  • Variants have been applied to photorealistic portrait generation, multimodal reasoning, and multi-stage chain-of-thought optimization, with reported empirical gains.

“Dual-GRPO” is best understood as an umbrella label for Group Relative Policy Optimization variants in which the standard group-relative surrogate is augmented by a second, coupled optimization structure. In recent work, that second structure has taken several distinct forms: exemplar-driven sampling plus dual reward guidance in photorealistic portrait generation (Li et al., 25 Jun 2026), dual-anchor advantages for low-dispersion verifiable rewards (Salmani-Zarchi et al., 4 Jun 2026), a shared probe state that controls both clipping and temperature (Hu et al., 20 May 2026), a two-tier reward for answer correctness and reasoning consistency (Chen et al., 19 Jun 2025), and two-level or two-stage decompositions of reasoning and correction (Wang et al., 29 Sep 2025, Ding et al., 5 Jun 2025). This suggests that “Dual-GRPO” is not a single canonical algorithm, but a recurrent design pattern for extending GRPO when a single group-relative signal is too weak, too noisy, or too coarse.

1. Canonical GRPO substrate

All Dual-GRPO interpretations retain the basic GRPO scaffold: for each conditioning input, a policy samples a group of rollouts, assigns a scalar reward to each rollout, normalizes rewards within the group, and applies a PPO-style clipped update with those group-relative advantages. In the text-to-image formulation used by PortraitGen, the advantage is written as

$\hat{A}^i_t = \frac{r_i - \text{mean}(\{r_i\}_{i=1}^G)}{\text{std}(\{r_i\}_{i=1}^G)},$

and the policy update uses the standard clipped importance-ratio form with KL regularization to a reference policy (Li et al., 25 Jun 2026).
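The group normalization above can be sketched in a few lines of Python. This is a minimal illustration rather than any paper's implementation: the function name `group_relative_advantages` is hypothetical, the population standard deviation is an assumed convention, and the small `eps` is an assumption added only to avoid division by zero.

```python
from statistics import fmean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Z-score each reward against the other rollouts in its own group."""
    mu = fmean(rewards)       # group mean
    sigma = pstdev(rewards)   # population std (assumed convention)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, four rollouts, one scalar reward each (toy values).
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Every rollout in the group shares the same advantage across its tokens or denoising steps; no learned value function is involved.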

The same critic-free structure appears in language-model and instruction-following variants. MDP-GRPO restates the standard GRPO normalization as

$z_i = \frac{r_i - \mu_{\text{group}}}{\sigma_{\text{group}} + \epsilon},$

with $\mu_{\text{group}}$ and $\sigma_{\text{group}}$ computed over the completions in a prompt-local group (Salmani-Zarchi et al., 4 Jun 2026). AGPO likewise starts from the GRPO surrogate

$J_{\text{GRPO}}(\theta) = \mathbb{E}\Biggl[ \frac{1}{G}\sum_{i=1}^{G} \Bigl( \min\bigl( \rho_i(\theta)\,A_i,\ \operatorname{clip}(\rho_i(\theta),1-\varepsilon,1+\varepsilon)\,A_i \bigr) -\beta\,D_{\mathrm{KL}}\bigl(\pi_\theta\parallel\pi_{\text{ref}}\bigr) \Bigr) \Biggr],$

and changes only how $A_i$, $\varepsilon$, and the rollout temperature are controlled (Hu et al., 20 May 2026).
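As a rough illustration of the clipped surrogate, the sketch below evaluates it for a single group from precomputed quantities. The function names, the default values `clip_eps=0.2` and `beta=0.04`, and the use of one KL estimate per rollout are assumptions made for the example; they are not taken from the cited papers.

```python
def clipped_term(ratio, advantage, clip_eps=0.2):
    """PPO-style pessimistic term: min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)

def grpo_surrogate(ratios, advantages, kl_terms, clip_eps=0.2, beta=0.04):
    """Average the clipped term minus a KL penalty over the G rollouts of a group.

    ratios[i]     : importance ratio rho_i between new and old policy
    advantages[i] : group-relative advantage A_i
    kl_terms[i]   : per-rollout estimate of KL(pi_theta || pi_ref) (assumed input)
    """
    G = len(ratios)
    total = 0.0
    for rho, adv, kl in zip(ratios, advantages, kl_terms):
        total += clipped_term(rho, adv, clip_eps) - beta * kl
    return total / G

# A ratio above 1 + eps earns no extra credit when the advantage is positive.
capped = clipped_term(1.5, 1.0)
# With a negative advantage, the unclipped (more pessimistic) value is kept.
uncapped = clipped_term(1.5, -1.0)
```

The surrogate is maximized; in a training loop the negative of this value would be used as the loss.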

What distinguishes Dual-GRPO variants is therefore not the removal of the GRPO core, but the insertion of a second structure that supplements pure group-relative z-scoring. That second structure may act on the reward, the advantage, the sampling group, the rollout temperature, or the stage decomposition of the policy itself.

2. Dual reward guidance

The most direct Dual-GRPO instantiation is “PortraitGen: Exemplar-Driven GRPO with Dual-Reward Guidance for Photorealistic Portrait Generation” (Li et al., 25 Jun 2026). Its duality is explicit at the reward level. Each GRPO group contains $G-1$ model-generated images plus one real exemplar image inverted with BELM, and rewards are computed by two complementary models: OmniReward for general image quality and AI-Portrait for human-centric fidelity. OmniReward scores each image on Content, Clarity, Lighting and Color, and Composition, while AI-Portrait performs exhaustive within-group pairwise comparisons and assigns a win-rate

$R_{win}(o_i) = \frac{1}{G-1} \sum_{j=1, j \neq i}^G \mathbb{I}(o_i \succ o_j),$

where $\mathbb{I}(o_i \succ o_j)$ is $1$ if $o_i$ contains fewer synthetic artifacts than $o_j$, and $0$ otherwise (Li et al., 25 Jun 2026). The paper refers to this combination as a “Dual-Reward” module.
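The win-rate is simple to compute once a pairwise comparator is available. In the sketch below the comparator is a stand-in: `prefers(a, b)` is a hypothetical callable, and the toy data represents each image by an artifact count rather than by the output of a real preference model.

```python
def win_rates(items, prefers):
    """Fraction of the other G-1 group members that each item beats.

    prefers(a, b) must return True when a is judged better than b.
    """
    G = len(items)
    rates = []
    for i in range(G):
        wins = sum(
            1 for j in range(G)
            if j != i and prefers(items[i], items[j])
        )
        rates.append(wins / (G - 1))
    return rates

# Toy stand-in: each "image" is just its artifact count; fewer artifacts wins.
artifact_counts = [3, 0, 5, 1]
rates = win_rates(artifact_counts, lambda a, b: a < b)
```

Because every pair is compared, the cost grows quadratically with the group size, which is one reason exhaustive comparison is practical only for small groups.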

A closely related reward-level duality appears in GRPO-CARE for multimodal reasoning (Chen et al., 19 Jun 2025). There, standard outcome-supervised GRPO is extended with a two-tier reward: a base reward for answer correctness and formatting, plus an adaptive consistency bonus computed from a slowly evolving EMA reference model. For each sampled trajectory, the reference model estimates the likelihood of the answer conditioned on the reasoning trace, and a consistency bonus is awarded only to relatively high-accuracy trajectories whose clipped likelihood exceeds the group baseline (Chen et al., 19 Jun 2025). The total reward adds this bonus to the base reward, and the explicit KL penalty of standard GRPO is removed in favor of this reward-side regularization (Chen et al., 19 Jun 2025).

In both cases, the second reward channel is not a learned critic. It is a second comparator acting on a different failure mode: OmniReward complements AI-Portrait’s anti-artifact discrimination, and answer correctness is complemented by reasoning-to-answer consistency. Dual-GRPO in this sense denotes GRPO with two simultaneously active reward semantics rather than a single scalar preference.

3. Dual anchors and dual statistical control

A second lineage of Dual-GRPO modifies the advantage estimator itself. MDP-GRPO addresses three pathologies of z-score normalization under discrete multi-constraint rewards—low-variance amplification, mean-centering blindness, and zero-variance collapse—by adding a second anchor to the standard group-relative score (Salmani-Zarchi et al., 4 Jun 2026). The first anchor is the ordinary GRPO term

$z_i = \frac{r_i - \mu_{\text{group}}}{\sigma_{\text{group}} + \epsilon},$

while the second is a goal-aware absolute anchor. After prospect-theoretic shaping, the two anchors are combined into the final advantage.

This is a literal dual-anchor GRPO: one term preserves relative ranking within the group, the other restores an absolute notion of how far the rollout is from satisfying the constraint set (Salmani-Zarchi et al., 4 Jun 2026).
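The motivation for a second anchor can be seen numerically by applying plain z-scoring to a few degenerate groups. The sketch below shows only the behavior of the ordinary group-relative term; it does not implement MDP-GRPO's absolute anchor, and the reward values are invented for illustration.

```python
from statistics import fmean, pstdev

def z_scores(rewards, eps=1e-8):
    """Ordinary group-relative normalization (population std, assumed)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Zero variance: a group that fails every constraint and a group that
# satisfies every constraint both receive all-zero advantages, so the
# absolute reward level is invisible to the update.
all_fail = z_scores([0.0, 0.0, 0.0, 0.0])
all_pass = z_scores([1.0, 1.0, 1.0, 1.0])

# Low variance: a 0.01 reward gap is stretched into a large advantage
# because the group standard deviation is tiny.
near_tie = z_scores([0.50, 0.50, 0.50, 0.51])
```

In the last group the single slightly better rollout receives an advantage of roughly 1.7, even though its reward differs from the others by only 0.01.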

AGPO introduces a different kind of duality: not two advantage anchors, but two controllers driven by one shared probe-derived state (Hu et al., 20 May 2026). Its uncertainty score is computed from three probe statistics: reward dispersion, probe vote entropy, and safeguarded reward skewness (Hu et al., 20 May 2026). The same state then drives two separate control laws, one setting the rollout temperature and the other setting the adaptive clip radius.

AGPO therefore operationalizes Dual-GRPO as “one probe, two controllers”: one controls exploration, the other the trust region (Hu et al., 20 May 2026).

4. Duality in group construction

Another interpretation of Dual-GRPO changes the composition of the GRPO group itself. PortraitGen is the clearest example. Instead of sampling all $G$ rollouts from the old policy, it generates $G-1$ portraits through the reverse SDE and inserts one real photograph from the training set as the final member of the group. BELM inversion is used to recover the exemplar’s intermediate latents and step-wise trajectory probabilities, so that the real image can be treated as another rollout in the GRPO objective (Li et al., 25 Jun 2026). This produces a dual group: model-generated samples versus an inverted real exemplar.

At the opposite extreme, “It Takes Two: Your GRPO Is Secretly DPO” reduces the group to its minimal nontrivial size and argues that GRPO is fundamentally contrastive (Wu et al., 1 Oct 2025). In the $G=2$ case, rewards induce a positive rollout and a negative rollout, and the resulting 2-GRPO objective contrasts the two.

Under binary rewards, the within-pair advantage effectively collapses to a sign: the better rollout is pushed up, the worse rollout is pushed down, and ties yield no update (Wu et al., 1 Oct 2025). Dual-GRPO in this interpretation means pairwise positive-versus-negative GRPO.
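A short check makes the pairwise behavior concrete. The sketch applies ordinary group normalization to a group of two; the population standard deviation is an assumed convention, under which a decisive pair gets advantages of magnitude one (a sample standard deviation would change the magnitude but not the sign).

```python
from statistics import fmean, pstdev

def pair_advantages(r_a, r_b, eps=1e-8):
    """Group-relative advantages for a group of exactly two rollouts."""
    mu = fmean([r_a, r_b])
    sigma = pstdev([r_a, r_b])   # population std (assumed convention)
    return ((r_a - mu) / (sigma + eps), (r_b - mu) / (sigma + eps))

win_lose = pair_advantages(1.0, 0.0)   # first rollout correct, second wrong
both_right = pair_advantages(1.0, 1.0)  # tie: no learning signal
both_wrong = pair_advantages(0.0, 0.0)  # tie: no learning signal
```

Only pairs with one correct and one incorrect rollout contribute a gradient, which is the sense in which the two-rollout objective behaves like a contrastive preference loss.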

Multi-GRPO provides a broader grouping formalism and makes the “dual reward group” case explicit (Lyu et al., 30 Nov 2025). In its multi-objective setting, rewards are normalized independently before aggregation; with exactly two reward groups, the combined advantage aggregates the two independently normalized advantages.

The paper states that a hypothetical Dual-GRPO is therefore a constrained or simplified Multi-GRPO instance, typically with two reward groups, two branches, or both (Lyu et al., 30 Nov 2025).
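The two-reward-group case can be sketched as follows. Independent normalization is taken from the description above; the aggregation rule is not, so the weighted sum and the weights `w_a` and `w_b` are assumptions made purely for illustration, as are the function names and toy rewards.

```python
from statistics import fmean, pstdev

def normalize(rewards, eps=1e-8):
    """Z-score one reward channel within the group."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def dual_group_advantages(rewards_a, rewards_b, w_a=0.5, w_b=0.5):
    """Normalize each reward channel separately, then combine.

    The weighted sum is an assumed aggregation rule, not the paper's.
    """
    z_a = normalize(rewards_a)
    z_b = normalize(rewards_b)
    return [w_a * a + w_b * b for a, b in zip(z_a, z_b)]

# Two channels on very different scales that rank the rollouts oppositely.
quality = [10.0, 20.0, 30.0, 40.0]   # e.g. a 0-100 score
fidelity = [0.4, 0.3, 0.2, 0.1]      # e.g. a 0-1 score
combined = dual_group_advantages(quality, fidelity)
```

Because each channel is standardized before mixing, the large-scale channel does not dominate: in this toy case the two rankings cancel and every combined advantage is approximately zero, whereas summing raw rewards first would follow `quality` almost exclusively.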

5. Two-level and two-stage optimization

A further family of Dual-GRPO variants factorizes the rollout into two semantic stages rather than two reward channels. GRPO-MA decomposes chain-of-thought training into a thought policy and an answer policy, both sharing the same underlying parameters (Wang et al., 29 Sep 2025). For each prompt, the model samples several thoughts and then several answers per thought. Each thought receives a value computed from its sampled answers; the thought advantage is obtained by normalizing these values across thoughts, and the answer advantage by normalizing rewards across all answers.

The final loss is the sum of a GRPO-style term on thought tokens and a GRPO-style term on answer tokens (Wang et al., 29 Sep 2025). Here the duality is architectural: reasoning and answer generation receive distinct relative feedback.
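The two-level normalization can be sketched on a small grid of answer rewards. Several details here are assumptions rather than statements from the paper: the thought value is taken to be the mean reward of that thought's answers, every thought is assumed to have the same number of answers, and all names are hypothetical.

```python
from statistics import fmean, pstdev

def normalize(values, eps=1e-8):
    """Z-score a flat list of values."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def two_level_advantages(answer_rewards):
    """answer_rewards[k][m] is the reward of answer m sampled under thought k."""
    # Thought level: one value per thought (assumed: mean of its answers),
    # normalized across the thoughts of the same prompt.
    thought_values = [fmean(row) for row in answer_rewards]
    thought_adv = normalize(thought_values)

    # Answer level: normalize over every answer of every thought at once.
    flat = [r for row in answer_rewards for r in row]
    flat_adv = normalize(flat)
    per_thought = len(answer_rewards[0])  # assumed equal for all thoughts
    answer_adv = [
        flat_adv[k * per_thought:(k + 1) * per_thought]
        for k in range(len(answer_rewards))
    ]
    return thought_adv, answer_adv

# Three thoughts, three answers each, binary correctness rewards (toy data).
rewards = [[1.0, 1.0, 0.0],
           [0.0, 0.0, 0.0],
           [1.0, 0.0, 0.0]]
thought_adv, answer_adv = two_level_advantages(rewards)
```

The thought advantages would weight the thought tokens and the answer advantages the answer tokens, so a thought that reliably leads to correct answers is reinforced even when one of its answers fails.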

Multi-Layer GRPO pushes the stage decomposition further by explicitly introducing two GRPO layers (Ding et al., 5 Jun 2025). The first layer uses standard GRPO to generate an initial response for the original query. The second layer receives the original query together with the first-layer response and is trained, again with GRPO, to identify and correct errors in that initial response. The same policy is shared across both layers, but the training distribution is dual-stage: the original query alone for solution generation, then the query paired with the first-layer response for self-correction (Ding et al., 5 Jun 2025).

These methods suggest a broader definition of Dual-GRPO in which the two “sides” are not two reward heads but two factorized subproblems. One side learns to produce an initial trajectory; the other learns to refine, confirm, or correct it. The common GRPO machinery is retained, but the rollout semantics become explicitly hierarchical.

6. Theoretical unification and empirical status

The most formal mathematical account comes from $f$-GRPO and $f$-HAL, which recast GRPO-style alignment as variational $f$-divergence estimation between aligned and unaligned distributions (Haldar et al., 5 Feb 2026). In that framework, GRPO is generalized into an on-policy divergence estimator between reward-aligned and reward-unaligned distributions, while $f$-HAL interpolates between on-policy reward alignment and off-policy preference alignment. The paper states that this “provides the mathematical blueprint for a Dual‑GRPO algorithm,” through explicit dual critics, dual multipliers, or simultaneous optimization of reward-based and preference-based divergences (Haldar et al., 5 Feb 2026). This does not define a single Dual-GRPO loss, but it does supply a unifying interpretation: duality may refer to two distributions, two objectives, or a primal–dual variational structure.

Empirically, the reported gains are heterogeneous but consistently favorable. PortraitGen, the paper that explicitly presents itself as a “Dual-GRPO” interpretation, reaches OmniReward Content 0.97, UnifiedReward Coherence 3.83, and PickScore 22.77 on PortraitBench (Li et al., 25 Jun 2026). GRPO-CARE reports a 6.7% gain on the hardest SEED-Bench-R1 evaluation level together with a 24.5% improvement in reasoning–answer consistency (Chen et al., 19 Jun 2025). MDP-GRPO improves strict constraint satisfaction by up to 5.0% on Llama-3.2-3B under discrete multi-constraint rewards (Salmani-Zarchi et al., 4 Jun 2026). AGPO, which uses dual statistical feedback rather than dual rewards, reaches 67.3% on GSM8K and 40.5% on MATH with Qwen2.5-14B under the same generated-token budget as PPO and GRPO (Hu et al., 20 May 2026). At the minimal pairwise end, 2-GRPO is reported to achieve performance on par with 16-GRPO, using only 1/8 of the rollouts and reducing training time by over 70% (Wu et al., 1 Oct 2025).

A recurring misconception is that Dual-GRPO names one specific algorithm. The surveyed literature suggests otherwise. In current usage, the label can denote dual rewards, dual anchors, dual controllers, dual groups, dual stages, or dual variational objectives. Another misconception is that duality necessarily introduces a learned critic; in nearly all of these variants, the method remains critic-free and keeps GRPO’s group-relative backbone intact. The more precise reading is therefore structural: Dual-GRPO denotes GRPO with two coupled sources of relative guidance, introduced to overcome specific failure modes of single-signal group normalization.
