Direct Group Preference Optimization (DGPO)

Updated 4 July 2026

Direct Group Preference Optimization (DGPO) is an extension of Direct Preference Optimization that optimizes group-level comparisons instead of isolated pairs.
DGPO employs techniques like group-normalized rewards and relative advantage scoring to aggregate and refine preference signals across various applications.
Empirical results indicate that DGPO improves model alignment and performance in tasks such as text–video retrieval, diffusion modeling, and image super-resolution while reducing training costs.

Direct Group Preference Optimization (DGPO) denotes a class of methods that extend Direct Preference Optimization (DPO) beyond a single preferred–dispreferred pair and instead optimize over groups of responses, samples, captions, or belief-conditioned outputs. In current arXiv usage, the label appears both explicitly and by close analogy: as “Dual-Group Direct Preference Optimization” for text–video retrieval, “Group Preference Optimization” for diffusion self-improvement, “GroupDPO” for multi-response language-model alignment, “GDPO” for one-step image super-resolution, “DGPO” for online diffusion reinforcement learning with deterministic ODE samplers, and “Group Distributional Preference Optimization” for pluralistic alignment (Lee et al., 20 Sep 2025, Chen et al., 16 May 2025, Leng et al., 17 Apr 2026, Yi et al., 16 Mar 2026, Luo et al., 9 Oct 2025, Yao et al., 2024). The common structure is the replacement of isolated pairwise supervision with grouped comparisons, group-normalized rewards, or group-conditioned preference distributions.

1. From pairwise DPO to grouped preference optimization

Standard DPO is formulated on a preference dataset of triplets $(x,y_w,y_l)$, with $y_w$ preferred to $y_l$, and uses an implicit reward proxy

$\hat r_\theta(x,y)=\beta \log \frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)},$

together with a logistic loss

$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\mathbb{E}_{(x,y_w,y_l)} \left[ \log \sigma\bigl(\hat r_\theta(x,y_w)-\hat r_\theta(x,y_l)\bigr) \right].$

This construction enforces that the preferred output receive a larger implicit reward than the dispreferred output (Lee et al., 20 Sep 2025).
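The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming that sequence log-probabilities under the policy and the reference model have already been computed; the function names and the tuple layout of `batch` are hypothetical and are not taken from any of the cited papers.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """beta * log(pi_theta(y|x) / pi_ref(y|x)), computed from log-probabilities."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(batch, beta: float = 0.1) -> float:
    """Mean of -log sigmoid(r(x, y_w) - r(x, y_l)) over the batch.

    Each item is (logp_w, ref_logp_w, logp_l, ref_logp_l): log-probabilities
    of the preferred and dispreferred outputs (assumed layout).
    """
    total = 0.0
    for logp_w, ref_w, logp_l, ref_l in batch:
        margin = implicit_reward(logp_w, ref_w, beta) - implicit_reward(logp_l, ref_l, beta)
        total -= log_sigmoid(margin)
    return total / len(batch)
```

When the policy equals the reference model, both implicit rewards are zero and the loss is $\log 2$; it falls below $\log 2$ once the policy raises the preferred output relative to the dispreferred one.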

DGPO variants alter the comparison unit rather than the basic preference-learning intuition. In the language-model setting, GroupDPO starts from the observation that most existing methods train on a single positive-negative pair per prompt even though preference datasets often contain multiple candidate responses; it therefore defines a group $g$ with positive set $P_g$ and negative set $N_g$, and optimizes a group-level loss

$\mathcal{L}_{\text{group}}(\theta)=\frac{1}{G}\sum_{g=1}^{G}\phi_g\big(u_{P_g}(\theta),u_{N_g}(\theta)\big),$

where $u_\theta(y\mid x)=\beta(\log \pi_\theta(y\mid x)-\log \pi_{\mathrm{ref}}(y\mid x))$ and $\phi_g$ may be a mean-gap, all-pairs, MPO, or softmax objective (Leng et al., 17 Apr 2026).

Other DGPO formulations generalize DPO in different directions. Dual-Group DPO introduces both local within-input preferences and global cross-input preferences for caption generation in retrieval (Lee et al., 20 Sep 2025). Diffusion-oriented GPO and GDPO replace pairwise labels by groupwise reward statistics or group-relative advantages (Chen et al., 16 May 2025, Yi et al., 16 Mar 2026). Group Distributional Preference Optimization factorizes preference alignment into belief-distribution calibration and belief-conditioned preference optimization, so that the model aligns to a distribution of group preferences rather than a single dominant one (Yao et al., 2024).

2. Recurrent mathematical constructions

A first recurrent construction is the positive–negative set partition. GroupDPO assumes only a partial order: all elements of $P_g$ are preferred to all elements of $N_g$, while responses within each set remain unordered. This supports objectives such as Margin DPO, which contrasts the mean positive score against the mean negative score, and All-Pairs DPO, which averages a DPO term over every positive–negative pair in the group (Leng et al., 17 Apr 2026).
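The two objectives named above can be sketched as follows. This is an illustrative reading rather than the GroupDPO implementation: it assumes the per-response scores $u_\theta(y\mid x)$ are already available as floats, uses the logistic loss for both variants, and all function names are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def margin_loss(u_pos, u_neg) -> float:
    """Mean-gap objective: contrast the mean positive score with the mean negative score."""
    gap = sum(u_pos) / len(u_pos) - sum(u_neg) / len(u_neg)
    return -log_sigmoid(gap)


def all_pairs_loss(u_pos, u_neg) -> float:
    """Average a pairwise DPO term over every positive-negative pair in the group."""
    terms = [-log_sigmoid(p - n) for p in u_pos for n in u_neg]
    return sum(terms) / len(terms)


def group_loss(groups, phi) -> float:
    """Average a per-group objective phi over groups given as (u_pos, u_neg) tuples."""
    return sum(phi(u_pos, u_neg) for u_pos, u_neg in groups) / len(groups)
```

With one positive and one negative response, both objectives reduce to the pairwise DPO term. Because $-\log\sigma$ is convex, Jensen's inequality implies that the all-pairs value is never smaller than the mean-gap value on the same group.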

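A second construction is group-normalized reward weighting. In diffusion-model GPO, one generates a group of samples $\{x_i\}_{i=1}^{K}$, computes scalar rewards $r_i$, and standardizes them within the group using the mean $\mu$ and standard deviation $\sigma$ of the group's rewards:

$A_i=\frac{r_i-\mu}{\sigma},\qquad \mu=\frac{1}{K}\sum_{j=1}^{K} r_j,\qquad \sigma=\sqrt{\frac{1}{K}\sum_{j=1}^{K}(r_j-\mu)^2}.$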

The loss then becomes a weighted sum of policy–reference denoising-score differences over group members, so that samples above the group mean receive positive coefficients and samples below the mean receive negative coefficients. This makes the optimization margin-aware rather than purely ordinal (Chen et al., 16 May 2025).
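A small sketch of the standardization step and the resulting weighted sum follows. It assumes a population standard deviation with a small `eps` for numerical safety, and it treats each policy–reference score difference as a precomputed scalar. Both are simplifications, and the sign with which the weighted sum enters the final loss is deliberately left out because it is not specified here.

```python
import math


def standardize(rewards, eps: float = 1e-8):
    """Z-score rewards within one group (population std; eps is an assumption)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def group_weighted_sum(rewards, score_diffs) -> float:
    """Weight each sample's policy-minus-reference score difference by its z-scored reward.

    score_diffs[i] is a hypothetical scalar summary of the denoising-score
    difference for sample i.
    """
    coeffs = standardize(rewards)
    return sum(a * d for a, d in zip(coeffs, score_diffs))
```

For rewards `[1.0, 2.0, 3.0]` the coefficients are negative, zero, and positive respectively, and they sum to zero. A group whose rewards are all equal therefore contributes no signal.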

A third construction is group-relative advantage inside a DPO-like surrogate. In GDPO for one-step generative image super-resolution, a group of $K$ online-generated outputs for the same input is scored by an attribute-aware reward function, standardized into group-relative advantages

$A_i=\frac{r_i-\operatorname{mean}(\{r_j\}_{j=1}^{K})}{\operatorname{std}(\{r_j\}_{j=1}^{K})},$

and inserted into a DPO-style log-sigmoid objective through a weighted sum of policy–reference squared noise-prediction errors (Yi et al., 16 Mar 2026).
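One way to read this construction is sketched below. The exact objective is not reproduced here, so the form $-\log\sigma\bigl(-\beta\sum_i A_i(e^{\theta}_i-e^{\mathrm{ref}}_i)\bigr)$ is an assumption modeled on the pairwise Diffusion-DPO loss, with $e_i$ denoting a squared noise-prediction error. All names are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def group_advantages(rewards, eps: float = 1e-8):
    """Group-relative advantages: rewards standardized within the group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def gdpo_style_loss(rewards, err_policy, err_ref, beta: float = 1.0) -> float:
    """Assumed surrogate: log-sigmoid of an advantage-weighted error difference.

    err_policy[i] and err_ref[i] are squared noise-prediction errors of the
    policy and the reference model on sample i (precomputed scalars).
    """
    adv = group_advantages(rewards)
    weighted = sum(a * (ep - er) for a, ep, er in zip(adv, err_policy, err_ref))
    return -log_sigmoid(-beta * weighted)
```

Under this assumed form, a group of two samples with different rewards receives advantages of approximately $+1$ and $-1$, so the expression collapses to the familiar winner-minus-loser structure of pairwise Diffusion-DPO.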

A fourth construction is cross-input global ranking. Dual-Group DPO for captioning defines local preferences among captions of the same video and global preferences among video–caption pairs from different videos. Its loss combines a within-video (local) preference term with a cross-video (global) preference term, using a scalar weight to balance the two, thereby imposing a global ordering aligned with retrieval scores rather than only local within-input rankings (Lee et al., 20 Sep 2025).
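The distinction between local and global preference pairs can be illustrated with a small pair-construction sketch. It assumes each sampled caption already carries a scalar retrieval preference score. The data layout, the tie-skipping rule, and the additive weighting in `combined_loss` (with a hypothetical weight `lam`) are illustrative assumptions, not the procedure of the cited paper.

```python
from itertools import combinations


def local_pairs(scored):
    """Preference pairs among captions of the same video.

    scored maps video_id -> list of (caption, score). Returns a list of
    ((video_id, winner), (video_id, loser)) tuples.
    """
    pairs = []
    for vid, caps in scored.items():
        for (c1, s1), (c2, s2) in combinations(caps, 2):
            if s1 == s2:
                continue  # ties carry no preference
            winner, loser = (c1, c2) if s1 > s2 else (c2, c1)
            pairs.append(((vid, winner), (vid, loser)))
    return pairs


def global_pairs(scored):
    """Preference pairs among video-caption pairs drawn from different videos."""
    flat = [(vid, cap, s) for vid, caps in scored.items() for cap, s in caps]
    pairs = []
    for (v1, c1, s1), (v2, c2, s2) in combinations(flat, 2):
        if v1 == v2 or s1 == s2:
            continue  # same-video pairs are handled by local_pairs
        if s1 > s2:
            pairs.append(((v1, c1), (v2, c2)))
        else:
            pairs.append(((v2, c2), (v1, c1)))
    return pairs


def combined_loss(local_term: float, global_term: float, lam: float = 0.5) -> float:
    """Assumed weighting: local term plus lam times the global term."""
    return local_term + lam * global_term
```

The global pairs are what tie scores across videos together: a caption that wins within its own video can still lose to a caption from another video if its retrieval score is lower.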

A fifth construction is group-conditional latent structure. In Group Distributional Preference Optimization, the model is factorized through a latent belief $b$ as

$\pi_\theta(b,y\mid x)=\pi_\theta(b\mid x)\,\pi_\theta(y\mid x,b),$

and the training objective combines belief-distribution calibration with a belief-conditioned DPO-style loss. This directly targets the distribution of preferences within a group rather than collapsing to a single majority preference (Yao et al., 2024).
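The belief-conditioned structure can be made concrete with a toy sketch over a discrete belief set. It assumes the model exposes a belief distribution given the input and a response probability given each belief. Measuring calibration with a KL divergence against a target belief distribution is an illustrative choice, not necessarily the loss used in the cited paper, and all names are hypothetical.

```python
import math


def marginal_response_prob(belief_probs, cond_probs) -> float:
    """Sum over beliefs b of p(b | x) * p(y | x, b) for one fixed response y."""
    return sum(belief_probs[b] * cond_probs[b] for b in belief_probs)


def calibration_kl(target, predicted, eps: float = 1e-12) -> float:
    """KL(target || predicted) between two belief distributions (assumed calibration measure)."""
    return sum(
        t * math.log(t / max(predicted[b], eps))
        for b, t in target.items()
        if t > 0.0
    )
```

For example, with belief probabilities `{"a": 0.7, "b": 0.3}` and conditional response probabilities `{"a": 0.2, "b": 0.6}`, the marginal probability of the response is 0.7 × 0.2 + 0.3 × 0.6 = 0.32. The calibration term is zero only when the predicted belief distribution matches the target, which is what discourages collapse onto the majority belief.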

3. Principal variants and nomenclature

The literature does not use a single standardized name for groupwise DPO extensions. Several papers explicitly use DGPO or GDPO, while others use titles such as GroupDPO or GPO but describe methods that the authors themselves situate as group-based generalizations of DPO (Chen et al., 16 May 2025, Leng et al., 17 Apr 2026, Luo et al., 9 Oct 2025).

Variant	Core group definition	Domain
Dual-Group DPO	Local caption groups for one video plus global cross-video pairs	Text–video retrieval
GPO	$K$ samples per prompt with z-scored rewards	Text-to-image diffusion
GroupDPO	Positive and negative response sets per prompt	LLM alignment
GDPO	Online-generated sample group with group-relative advantages	One-step image super-resolution
DGPO	Positive and negative subsets inside rollout groups	Online diffusion RL
Directional-Groupwise Preference Optimization	Forward and reverse solution groups	Mathematical reasoning
Group Distributional Preference Optimization	Belief-conditioned responses under a target belief distribution	Pluralistic alignment

Dual-Group DPO is the most explicit retrieval-oriented use of the term. It supervises caption generation using retrieval preference scores and combines local and global caption groups, with the claim that standard single-group DPO does not enforce an absolute retrieval scale across videos (Lee et al., 20 Sep 2025). GPO for diffusion models is framed as extending DPO from pairwise to groupwise preferences and replacing hard pair labels with standardized reward coefficients, yielding a self-improvement loop that does not require external preference data (Chen et al., 16 May 2025).

GroupDPO focuses on offline and online language-model alignment when multiple responses are available per prompt. Its central contribution is not a new preference semantics but a memory-efficient surrogate that preserves first-order gradients while decoupling samples during backpropagation, making larger group sizes practical (Leng et al., 17 Apr 2026). GDPO for one-step super-resolution instead combines a Diffusion-DPO-style objective with GRPO-style group-relative advantages, using online sample groups and an attribute-aware reward function (Yi et al., 16 Mar 2026). The diffusion paper titled “Direct Group Preference Optimization” defines an online reinforcement-learning algorithm that learns from group-level preferences while dispensing with policy gradients and stochastic policies, thereby enabling deterministic ODE rollouts (Luo et al., 9 Oct 2025).
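The idea of a surrogate that preserves first-order gradients while decoupling samples can be illustrated with a linearization. The sketch below is one generic way to obtain that property and is not claimed to be the GroupDPO surrogate: it computes the coefficients $\partial\mathcal{L}/\partial u_i$ of a mean-gap logistic loss once, freezes them, and forms a sum in which each term depends on a single response score. All names are hypothetical.

```python
import math


def mean(xs) -> float:
    return sum(xs) / len(xs)


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def margin_loss(u_pos, u_neg) -> float:
    """-log sigmoid(mean(u_pos) - mean(u_neg)); couples every response in the group."""
    return softplus(-(mean(u_pos) - mean(u_neg)))


def frozen_coefficients(u_pos, u_neg):
    """Analytic dL/du_i at the current scores, to be treated as constants."""
    gap = mean(u_pos) - mean(u_neg)
    d_gap = -1.0 / (1.0 + math.exp(gap))  # derivative of softplus(-gap) w.r.t. gap
    c_pos = [d_gap / len(u_pos)] * len(u_pos)
    c_neg = [-d_gap / len(u_neg)] * len(u_neg)
    return c_pos, c_neg


def linear_surrogate(u_pos, u_neg, c_pos, c_neg) -> float:
    """Sum of frozen coefficient times score; each term involves one response only."""
    pos = sum(c * u for c, u in zip(c_pos, u_pos))
    neg = sum(c * u for c, u in zip(c_neg, u_neg))
    return pos + neg
```

Since the derivative of the surrogate with respect to each score is exactly its frozen coefficient, backpropagating through the surrogate one response at a time reproduces the first-order gradient of the coupled loss. In such a scheme the coefficients could be obtained from a forward pass that keeps no activations.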

Two further variants broaden the scope of the term. Directional-Groupwise Preference Optimization organizes forward and reverse question–answer instances into structured sets and optimizes a margin-based likelihood over group-aggregated scores, with an uncertainty-aware consistency head (Deng et al., 11 May 2026). Group Distributional Preference Optimization introduces beliefs and belief distributions as the latent organizing structure of a group, thereby moving from grouped comparisons to group-distributional alignment (Yao et al., 2024).

4. Application domains

In text–video retrieval, DGPO appears as a way to make auxiliary captions discriminative for retrieval rather than merely fluent. CaRe-DPO uses an MLLM-based retrieval model to assign preference scores to sampled captions, masks video tokens when constructing the preference score, and then trains a captioner with local and cross-video preference pairs. The resulting captions are reported to be more semantically aligned with queries, more descriptive of video content, less redundant by Self-BLEU, and higher in Distinct-1 and Distinct-2 (Lee et al., 20 Sep 2025).

In text-to-image diffusion, groupwise preference optimization is used both for self-improvement and for reinforcement learning. GPO samples groups of images from the current model, scores them with evaluators such as YOLO, PPOCR, BLIP-VQA, ImageReward, MPS, or aesthetic models, standardizes group rewards, and updates the denoiser without adding inference-time overhead (Chen et al., 16 May 2025). The diffusion DGPO paper pursues a different systems goal: it keeps the group-relative preference signal of GRPO while removing policy gradients, so that efficient deterministic ODE samplers can be used during online post-training (Luo et al., 9 Oct 2025).

In one-step generative image super-resolution, GDPO addresses a setting where standard RL methods had focused on multi-step generative ISR. It introduces a noise-aware one-step diffusion model to generate diverse outputs and then performs groupwise optimization using group-relative advantages and an attribute-aware reward function based on PSNR, MANIQA, and MUSIQ, with smooth-versus-texture weighting derived from gradient-entropy regions (Yi et al., 16 Mar 2026).

In large-language-model alignment, GroupDPO targets datasets that already contain multiple responses per prompt and therefore provide richer supervision than a single chosen–rejected pair. The paper studies groupwise objectives such as Margin, All-Pairs, MPO, and Softmax, and also covers both offline alignment and online RL-style settings where groups are formed from current policy samples and rule-based correctness labels (Leng et al., 17 Apr 2026). A closely related theoretical result argues that GRPO can be reframed as contrastive learning and that the minimal two-rollout case, 2-GRPO, can match 16-GRPO while using far fewer rollouts (Wu et al., 1 Oct 2025).
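A small worked case makes the contrastive reading of the two-rollout setting concrete. The sketch assumes the usual group-relative advantage (reward minus group mean, divided by group standard deviation, with a small `eps` added as an assumption) and binary correctness rewards. The function name is hypothetical.

```python
import math


def group_advantages(rewards, eps: float = 1e-8):
    """Group-relative advantages: (r_i - mean) / (std + eps), population std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


# Two rollouts, one correct and one incorrect: mean 0.5, std 0.5,
# so the advantages are approximately +1 and -1.
pair = group_advantages([1.0, 0.0])

# Two rollouts with the same reward: both advantages are zero.
tie = group_advantages([1.0, 1.0])
```

With two rollouts the group therefore behaves like a single chosen–rejected pair whenever the rewards differ and contributes nothing when they agree, which is the sense in which the minimal group can be viewed as a pairwise contrast.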

In pluralistic or group-structured alignment, GDPO means something different but related: the objective is to match a distribution of preferences within a group. The belief-conditioned factorization in Group Distributional Preference Optimization is designed to prevent DPO from collapsing onto dominant preferences when conflicting preference pairs arise from different underlying beliefs (Yao et al., 2024). In reasoning tasks, Directional-Groupwise Preference Optimization uses groups of forward and reverse solutions to preserve reasoning diversity while enforcing directional consistency (Deng et al., 11 May 2026).

5. Empirical behavior and systems implications

Reported gains are domain-specific but consistently support the claim that grouped supervision can outperform single-pair training. In text–video retrieval, the CaRe-DPO ablation reports only marginal improvement from SFT, an average R@1 gain from SG-DPO, and a larger average R@1 gain from DG-DPO, with DG-DPO consistently better than SG-DPO on all three benchmarks. The same work reports that DG-DPO captioning improves text-to-caption R@1 on ActivityNet and MSRVTT and also improves video-to-caption R@1 (Lee et al., 20 Sep 2025).

In diffusion self-improvement, GPO reports large gains on targeted capabilities. For Stable Diffusion 3.5 Medium, both accuracy and Pass@4 increase on counting and on text rendering. For Wan 1.3B, counting accuracy also increases. The paper emphasizes that this comes with no extra runtime cost at inference (Chen et al., 16 May 2025).

In online diffusion reinforcement learning, DGPO reports both higher quality and much lower training cost than Flow-GRPO. On GenEval, the overall score of SD3.5-M improves, and the paper reports a multiple-fold training speedup over existing state-of-the-art methods in general and over Flow-GRPO on GenEval in particular, at comparable performance (Luo et al., 9 Oct 2025). In one-step ISR, GDPO is reported to improve both full-reference and no-reference metrics over NAOSD, while also balancing fidelity and perceptual quality better than Diffusion-DPO and DanceGRPO (Yi et al., 16 Mar 2026).

In LLM alignment, GroupDPO reports that leveraging multiple responses consistently outperforms single-pair training in both offline and online settings, and that adding a negative log-likelihood term on positive responses is critical for both performance gains and training stability (Leng et al., 17 Apr 2026). The GRPO–DPO connection paper sharpens the computational point: 2-GRPO achieves performance on par with 16-GRPO while using only $1/8$ of the rollouts (2 instead of 16 per prompt) and substantially reducing training time (Wu et al., 1 Oct 2025). In reasoning alignment, Directional-Groupwise Preference Optimization reports an average improvement from constructed reverse data across five benchmarks and further average accuracy improvements from the DGPO framework itself (Deng et al., 11 May 2026).

These results suggest that “group” is often functioning as a mechanism for extracting denser supervisory structure from data already available—multiple rollouts, multiple candidate responses, multiple captions, or multiple belief-conditioned answers—rather than merely as a larger batch dimension.

6. Limitations, disagreements, and open directions

Current usage suggests that DGPO is better understood as a family of related constructions than as a single canonical algorithm. The underlying objects being grouped vary substantially: response sets, caption sets, rollout groups, belief distributions, or forward–reverse solution sets (Leng et al., 17 Apr 2026, Lee et al., 20 Sep 2025, Yao et al., 2024). A plausible implication is that comparisons across “DGPO” papers are often comparisons across distinct design choices about what constitutes a group and what signal should be shared across its members.

Several limitations recur across the literature. Reward quality and evaluator bias remain central in diffusion-oriented methods: GPO is bounded by base-model capability, can be limited when self-generated groups never contain good examples, and inherits biases from external evaluators such as YOLO, OCR, BLIP-VQA, and ImageReward (Chen et al., 16 May 2025). GDPO for one-step ISR depends on image-specific metrics and a task-specific smooth-versus-detailed region decomposition, so its reward structure is not directly portable without redesign (Yi et al., 16 Mar 2026). The diffusion DGPO paper also leaves text-to-video and broader multi-objective settings as future work (Luo et al., 9 Oct 2025).

Optimization and systems issues also persist. Group-wise losses can be memory-heavy because the objective couples all responses in a group; GroupDPO addresses this with a gradient-equivalent surrogate, but the need for such a surrogate is itself evidence that groupwise optimization introduces systems overhead absent from plain pairwise DPO (Leng et al., 17 Apr 2026). Related analysis of DPO identifies squeezing and probability collapse caused by rejected-response gradients, and proposes gradient gating as a complementary stabilization mechanism. This suggests that moving from pairwise to groupwise supervision does not by itself settle the geometry of preference optimization (Mouiche, 4 May 2026).

Task-specific DGPO variants introduce their own epistemic constraints. Directional-Groupwise Preference Optimization depends on reverse problems generated by another model, and the paper states that these reverse problems are not guaranteed to be perfect inverses; it also notes diminishing returns when too many reverse groups are added to small models and leaves extension beyond math and logic for future work (Deng et al., 11 May 2026). Group Distributional Preference Optimization currently focuses on one group at a time, relies on explicit or derivable belief labels, and notes that extension to high-dimensional or latent belief spaces remains open (Yao et al., 2024).

The main open direction is therefore not merely “larger groups,” but better control over what group structure means. Existing papers point toward several nonexclusive paths: cross-input global ranking, belief-conditioned pluralistic alignment, variational mixtures of preference experts, token-level reward guidance, and geometry-aware gradient control (Lee et al., 20 Sep 2025, Yao et al., 2024, Bohne et al., 9 Oct 2025, Zhu et al., 17 Jun 2025, Mouiche, 4 May 2026). Together they indicate that DGPO is becoming a general design pattern for aligning models to structured preference signals that are richer than a single chosen–rejected pair.