DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization

Published 11 May 2026 in cs.CL | (2605.10863v1)

Abstract: Although LLMs have made remarkable progress, current preference optimization methods still struggle to align directional consistency while preserving reasoning diversity. To address this limitation, we propose Directional-Groupwise Preference Optimization (DGPO), a lightweight framework that aggregates supervision signals at the group level and explicitly models direction-aware alignment through multi-candidate comparisons. DGPO organizes forward and reverse question-answer instances into structured sets and optimizes a margin-based likelihood objective that separates coherent reasoning paths from inconsistent alternatives. This group-wise formulation captures richer relative information than pairwise objectives and reinforces consistency across diverse reasoning pathways. Empirical results show that our constructed reverse data yields a 3.2% average improvement across five benchmarks, while DGPO further delivers consistent gains across multiple datasets and model families, achieving average accuracy improvements of up to 3.6%.


Authors (7)

Summary

The paper introduces DGPO, which aggregates group-level supervision to enforce directional consistency in LLM alignment.
It employs a margin-based contrastive loss and variance regularization to boost both forward and reverse inference performance.
Empirical results show that DGPO improves average accuracy by up to 3.6% over baselines, with the clearest gains on out-of-distribution tasks.

Directional Consistent Groupwise Optimization for LLM Alignment: An Analysis of DGPO

Motivation and Context

Preference optimization has become central to aligning LLMs with target behaviors, and it has historically relied on pairwise preference learning. This convention does not adequately model the bidirectional, path-diverse nature of human reasoning. Existing alignment approaches typically bias training toward forward-chained reasoning, neglect reverse inference, and do not enforce directional coherence across multiple solution candidates. They also tend to collapse the diversity of valid reasoning paths, which hampers generalization in complex domains. Task inversion failures, epitomized by the "Reversal Curse", further show that LLMs often fail to internalize bidirectional semantic links and may not generalize when a known-to-unknown mapping is inverted, which limits the effectiveness of earlier data augmentation with reverse exemplars.

DGPO ("Directional-Groupwise Preference Optimization") (2605.10863) directly addresses these gaps. It proposes a novel framework for LLM alignment that explicitly aggregates group-level supervision signals, encodes direction-aware reasoning consistency, and incorporates epistemic uncertainty in the optimization objective. This is accomplished through structured multi-candidate comparisons of forward and reverse problem instances, reinforced by a margin-based likelihood objective that is sensitive to the quality, directionality, and diversity of reasoning paths.

Prior direct preference optimization (DPO) methods and their variants (e.g., SimPO, KTO, B-DPO) have refined the alignment objective but remain fundamentally pairwise and lack mechanisms for modeling directionality at the group level. In parallel, backward or reverse supervision paradigms (e.g., MathGenie, Reverse Thinking, ReSocratic, and the Reason-from-Future approaches) have shown the importance of bidirectional data augmentation, but they either do not model directional signals directly or rely heavily on teacher model distillation. Existing groupwise optimization methods (GRPO, TreeRPO, DARS, Posterior-GRPO) improve diversity and robustness, yet they do not explicitly align directionality within group-structured preference learning.

DGPO advances the field by merging three axes previously treated in isolation:

Directional Consistency: Differentiates forward and reverse solution groups for each problem instance.
Diversity Preservation: Supervises on sets of alternative solution paths, not single reference outputs.
Uncertainty-Aware Aggregation: Incorporates estimated uncertainty to regulate group-level preference strength.

Methodological Innovations

DGPO starts from a set of highly curated reasoning instances (the LIMO dataset) and, for each instance, synthesizes both forward and reverse questions. Teacher models (DeepSeek V3 for question generation, Qwen3-32B for solution enumeration, and Qwen3-8B for fact-checking) then generate multiple complete reasoning trajectories per direction. Each prompt thus induces:

A preferred solution set (direction-consistent with the prompt),
A dispreferred set (solutions from the opposite direction).
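The two sets above can be sketched as a small data structure. This is a minimal illustration rather than the authors' pipeline: the names `DirectionalGroup` and `build_groups` are hypothetical, and the toy strings stand in for teacher-generated questions and reasoning trajectories.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DirectionalGroup:
    """One training group: a prompt with its preferred and dispreferred solutions."""
    prompt: str
    preferred: List[str]     # solutions written for this prompt's direction
    dispreferred: List[str]  # solutions written for the opposite direction


def build_groups(
    forward_q: str,
    reverse_q: str,
    forward_solutions: List[str],
    reverse_solutions: List[str],
) -> Tuple[DirectionalGroup, DirectionalGroup]:
    """Pair each question with same-direction (preferred) and
    opposite-direction (dispreferred) solution sets."""
    forward_group = DirectionalGroup(
        forward_q, list(forward_solutions), list(reverse_solutions)
    )
    reverse_group = DirectionalGroup(
        reverse_q, list(reverse_solutions), list(forward_solutions)
    )
    return forward_group, reverse_group


# Toy example (hypothetical data, not taken from LIMO).
fwd, rev = build_groups(
    forward_q="Given x + 3 = 7, find x.",
    reverse_q="Given x = 4, find c such that x + 3 = c.",
    forward_solutions=["Subtract 3 from both sides: x = 4."],
    reverse_solutions=["Substitute x = 4: c = 4 + 3 = 7."],
)
```

Under this reading, the same pool of trajectories serves both prompts, with the preferred and dispreferred roles swapped between the forward and reverse groups.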

Directional consistency for each (prompt, solution) tuple is estimated via a Beta-distributed posterior, parameterized by a trainable head on the model's final hidden representation, with both mean and uncertainty propagated through the optimization. The core preference score integrates policy likelihood, directional consistency (log-probability), and explicit uncertainty penalties. Group scores are aggregated using temperature-scaled log-sum-exp, acting as a smooth surrogate for winner selection.
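A minimal sketch of the scoring and aggregation described above follows. The summary does not give the exact functional form, so the additive combination and the weights `alpha` and `lam` are assumptions made for illustration, as is the choice to scale the log-sum-exp by the temperature `tau`. The Beta parameters `a` and `b` would come from the trainable head; here they are plain numbers.

```python
import math


def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    total = a + b
    mean = a / total
    var = (a * b) / (total * total * (total + 1.0))
    return mean, var


def candidate_score(logp, a, b, alpha=1.0, lam=1.0):
    """Assumed score: policy log-likelihood, plus the log of the expected
    directional consistency, minus an uncertainty (variance) penalty."""
    mean, var = beta_mean_var(a, b)
    return logp + alpha * math.log(mean) - lam * var


def group_score(scores, tau=1.0):
    """Temperature-scaled log-sum-exp: tau * log(sum(exp(s / tau))).
    Subtracting the max first keeps the exponentials numerically stable."""
    m = max(scores)
    return m + tau * math.log(sum(math.exp((s - m) / tau) for s in scores))


# Hypothetical candidates given as (log-likelihood, a, b).
candidates = [(-1.2, 8.0, 2.0), (-0.9, 3.0, 3.0), (-2.0, 9.0, 1.0)]
scores = [candidate_score(lp, a, b) for lp, a, b in candidates]
smooth = group_score(scores, tau=1.0)
sharp = group_score(scores, tau=0.01)  # approaches max(scores) as tau shrinks
```

As `tau` shrinks, the aggregate approaches the best candidate's score, which is the sense in which log-sum-exp acts as a smooth surrogate for winner selection; larger `tau` lets every candidate in the group contribute.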

The primary DGPO training objective is a contrastive loss that maximizes a margin between aggregated preferred and dispreferred group scores. Additional regularization terms include:

Directional KL penalty: Aligns posterior consistency estimates with asymmetric Beta priors reflecting expected directional alignment.
Variance regularization: Penalizes high predictive uncertainty within groups.
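The objective and its two regularizers can be sketched as below. This is an illustrative reading, not the paper's stated formula: the logistic margin form `softplus(margin - (s_pref - s_disp))`, the weights `w_kl` and `w_var`, and the priors Beta(4, 1) for preferred and Beta(1, 4) for dispreferred candidates are all assumptions. The KL term uses the standard closed form for two Beta distributions, with a small hand-written digamma so the block needs only the standard library.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def digamma(x):
    """Digamma via psi(x) = psi(x + 1) - 1/x, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def log_beta_fn(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def kl_beta(a, b, a0, b0):
    """Closed-form KL( Beta(a, b) || Beta(a0, b0) )."""
    psi_ab = digamma(a + b)
    return (
        log_beta_fn(a0, b0) - log_beta_fn(a, b)
        + (a - a0) * (digamma(a) - psi_ab)
        + (b - b0) * (digamma(b) - psi_ab)
    )


def beta_variance(a, b):
    total = a + b
    return (a * b) / (total * total * (total + 1.0))


def dgpo_style_loss(s_pref, s_disp, pref_params, disp_params,
                    margin=0.5, w_kl=0.1, w_var=0.1,
                    pref_prior=(4.0, 1.0), disp_prior=(1.0, 4.0)):
    """Margin term on aggregated group scores plus KL and variance penalties.
    pref_params / disp_params are lists of (a, b) posterior parameters."""
    contrastive = softplus(margin - (s_pref - s_disp))
    kl = sum(kl_beta(a, b, *pref_prior) for a, b in pref_params)
    kl += sum(kl_beta(a, b, *disp_prior) for a, b in disp_params)
    var = sum(beta_variance(a, b) for a, b in pref_params + disp_params)
    n = len(pref_params) + len(disp_params)
    return contrastive + w_kl * kl / n + w_var * var / n


# Hypothetical group scores and posterior parameters.
loss = dgpo_style_loss(1.5, -0.5, [(6.0, 2.0)], [(2.0, 5.0)])
```

The contrastive term shrinks as the preferred group's aggregate score exceeds the dispreferred one by more than the margin, while the two penalties keep the consistency estimates close to their priors and discourage high-variance posteriors.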

This groupwise, direction-aware setup extends pairwise preference losses and provides a principled route to jointly optimizing reasoning consistency and pathway diversity.

Empirical Findings

Performance

Across five benchmarks (OpenAI Math 500, AIME-25, GPQA, Gaokao MathQA, and LMGH), the constructed reverse data yields a 3.2% average improvement, and DGPO delivers further average accuracy gains of up to 3.6% over baselines. On curated SFT and RL-aligned backbones (Qwen3-1.7B-Base and Qwen3-1.7B), DGPO outperforms DPO, SimPO, and GRPO variants, especially on out-of-distribution and reverse-inference tasks (e.g., AIME-25, GPQA), which indicates better generalization. Gains persist even with a strong starting model: DGPO raises average accuracy from 27.5% to 30.9% in the Qwen3-1.7B setting.
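The summary does not say whether its percentage gains are absolute points or relative changes. As a quick check on the one before/after pair that is reported, the snippet below computes both readings; the variable names are illustrative only.

```python
baseline_acc = 27.5  # reported average accuracy (%) before DGPO
dgpo_acc = 30.9      # reported average accuracy (%) after DGPO

absolute_gain = dgpo_acc - baseline_acc               # percentage points
relative_gain = 100.0 * absolute_gain / baseline_acc  # percent of baseline

print(f"absolute: {absolute_gain:.1f} points, relative: {relative_gain:.1f}%")
```

The difference is about 3.4 points, or roughly 12.4% relative to the baseline, so the headline figures of 3.2% and 3.6% are on the scale of absolute accuracy points.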

Ablation and Scaling

Ablations show that both directional consistency modeling and variance regularization matter: omitting either reduces average accuracy by 1.8–2.9%. Varying the number of reverse groups per problem shows that moderate augmentation benefits low-capacity models, while gains in larger models saturate after 1–2 reverse groups, suggesting that excessive group diversification may induce representational interference rather than synergy.

Critically, naively mixing forward and reverse data during SFT harms overall quality, reducing mean accuracy by up to 2.1%. This indicates that directional conflicts degrade learned representations unless they are managed through structured groupwise optimization.

Qualitative Analysis

Case studies underscore the main advantage of DGPO. Unlike vanilla DPO, which may select algebraically plausible but contextually invalid solutions, DGPO consistently filters out direction-incoherent answer chains, thereby delivering more robust reasoning pipelines and final outputs.

Implications and Future Directions

Practically, DGPO provides a lightweight, scalable methodology for enhancing reasoning alignment, especially pertinent for applications that require bidirectional inferential robustness (mathematical assistants, scientific LLMs, process-verification agents). Its groupwise, confidence-penalized preference optimization design is orthogonal to model scale and architecture, making it compatible with both base and RL-tuned models.

Theoretically, DGPO motivates a broader reevaluation of preference learning paradigms: from pairwise, direction-agnostic protocols to frameworks that treat reasoning directions and pathway diversity as first-class supervisory signals. However, DGPO's efficacy is partially contingent on the quality of reverse-problem construction; future work may target automated validation and expansion of reverse problem spaces, as well as extension to domains with even higher reasoning dimensionality (e.g., scientific discovery, formal theorem proving).

Another promising trajectory lies in integrating DGPO-style objectives with online RLHF loops and curriculum-based data synthesis, which could further reinforce adaptive alignment across complex multi-hop tasks.

Conclusion

DGPO advances preference-based LLM alignment by introducing a groupwise, direction-aware training protocol that explicitly models and regularizes directional consistency and intra-group uncertainty. The reported results indicate that this approach surpasses standard pairwise DPO and its recent variants and also stabilizes generalization on out-of-distribution, reasoning-intensive tasks. The findings support adopting structured, bidirectionally informed groupwise objectives as a general principle for future alignment and reasoning research in LLMs.
