Distributional Alignment Games for Answer-Level Fine-Tuning

Published 29 Apr 2026 in cs.LG and cs.GT | (2604.27166v1)

Abstract: We focus on the problem of Answer-Level Fine-Tuning (ALFT), where the goal is to optimize an LLM based on the correctness or properties of its final answers, rather than the specific reasoning traces used to produce them. Directly optimizing answer-level objectives is computationally intractable due to the need to marginalize over the vast space of latent reasoning paths. To overcome this, we propose a general game-theoretical framework that lifts the problem to a Distributional Alignment Game. We formulate ALFT as a two-player game between a Policy (the generator) and a Target (an auxiliary distribution). We prove that the Nash Equilibrium of this game corresponds exactly to the solution of the original answer-level optimization problem. This variational perspective transforms the intractable marginalization problem into a tractable projection problem. We demonstrate that this framework unifies recent approaches to diversity and self-improvement (coherence) and provide efficient algorithms compatible with Group Relative Policy Optimization (GRPO), such as Coherence-GRPO, yielding significant complexity gains in mathematical reasoning tasks.


Authors (3)

Summary

The paper presents a novel game-theoretic framework that transforms answer-level fine-tuning into a min-max optimization problem via Fenchel duality.
It leverages GRPO-based algorithms to update policies efficiently, achieving low variance and improved accuracy on benchmarks like GSM8K and TriviaQA.
The method unifies diverse objectives—including reward maximization, diversity, coherence, and safety—offering scalable and robust fine-tuning for large language models.

Distributional Alignment Games for Answer-Level Fine-Tuning: A Technical Overview

Motivation and Problem Formulation

The paper addresses the computationally challenging problem of Answer-Level Fine-Tuning (ALFT) for large language models (LLMs), where optimization targets the correctness and properties of final answers $z$ rather than the latent reasoning traces $y$ (e.g., chain-of-thought or intermediate steps). In domains like mathematical reasoning and code generation, the answer-level objective is preferable, but direct marginalization over latent paths is intractable. Standard alignment protocols, including SFT and DPO, operate on observed traces, whereas ALFT requires optimizing over the induced marginals of answers, a task complicated by the many-to-one mapping $y \mapsto z$ and the resulting combinatorial explosion.

The paper proposes a shift from direct marginalization to a variational, game-theoretical framework: a Distributional Alignment Game between a Policy $T$ (generator) and an auxiliary Target distribution $q$, leveraging Fenchel duality to decouple the marginalization problem. The Nash equilibrium of this game is shown to recover the solution to the original answer-level objective.

Game-Theoretic Framework and Theoretical Foundation

ALFT is formalized as a convex optimization problem, minimizing the expected loss on answer distributions with KL regularization to a reference policy. Fenchel duality transforms this into a strictly convex-concave min-max game involving the policy and target distribution. The key result is that the Nash equilibrium corresponds to the optimal answer-level policy, and approximate game equilibria yield approximate optimal solutions for ALFT.
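A compact way to see why duality helps, in notation assumed here for illustration rather than taken from the paper: for a single prompt $x$, write $\pi$ for the policy, $a(y)$ for the answer extracted from trace $y$, $p_\pi$ for the induced answer marginal, $R$ for a closed convex answer-level loss with convex conjugate $R^*$, and $\beta > 0$ for the KL weight. The paper's Target distribution plays the role of the dual variable $u$ below under its own parametrization.

```latex
\begin{align*}
p_\pi(z \mid x) &= \sum_{y \,:\, a(y) = z} \pi(y \mid x), \\
\min_{\pi}\; & R(p_\pi) + \beta\, \mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}}), \\
R(p_\pi) &= \sup_{u}\; \langle u, p_\pi \rangle - R^*(u)
          = \sup_{u}\; \mathbb{E}_{y \sim \pi}\big[u(a(y))\big] - R^*(u).
\end{align*}
```

The last identity is the Fenchel-Moreau representation. For a fixed $u$, the policy only sees an expectation over sampled traces of a per-answer score $u(a(y))$, so no explicit sum over reasoning paths is required.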

Distinct alignment goals are recoverable by instantiating different forms of the answer-level loss $R$:

Reward maximization: Target is the exponentiated reward distribution.
Supervised Fine-Tuning: Target is a Dirac (ground truth).
Diversity: Target maximizes entropy—leading to inverse-frequency rewards.
Safety/Fairness: Target is the information projection onto constraint sets.
Coherence: Consensus or centroid distributions enforce self-alignment under semantic transformations.
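Two of the targets listed above are simple enough to sketch directly. The following is a minimal illustration, not the paper's implementation: the function names, the use of a reference answer distribution `ref_probs` as base measure, and the temperature `beta` are all assumptions made for the example.

```python
import math

def exponentiated_reward_target(ref_probs, rewards, beta):
    """Target q(z) proportional to ref_probs[z] * exp(rewards[z] / beta).

    ref_probs: dict answer -> strictly positive reference probability (assumed).
    rewards:   dict answer -> scalar reward.
    beta:      positive temperature (assumed to match the KL weight).
    """
    logits = {z: math.log(ref_probs[z]) + rewards[z] / beta for z in ref_probs}
    peak = max(logits.values())  # subtract the max for numerical stability
    weights = {z: math.exp(v - peak) for z, v in logits.items()}
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}

def dirac_target(answers, ground_truth):
    """Point mass on the ground-truth answer (the supervised case)."""
    return {z: 1.0 if z == ground_truth else 0.0 for z in answers}

q = exponentiated_reward_target({"a": 0.5, "b": 0.5}, {"a": 1.0, "b": 0.0}, beta=1.0)
d = dirac_target(["a", "b"], "b")
```

As `beta` shrinks, the exponentiated target concentrates on the highest-reward answer, which is one way to see the Dirac target as a limiting case.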

Algorithmic Contributions: GRPO-Based Policy Optimization

The framework leads to efficient, practical algorithms by integrating Group Relative Policy Optimization (GRPO)—a policy gradient method with group-based variance reduction. The algorithms alternate between:

Target Step: Estimating optimal $q^*$ from the group (majority, arithmetic mean, geometric mean, etc.)
Policy Step: Updating the policy with reward signals derived from $q^*$.
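The alternation can be sketched for a single group of sampled answers. This is a hedged toy version: it assumes the majority-vote target and an indicator reward, uses the standard GRPO-style normalization (reward minus group mean, divided by group standard deviation), and omits the actual policy-gradient update. All function names are hypothetical.

```python
from collections import Counter
import math

def target_step_mode(answers):
    """Target step (assumed variant): point mass on the group's majority answer."""
    mode, _ = Counter(answers).most_common(1)[0]
    return mode

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: center and scale rewards within the group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def one_round(answers):
    """One Target step followed by the reward signal for the Policy step."""
    target = target_step_mode(answers)
    rewards = [1.0 if a == target else 0.0 for a in answers]
    return target, group_relative_advantages(rewards)

target, advantages = one_round(["4", "4", "5", "4"])
```

In a training loop, each advantage would weight the log-probability gradient of the trace that produced the corresponding answer.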

Key algorithmic variants include:

GAME-GRPO: Core method for generic answer-level fine-tuning.
DIVERSITY-GRPO: Promotes diversity using empirical marginal estimates; theoretically grounds the inverse-frequency reward [Li et al., 2025].
COHERENCE-GRPO: Enforces orbit-level coherence for self-improvement, using consensus targets (mode or arithmetic mean).
PAIRWISE-GRPO: Adapts to open-ended domains using pairwise semantic disagreement metrics.
SAFETY-GRPO: Enforces distributional constraints (e.g., safety/fairness) via Lagrange multipliers and information projections.

A crucial insight is that GRPO's group-based advantage calculation directly mirrors the Nash equilibrium gradient, yielding robust low-variance estimators even in high-heterogeneity environments. The connection between distributional alignment games and DPO is formalized, with the optimal target $q^*$ corresponding to exponentiated reward distributions.

Diversity and Coherence: Unified, Rigorous Derivations

The framework provides rigorous theoretical justification for previously heuristic methods:

Inverse-frequency rewards for diversity are shown to be optimal for entropy maximization in the alignment game.
Alternative diversity objectives (e.g., Gini index) are derived, offering bounded reward scaling and improved optimization stability.
Coherence/self-improvement is formalized via orbit-level consensus, with provable monotonic improvement when projecting onto coherent policies. The paper analyzes geometric vs. arithmetic mean consensus, providing stability guarantees using general squared Hellinger distance.
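The two diversity rewards mentioned above can be illustrated on a group of sampled answers. The exact functional forms below are assumptions for the sake of the example: the inverse-frequency reward is taken as one over the empirical frequency, and the Gini-style reward as one minus the empirical frequency, which is what differentiating the Gini index $1 - \sum_z p(z)^2$ suggests up to an affine rescaling.

```python
from collections import Counter

def empirical_marginal(answers):
    """Empirical answer distribution within one sampled group."""
    counts = Counter(answers)
    n = len(answers)
    return {z: c / n for z, c in counts.items()}

def inverse_frequency_rewards(answers):
    """Assumed form: reward 1 / p_hat(z); unbounded as answers get rarer."""
    p_hat = empirical_marginal(answers)
    return [1.0 / p_hat[a] for a in answers]

def gini_style_rewards(answers):
    """Assumed form: reward 1 - p_hat(z); always lies in [0, 1)."""
    p_hat = empirical_marginal(answers)
    return [1.0 - p_hat[a] for a in answers]

group = ["a", "a", "a", "b"]
inv = inverse_frequency_rewards(group)
gini = gini_style_rewards(group)
```

Both rewards favor the rare answer `"b"`, but only the Gini-style reward stays bounded as the group size grows, which is consistent with the stability argument above.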

The consensus mechanism is tractably implemented using the majority vote (mode) or arithmetic mean, with bounded degradation from the ideal geometric mean in practice.
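To make the three consensus choices concrete, the sketch below computes them for an orbit represented as a list of answer distributions, one per semantically transformed prompt. That representation, the function names, and the probability floor used to keep logarithms finite are assumptions of this example.

```python
import math

def arithmetic_mean(dists):
    """Average the answer probabilities across the orbit."""
    keys = set().union(*dists)
    n = len(dists)
    return {z: sum(d.get(z, 0.0) for d in dists) / n for z in keys}

def geometric_mean(dists, floor=1e-12):
    """Normalized geometric mean; `floor` guards against log(0)."""
    keys = set().union(*dists)
    n = len(dists)
    logs = {z: sum(math.log(max(d.get(z, 0.0), floor)) for d in dists) / n
            for z in keys}
    peak = max(logs.values())
    weights = {z: math.exp(v - peak) for z, v in logs.items()}
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}

def mode_target(dists):
    """Majority-style target: the answer with the largest average probability."""
    mean = arithmetic_mean(dists)
    return max(mean, key=mean.get)

orbit = [{"a": 0.8, "b": 0.2}, {"a": 0.5, "b": 0.5}]
am = arithmetic_mean(orbit)
gm = geometric_mean(orbit)
top = mode_target(orbit)
```

On this orbit the arithmetic mean puts 0.65 on `"a"` and the geometric mean about 0.667, so the two targets agree on the mode and differ only slightly in sharpness.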

Pairwise Coherence and Open-Ended Alignment

For domains where discrete answer extraction is infeasible, the pairwise disagreement formulation replaces canonical answers with semantic distance metrics, allowing answer-level alignment to be generalized. The PAIRWISE-GRPO algorithm penalizes incoherence across orbits, minimizing expected pairwise semantic divergence. The reward signal is the orbit centrality, efficiently computable even when answer extraction is brittle.
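A minimal sketch of a centrality-style reward follows, assuming the reward for a response is the negative mean distance to the other responses in its orbit. The token-level Jaccard distance is only a stand-in for a real semantic metric (for example, an embedding-based one), and both function names are hypothetical.

```python
def jaccard_distance(a, b):
    """Stand-in semantic distance: 1 - Jaccard overlap of lowercase tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def centrality_rewards(responses, distance):
    """Reward each response by its negative mean distance to the others."""
    n = len(responses)
    rewards = []
    for i in range(n):
        others = [distance(responses[i], responses[j]) for j in range(n) if j != i]
        rewards.append(-sum(others) / len(others) if others else 0.0)
    return rewards

responses = ["paris france", "paris france", "london uk"]
rewards = centrality_rewards(responses, jaccard_distance)
```

The outlier response receives the lowest reward without any canonical answer ever being extracted, which is the property that makes this formulation usable in open-ended domains.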

Distributional Constraints: Safety and Fairness

Distributional constraints (e.g., toxicity, demographic parity) are integrated into the framework via constraint sets on answer-level marginals. The Nash equilibrium is achieved by projecting onto safe or fair distributions using dual-ascent on Lagrange multipliers. SAFETY-GRPO alternates between policy updates and penalty weight adjustments, yielding theoretically justified constrained RL.
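The alternation between policy updates and penalty adjustments can be sketched as projected dual ascent on a single multiplier. This is a generic illustration rather than the paper's algorithm: the constraint is assumed to be a bound `budget` on the violation rate, and `step_size` and the function names are hypothetical.

```python
def dual_ascent_step(lam, violation_rate, budget, step_size):
    """Raise the multiplier when the constraint is violated; never go below zero."""
    return max(0.0, lam + step_size * (violation_rate - budget))

def penalized_rewards(task_rewards, violations, lam):
    """Reward passed to the policy step: task reward minus weighted violation."""
    return [r - lam * v for r, v in zip(task_rewards, violations)]

lam = 0.0
# Observed violation rate 0.3 exceeds the budget 0.1, so the penalty grows.
lam = dual_ascent_step(lam, violation_rate=0.3, budget=0.1, step_size=0.5)
shaped = penalized_rewards([1.0, 1.0], [0.0, 1.0], lam)
```

Once the violation rate falls below the budget, the same update shrinks the multiplier back toward zero, so the penalty is only active while the constraint binds.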

Empirical Results and Numerical Findings

The framework is validated in controlled synthetic environments and on LLMs (Qwen2.5-3B-Instruct, Phi-3-mini-4k-instruct, Llama-3.2-3B-Instruct) using the GSM8K and TriviaQA datasets. Notable results include:

GSM8K: PAIRWISE-GRPO and COHERENCE-GRPO improved accuracy by +3.18 to +9.18 percentage points, up to +12.46% relative.
TriviaQA: PAIRWISE-GRPO yielded up to +42.06% relative EM and +18.12% relative F1 improvement, demonstrating substantial answer-level gains.
Variance reduction: GAME-GRPO converged rapidly with negligible variance under high signal-to-bias environments, outperforming REINFORCE with global baselines.

The mode target offers low-entropy sharpening for decisive tasks, with bounded degradation from the ideal geometric mean as group diversity increases.

Implications and Future Directions

The Distributional Alignment Game framework unifies disparate alignment objectives, replacing ad-hoc reward engineering with principled optimization grounded in convex duality. It enables scalable, efficient answer-level fine-tuning for reasoning-intensive tasks, with practical benefits in variance reduction and sample efficiency. The inclusion of safety and fairness constraints makes the framework adaptable to real-world deployment scenarios.

Theoretical implications include tighter connections with duality in RL, information projections, and variational EM. Practically, the framework opens avenues for self-improving models, robust reasoning alignment, and safe/fair deployment in open-ended settings. Future work may explore richer consensus metrics, generalized divergences, online adaptive games, and integration with generative model pretraining.

Conclusion

Distributional Alignment Games provide a rigorous, tractable resolution to the computational bottleneck of answer-level optimization. By reframing ALFT as a min-max game between Policy and Target, the approach subsumes diversity, coherence, and constraint satisfaction under a unified theoretical umbrella. The resulting GRPO-based algorithms are both scalable and robust, achieving strong empirical improvements on reasoning benchmarks and offering provable guarantees on monotonic self-improvement and variance reduction (2604.27166).
