GAPO: Group-Aware Policy Optimization

Updated 23 November 2025
  • GAPO is a reinforcement learning framework that extends group-based advantage methods to optimize diversity, coverage, and uniformity of generated responses.
  • It replaces scalar per-sample rewards with group-dependent, frequency-aware reward vectors to directly mitigate mode collapse and boost output variation.
  • Empirical results demonstrate GAPO's effectiveness in improving metrics like JSD, Unique@N, and 1-Self-BLEU while balancing diversity with baseline accuracy.

Group-Aware Policy Optimization (GAPO) is a reinforcement learning framework for training LLMs that generalizes the group-based advantage methodology of Group Relative Policy Optimization (GRPO) to enable optimization over group-level properties such as diversity, coverage, and uniformity of generated responses. GAPO replaces scalar, per-sample rewards with group-dependent reward vectors, facilitating direct optimization for output diversity and mitigating mode collapse in generative models. By leveraging a frequency-aware reward function, GAPO achieves greater uniformity and diversity in model outputs without sacrificing baseline accuracy, and extends naturally to both closed-set and open-ended tasks in LLM generation scenarios (Anschel et al., 16 Nov 2025).

1. Formalization and Objective

GAPO is formulated as an extension of GRPO, substituting the standard per-output reward $r_i$ with a group-level reward $\tilde R(\mathbf o)_i$ for each candidate in a group $\mathbf o = \{o_1, \ldots, o_G\}$ of $G$ model outputs. The group-aware reward depends on all $G$ outputs jointly and is designed to capture properties like frequency or diversity among completions.

Notation:

  • $\theta$: current policy parameters
  • $\pi_\theta(o_{i,t} \mid q, o_{i,<t})$: policy probability for token $t$ of candidate $i$
  • $\mathbf o = \{o_1, \ldots, o_G\}$: group of rollouts
  • $\pi_{\theta_\text{old}}$: policy used to generate $\mathbf o$
  • $\pi_\text{ref}$: fixed reference policy for KL penalty
  • $\epsilon$: PPO clipping hyperparameter
  • $\beta$: KL-penalty coefficient

Importance-sampling ratios:

$$\rho_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t} \mid q, o_{i,<t})}$$

Group-normalized rewards and advantages:

$$r_i = \tilde R(\mathbf o)_i, \quad \bar r = \frac{1}{G}\sum_{j=1}^G r_j, \quad \sigma_r = \sqrt{\frac{1}{G}\sum_{j=1}^G (r_j - \bar r)^2}$$

$$\hat A_{i,t} = \frac{r_i - \bar r}{\sigma_r}$$

Clipped surrogate loss:

$$\mathcal{L}_\mathrm{clip}(\theta) = \frac{1}{G}\sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\left[ \rho_{i,t}(\theta)\hat A_{i,t},\, \mathrm{clip}(\rho_{i,t}(\theta), 1-\epsilon, 1+\epsilon)\hat A_{i,t} \right]$$

Full objective:

$$J_\mathrm{GAPO}(\theta) = \mathcal{L}_\mathrm{clip}(\theta) - \beta \, D_{KL}[\pi_\theta \,\|\, \pi_\text{ref}]$$

This general formulation provides the substrate for implementing group-consistent learning. When $\tilde R$ is chosen as a frequency-aware diversity reward, GAPO directly optimizes for uniform output probabilities and diversity, outperforming standard RL fine-tuning and conventional supervised methods (Anschel et al., 16 Nov 2025).
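The group normalization and clipped surrogate defined in this section can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the `eps` guard against a zero standard deviation, and the default `clip_eps=0.2` are assumptions, and the KL term is omitted (it would be subtracted with weight $\beta$).

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Normalize one group's rewards to zero mean and unit (population) std."""
    r = np.asarray(rewards, dtype=float)
    # eps is an added guard (assumption) for groups where all rewards are equal.
    return (r - r.mean()) / (r.std() + eps)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """L_clip for one group.

    logp_new, logp_old: lists of 1-D arrays, one per rollout, holding the
    per-token log-probabilities under pi_theta and pi_theta_old.
    advantages: one scalar per rollout, shared by all of its tokens.
    """
    per_rollout = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = np.exp(np.asarray(lp_new) - np.asarray(lp_old))  # rho_{i,t}
        clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        token_terms = np.minimum(ratio * adv, clipped * adv)
        per_rollout.append(token_terms.mean())  # (1/|o_i|) * sum over tokens
    return float(np.mean(per_rollout))          # (1/G) * sum over rollouts
```

When the rollouts are scored under the same policy that generated them, every ratio equals 1 and the surrogate reduces to the mean of the advantages, which is zero after normalization; the training signal comes from its gradient, not its value.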

2. Frequency-Aware Group-Level Reward Mechanism

A key application of GAPO is for frequency-based diversity optimization over a known valid answer set $\mathcal V = \{v_1, \ldots, v_L\}$. Empirical frequency for $v \in \mathcal V$ is defined as:

$$f_v(\mathbf o) = \frac{\sum_{i=1}^G \mathbf{1}\{o_i = v\}}{\sum_{i=1}^G \mathbf{1}\{o_i \in \mathcal V\}}$$

Targeting uniformity $u_v = 1/L$, the per-output reward is:

$$\tilde R(\mathbf o)_i = \begin{cases} 1 - \left(f_{o_i} - \tfrac{1}{L}\right), & o_i \in \mathcal V \\ -1, & o_i \notin \mathcal V \end{cases}$$

  • Over-represented outputs ($f_{o_i} > 1/L$) are penalized.
  • Under-represented outputs ($f_{o_i} < 1/L$) are rewarded.
  • Invalid outputs receive reward $-1$.

This frequency penalty mechanism enforces uniform sampling, directly combating mode collapse and promoting response diversity (Anschel et al., 16 Nov 2025).
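As a concrete illustration of the reward defined above, the following sketch computes it for one group of string outputs. The function name and the example answer set are hypothetical; exact string matching stands in for whatever answer extraction a real pipeline would use.

```python
from collections import Counter
import numpy as np

def frequency_aware_rewards(outputs, valid_set):
    """Return one reward per output: 1 - (f_{o_i} - 1/L) if valid, else -1."""
    valid = set(valid_set)
    L = len(valid)
    counts = Counter(o for o in outputs if o in valid)
    n_valid = sum(counts.values())  # denominator of f_v: number of valid outputs
    rewards = []
    for o in outputs:
        if o in valid:
            f = counts[o] / n_valid
            rewards.append(1.0 - (f - 1.0 / L))
        else:
            rewards.append(-1.0)
    return np.array(rewards)

# Hypothetical group of G = 5 rollouts over the valid set {"a", "b"} (L = 2).
group = ["a", "a", "a", "b", "zzz"]
print(frequency_aware_rewards(group, {"a", "b"}))
# → [ 0.75  0.75  0.75  1.25 -1.  ]
```

Here "a" has frequency 3/4 among the four valid outputs and is pushed below 1, "b" has frequency 1/4 and is pushed above 1, and the invalid output receives $-1$.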

3. Training Workflow and Pseudocode

GAPO adopts a group-based rollout approach in which, for each prompt $q$, $G$ completions are sampled to form a group. Pseudocode for a single training iteration is as follows:

  1. Sample a batch of prompts $q^{(b)}$.
  2. For each $q^{(b)}$, generate $G$ rollouts $\mathbf o^{(b)} = \{o_1^{(b)}, \ldots, o_G^{(b)}\}$ using $\pi_{\theta_\text{old}}$.
  3. For each group $\mathbf o^{(b)}$:
    • Compute empirical frequencies $f_v(\mathbf o^{(b)})$ for $v \in \mathcal V$.
    • Calculate group-aware rewards $r_i^{(b)} = \tilde R(\mathbf o^{(b)})_i$.
    • Compute the group mean $\bar r$, standard deviation $\sigma_r$, and advantages $\hat A_{i,t}$.
  4. Aggregate the clipped surrogate loss $\mathcal{L}_\mathrm{clip}(\theta)$ using $\rho_{i,t}(\theta)$.
  5. Evaluate $J_\mathrm{GAPO}(\theta)$.
  6. Perform a gradient update on $\theta$.
  7. Update $\theta_{\mathrm{old}} \leftarrow \theta$ (Anschel et al., 16 Nov 2025).

Recommended hyperparameters include group size $G = 32$ for stable frequency estimation, and, for diversity tasks, a zero KL penalty ($\beta = 0$).
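The workflow above can be made concrete with a toy stand-in for the LLM. The sketch below assumes a tabular softmax policy over single-token answers (four valid answers plus one invalid index), one gradient step per group, and $\beta = 0$. Under these assumptions every importance ratio equals 1, the clip is inactive, and the gradient of the surrogate is the advantage-weighted gradient of the log-probabilities. The names `rewards_for` and `gapo_step`, the learning rate, and the starting logits are illustrative, not from the source.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rewards_for(outputs, num_valid):
    """Frequency-aware rewards; indices >= num_valid count as invalid."""
    outputs = np.asarray(outputs)
    is_valid = outputs < num_valid
    counts = np.bincount(outputs[is_valid], minlength=num_valid)
    n_valid = max(int(is_valid.sum()), 1)
    freq = counts / n_valid
    r = np.full(len(outputs), -1.0)
    r[is_valid] = 1.0 - (freq[outputs[is_valid]] - 1.0 / num_valid)
    return r

def gapo_step(logits, outputs, num_valid, lr=0.5, eps=1e-8):
    """One on-policy update from a group of sampled answer indices."""
    outputs = np.asarray(outputs)
    r = rewards_for(outputs, num_valid)
    adv = (r - r.mean()) / (r.std() + eps)  # group-normalized advantages
    p = softmax(logits)
    grad = np.zeros_like(logits)
    for o, a in zip(outputs, adv):
        g = -p.copy()
        g[o] += 1.0        # d log pi(o) / d logits = onehot(o) - p
        grad += a * g
    grad /= len(outputs)
    return logits + lr * grad  # gradient ascent on the surrogate

rng = np.random.default_rng(0)
num_valid, G = 4, 32
logits = np.array([2.0, 0.0, 0.0, 0.0, 0.0])  # starts concentrated on answer 0
for _ in range(200):
    # Steps 1-2: sample a group of G rollouts from the current (old) policy.
    group = rng.choice(len(logits), size=G, p=softmax(logits))
    # Steps 3-7: rewards, advantages, gradient update, theta_old <- theta.
    logits = gapo_step(logits, group, num_valid)
print(np.round(softmax(logits), 3))  # final policy probabilities, for inspection
```

With an LLM policy, the per-token ratios and the clip from the objective would replace the closed-form gradient used here, and several optimizer steps could be taken per batch of rollouts.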

4. Evaluation Metrics for Diversity and Coverage

GAPO’s impact on LLM response diversity is quantified using several established metrics:

| Metric | Definition | Significance |
|---|---|---|
| Jensen–Shannon Divergence | $\mathrm{JSD}(P \,\Vert\, U) = \tfrac{1}{2} D_{KL}\left(P \,\Vert\, \tfrac{P+U}{2}\right) + \tfrac{1}{2} D_{KL}\left(U \,\Vert\, \tfrac{P+U}{2}\right)$ | Uniformity to target |
| Unique@N | Number of distinct completions in $N$ samples | Output coverage |
| Semantic Diversity | $\frac{2}{N(N-1)} \sum_{i<j}\left[1-\cos(e_i,e_j)\right]$ (embedding space) | Diversity in meaning |
| Lexical Diversity | $1$-Self-BLEU (lower BLEU = higher lexical diversity) | Surface variation |

Performance on list sampling, open-ended prompts, and creative writing tasks is reported using these measures, with GAPO delivering JSD reductions ($<0.10$ vs. $>0.30$ for baselines), an over $6\times$ increase in Unique@500 for open sets, and up to a $160\%$ average increase in semantic diversity in creative tasks (Anschel et al., 16 Nov 2025).
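Three of the metrics in the table can be sketched directly in NumPy. The function names are hypothetical, and the logarithm base for JSD (natural log here) is an assumption, since the source does not state which base its reported values use. 1-Self-BLEU is left out because it depends on a separate BLEU implementation.

```python
import numpy as np

def jsd_to_uniform(counts):
    """Jensen-Shannon divergence (in nats) between empirical P and uniform U."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    u = np.full_like(p, 1.0 / len(p))
    m = 0.5 * (p + u)

    def kl(a, b):
        mask = a > 0  # terms with a = 0 contribute 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(u, m)

def unique_at_n(completions):
    """Number of distinct completions among the N samples given."""
    return len(set(completions))

def semantic_diversity(embeddings):
    """Mean pairwise cosine distance over all pairs i < j."""
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    sim = e @ e.T
    i, j = np.triu_indices(len(e), k=1)
    return float(np.mean(1.0 - sim[i, j]))

print(jsd_to_uniform([5, 5, 5, 5]))             # uniform counts give 0.0
print(unique_at_n(["red", "blue", "red"]))      # 2 distinct completions
print(semantic_diversity([[1, 0], [0, 1]]))     # orthogonal embeddings give 1.0
```

`jsd_to_uniform` takes per-answer counts over the full valid set, so answers that were never sampled must be passed as zeros to count against uniformity.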

5. Empirical Results and Trade-offs

Experiments applying GAPO to Qwen2.5 models (7B and 32B) reveal the following:

  • In closed-set sampling, GAPO achieves JSD $<0.10$, outperforming baselines such as ChatGPT-4o, Claude, and Gemini models, which remain above $0.30$.
  • For open-ended prompts, Unique@500 improves from $\sim 24$ to $\sim 147$.
  • In creative writing, semantic diversity (embedding distance) and lexical diversity (1-Self-BLEU) increase by 160% and 75%, respectively.
  • On code and reasoning benchmarks (GSM8K, MATH, HumanEval, MMLU-Pro), GAPO maintains or slightly improves flexible scoring, e.g., GSM8K flexible match rises from $0.865$ to $0.905$, while exact match may decrease slightly, from $0.835$ to $0.772$.
  • Across sampling temperatures, GAPO consistently yields higher creativity at a given accuracy, dominating baseline on the creativity–coherence spectrum.

A plausible implication is that the tradeoff between diversity and correctness can be tuned via the reward structure and temperature, with GAPO outperforming in diversity at equal or better accuracy (Anschel et al., 16 Nov 2025).

6. Generalization, Limitations, and Best Practices

GAPO demonstrates notable generalization: although trained solely on synthetic list-sampling, it confers diversity benefits on completely unseen, open-ended prompts across diverse domains. Ablations show that vanilla SFT over all valid completions may achieve in-distribution uniformity but fails to generalize, with Unique@500 plummeting to $3$.

Current limitations include the dependency of frequency-based rewards on an explicit finite valid answer set $\mathcal V$. Extending GAPO to truly unbounded sets requires alternative group-level reward functions. Mode collapse mitigation may also introduce increased risk of unsafe generations, necessitating continued application of safety filters.

Implementation best practices, as recommended by the source, are:

  • Use sufficiently large group sizes ($G = 32$) for accurate frequency estimation.
  • Omit the KL penalty ($\beta = 0$) for diversity-focused tuning; reintroduce it if accuracy must be strictly maintained.
  • Normalize advantages using group mean and standard deviation for gradient stability.
  • Consider entropy regularization for further diversity, and systematically validate that increased diversity does not reduce downstream accuracy (Anschel et al., 16 Nov 2025).

7. Relationship to Prior Work and Extensions

GAPO builds upon and generalizes Group Relative Policy Optimization (GRPO) [see (Yu et al., 12 Sep 2025)], which uses group-relative advantages based on scalar per-sample rewards. In S-GRPO, for example, the reward is a weighted combination of binary compilation, security, and format tests, and the group-based structure supplies refined, prompt-localized learning signals for policy optimization (Yu et al., 12 Sep 2025).

Whereas S-GRPO (and other GRPO derivatives) aim to optimize multiple orthogonal constraints for tasks such as secure code synthesis, GAPO focuses on aggregate properties of the group, such as output diversity, by making reward assignment itself a group-level computation. This enables effective mitigation of mode collapse and provides a versatile template for interventions targeting properties emergent at the group level, including, but not limited to, diversity, coverage, and fairness (Anschel et al., 16 Nov 2025).

The framework is directly applicable to a variety of generative tasks requiring model output diversity, and generalizes to any reward that can be defined as a symmetric group-level function of output samples. Further extensions to settings with unbounded answer sets or to full-model RL fine-tuning remain areas for future investigation.
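One way to read "symmetric group-level function" is that the reward vector should be permutation-equivariant: reordering the rollouts in a group reorders their rewards the same way and changes nothing else. The sketch below checks this property numerically for the frequency-aware reward. This reading of "symmetric", and the helper names, are assumptions rather than definitions from the source.

```python
from collections import Counter
import random

def frequency_reward(outputs, valid_set):
    """Frequency-aware reward: 1 - (f - 1/L) for valid outputs, -1 otherwise."""
    valid = set(valid_set)
    counts = Counter(o for o in outputs if o in valid)
    n_valid = sum(counts.values())
    return [
        1.0 - (counts[o] / n_valid - 1.0 / len(valid)) if o in valid else -1.0
        for o in outputs
    ]

def is_permutation_equivariant(reward_fn, outputs, trials=20, seed=0):
    """Check that shuffling the group shuffles the rewards the same way."""
    rng = random.Random(seed)
    base = reward_fn(outputs)
    for _ in range(trials):
        perm = list(range(len(outputs)))
        rng.shuffle(perm)
        shuffled = reward_fn([outputs[k] for k in perm])
        # Position n of the shuffled group holds original item perm[n].
        if any(abs(shuffled[n] - base[k]) > 1e-12 for n, k in enumerate(perm)):
            return False
    return True

group = ["a", "a", "b", "x"]
print(is_permutation_equivariant(lambda o: frequency_reward(o, {"a", "b"}), group))
# → True
```

A candidate group-level reward that fails this check depends on the order in which rollouts happened to be sampled, which is usually not a property one wants to optimize.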
