GAPO: Group-Aware Policy Optimization
- GAPO is a reinforcement learning framework that extends group-based advantage methods to optimize diversity, coverage, and uniformity of generated responses.
- It replaces scalar per-sample rewards with group-dependent, frequency-aware reward vectors to directly mitigate mode collapse and boost output variation.
- Empirical results demonstrate GAPO's effectiveness in improving metrics like JSD, Unique@N, and 1-Self-BLEU while balancing diversity with baseline accuracy.
Group-Aware Policy Optimization (GAPO) is a reinforcement learning framework for training LLMs that generalizes the group-based advantage methodology of Group Relative Policy Optimization (GRPO) to enable optimization over group-level properties such as diversity, coverage, and uniformity of generated responses. GAPO replaces scalar, per-sample rewards with group-dependent reward vectors, facilitating direct optimization for output diversity and mitigating mode collapse in generative models. By leveraging a frequency-aware reward function, GAPO achieves greater uniformity and diversity in model outputs without sacrificing baseline accuracy, and extends naturally to both closed-set and open-ended tasks in LLM generation scenarios (Anschel et al., 16 Nov 2025).
1. Formalization and Objective
GAPO is formulated as an extension of GRPO that replaces the standard per-output reward with a group-level reward computed for each candidate in a group of model outputs. The group-aware reward depends on all outputs in the group jointly and is designed to capture properties such as frequency or diversity among completions.
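Notation (following standard GRPO conventions):
- $\theta$: current policy parameters
- $\pi_\theta(o_{i,t} \mid q, o_{i,<t})$: policy probability for token $t$ of candidate $i$, given prompt $q$
- $\{o_i\}_{i=1}^{G}$: group of $G$ rollouts
- $\pi_{\theta_{\text{old}}}$: policy used to generate the rollouts
- $\pi_{\text{ref}}$: fixed reference policy for KL penalty
- $\epsilon$: PPO clipping hyperparameter
- $\beta$: KL-penalty coefficient
Importance-sampling ratios:
$$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}$$
Group-normalized rewards and advantages, where $R_i = R(o_i;\, o_1, \dots, o_G)$ is the group-aware reward of candidate $i$:
$$\mu = \frac{1}{G}\sum_{j=1}^{G} R_j, \qquad \sigma = \sqrt{\frac{1}{G}\sum_{j=1}^{G} (R_j - \mu)^2}, \qquad \hat{A}_i = \frac{R_i - \mu}{\sigma}$$
Clipped surrogate loss:
$$L^{\text{clip}}(\theta) = \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\!\Big(r_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i\Big)$$
Full objective:
$$J(\theta) = \mathbb{E}\big[L^{\text{clip}}(\theta)\big] - \beta\, D_{\text{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)$$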
This general formulation provides the substrate for implementing group-consistent learning. When the group-level reward is chosen as a frequency-aware diversity reward, GAPO directly optimizes for uniform output probabilities and diversity, outperforming standard RL fine-tuning and conventional supervised methods (Anschel et al., 16 Nov 2025).
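The group normalization and clipped surrogate described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the source's implementation: the function names `group_advantages` and `clipped_surrogate`, the default `clip_eps=0.2`, and the choice to average over tokens and then over candidates are all hypothetical, and the KL term is left out.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Normalize per-candidate rewards within one group (zero mean, unit std)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate, averaged over tokens and then candidates.

    logp_new, logp_old: one 1-D array of token log-probabilities per candidate.
    advantages: one scalar per candidate, shared by all of its tokens.
    """
    per_candidate = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Importance-sampling ratio between the current and rollout policies.
        ratio = np.exp(np.asarray(lp_new, dtype=float) - np.asarray(lp_old, dtype=float))
        clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        per_candidate.append(np.minimum(ratio * adv, clipped * adv).mean())
    return float(np.mean(per_candidate))

# Usage: four candidates with hypothetical group-level rewards.
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
logp = [np.array([-1.0, -2.0]) for _ in range(4)]
print(adv, clipped_surrogate(logp, logp, adv))
```

Because the advantage is shared across all tokens of a candidate, any group-level reward can be plugged in without changing the surrogate itself.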
2. Frequency-Aware Group-Level Reward Mechanism
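A key application of GAPO is frequency-based diversity optimization over a known valid answer set $\mathcal{V}$. For a group of $G$ outputs $o_1, \dots, o_G$, the empirical frequency of an answer $v \in \mathcal{V}$ is $f(v) = \frac{1}{G}\sum_{i=1}^{G} \mathbb{1}[o_i = v]$. The target is the uniform frequency $1/|\mathcal{V}|$, and the per-output reward depends on how $f(o_i)$ compares with that target:
- Over-represented outputs ($f(o_i) > 1/|\mathcal{V}|$) are penalized.
- Under-represented outputs ($f(o_i) < 1/|\mathcal{V}|$) are rewarded.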
- Invalid outputs receive reward .
This frequency penalty mechanism enforces uniform sampling, directly combating mode collapse and promoting response diversity (Anschel et al., 16 Nov 2025).
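To make the mechanism concrete, the sketch below computes one possible reward with the properties listed above. The exact reward formula is not reproduced here, so the difference form (uniform target minus empirical frequency), the function name `frequency_rewards`, and the `invalid_reward=-1.0` default are assumptions chosen only for illustration.

```python
from collections import Counter

def frequency_rewards(outputs, valid_set, invalid_reward=-1.0):
    """Illustrative group-level reward: uniform target minus empirical frequency.

    The difference form and the invalid_reward value are assumptions made for
    this example, not the formula from the source.
    """
    group_size = len(outputs)
    counts = Counter(o for o in outputs if o in valid_set)
    target = 1.0 / len(valid_set)
    rewards = []
    for o in outputs:
        if o not in valid_set:
            rewards.append(invalid_reward)  # invalid answer: fixed penalty
        else:
            # Negative when over-represented, positive when under-represented.
            rewards.append(target - counts[o] / group_size)
    return rewards

# Usage: "red" dominates the group, "purple" is not a valid answer.
group = ["red", "red", "red", "blue", "green", "purple"]
print(frequency_rewards(group, {"red", "blue", "green"}))
```

Note that the reward of each output depends on the whole group: the same answer earns a different reward depending on how often its siblings repeat it.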
3. Training Workflow and Pseudocode
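GAPO adopts a group-based rollout approach in which, for each prompt, several completions are sampled to form a group. A single training iteration proceeds as follows:
- Sample a batch of prompts.
- For each prompt, generate a group of rollouts using the current rollout policy.
- For each group:
  - Compute the empirical frequency of each output within the group.
  - Calculate the group-aware rewards.
  - Compute the group mean, standard deviation, and normalized advantages.
  - Aggregate the clipped surrogate loss using the importance-sampling ratios and advantages.
- Evaluate the full objective, including the KL penalty where it is enabled.
- Perform a gradient update on the policy parameters.
- Update the rollout policy to the newly updated policy (Anschel et al., 16 Nov 2025).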
Recommended hyperparameters include a sufficiently large group size for stable frequency estimation and, for diversity tasks, a KL-penalty coefficient of zero.
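The iteration above can be illustrated on a toy problem: a softmax policy over a handful of valid answers, started from a collapsed state. This is only a sketch under strong assumptions, not the source's training setup. The reward form is the same hypothetical "uniform target minus empirical frequency", the update is a single on-policy gradient step (so the importance ratio is 1 and clipping never triggers), the KL term is omitted, and `train_toy` and its defaults are invented for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

def train_toy(num_answers=4, group_size=32, steps=300, lr=0.5, seed=0):
    """Run group-based updates on a categorical toy policy."""
    rng = np.random.default_rng(seed)
    logits = np.zeros(num_answers)
    logits[0] = 3.0  # start from a collapsed policy favoring answer 0
    initial = softmax(logits)
    for _ in range(steps):
        probs = softmax(logits)
        group = rng.choice(num_answers, size=group_size, p=probs)  # rollouts
        freq = np.bincount(group, minlength=num_answers) / group_size
        rewards = 1.0 / num_answers - freq[group]  # hypothetical reward form
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        # Gradient of log-softmax w.r.t. logits is (one-hot - probs).
        onehot = np.eye(num_answers)[group]
        grad = (adv[:, None] * (onehot - probs)).mean(axis=0)
        logits += lr * grad  # gradient ascent on the advantage-weighted objective
    return initial, softmax(logits)

initial, final = train_toy()
print("entropy before:", entropy(initial), "after:", entropy(final))
```

In this toy setting the over-sampled answer receives negative advantages and the rarely sampled ones positive advantages, which pushes the policy toward a flatter distribution.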
4. Evaluation Metrics for Diversity and Coverage
GAPO’s impact on LLM response diversity is quantified using several established metrics:
| Metric | Definition | Significance |
|---|---|---|
| Jensen–Shannon Divergence (JSD) | Divergence between the empirical output distribution and the uniform target distribution | Uniformity to target |
| Unique@N | Number of distinct completions in $N$ samples | Output coverage |
| Semantic Diversity | Distance between completions (embedding space) | Diversity in meaning |
| Lexical Diversity | $1 - \text{Self-BLEU}$ (lower Self-BLEU = higher lexical diversity) | Surface variation |
Performance on list sampling, open-ended prompts, and creative writing tasks is reported using these measures, with GAPO delivering lower JSD than the baselines, a several-fold increase in Unique@500 for open sets (from 24 to 147), and higher average semantic diversity in creative tasks (Anschel et al., 16 Nov 2025).
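Two of these metrics are simple enough to compute directly from a list of sampled completions. The following is a minimal sketch, assuming exact string matching between completions and answers; the helper names `jsd_to_uniform` and `unique_at_n`, the base-2 logarithm, and the decision to drop invalid samples before computing JSD are assumptions for illustration rather than the source's evaluation protocol.

```python
import math
from collections import Counter

def jsd_to_uniform(samples, valid_set):
    """Jensen-Shannon divergence (base 2, in [0, 1]) between the empirical
    distribution of the valid samples and the uniform distribution."""
    valid = [s for s in samples if s in valid_set]
    if not valid:
        raise ValueError("no valid samples to evaluate")
    counts = Counter(valid)
    n = len(valid)
    u = 1.0 / len(valid_set)
    jsd = 0.0
    for v in valid_set:
        p = counts[v] / n
        m = 0.5 * (p + u)  # mixture distribution
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        jsd += 0.5 * u * math.log2(u / m)
    return jsd

def unique_at_n(samples, n):
    """Number of distinct completions among the first n samples."""
    return len(set(samples[:n]))

# Usage: a collapsed sampler versus a balanced one.
answers = {"a", "b"}
print(jsd_to_uniform(["a"] * 10, answers), jsd_to_uniform(["a", "b"] * 5, answers))
print(unique_at_n(["a", "b", "a", "c"], 3))
```

A JSD of zero corresponds to perfectly uniform sampling over the valid set, and larger values indicate stronger concentration on a few answers.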
5. Empirical Results and Trade-offs
Experiments applying GAPO to Qwen2.5 models (7B and 32B) reveal the following:
- In closed-set sampling, GAPO achieves lower JSD than baselines such as ChatGPT-4o, Claude, and Gemini models, which remain above $0.30$.
- For open-ended prompts, Unique@500 improves from 24 to 147.
- In creative writing, diversity measured by embedding distance and by $1 - \text{Self-BLEU}$ increases by 160% and 75%, respectively.
- On code and reasoning benchmarks (GSM8K, MATH, HumanEval, MMLU-Pro), GAPO maintains or slightly improves flexible-match scores (for example, on GSM8K), while exact-match scores may decrease slightly.
- Across sampling temperatures, GAPO consistently yields higher creativity at a given accuracy, dominating the baseline on the creativity–coherence spectrum.
A plausible implication is that the tradeoff between diversity and correctness can be tuned via the reward structure and temperature, with GAPO outperforming in diversity at equal or better accuracy (Anschel et al., 16 Nov 2025).
6. Generalization, Limitations, and Best Practices
GAPO demonstrates notable generalization: although trained solely on synthetic list-sampling, it confers diversity benefits on completely unseen, open-ended prompts across diverse domains. Ablations show that vanilla SFT over all valid completions may achieve in-distribution uniformity but fails to generalize, with Unique@500 plummeting to $3$.
Current limitations include the dependency of frequency-based rewards on an explicit, finite set of valid answers. Extending GAPO to truly unbounded sets requires alternative group-level reward functions. Mitigating mode collapse may also increase the risk of unsafe generations, which calls for the continued use of safety filters.
Implementation best practices, as recommended by the source, are:
- Use sufficiently large group sizes for accurate frequency estimation.
- Omit the KL penalty (set its coefficient to zero) for diversity-focused tuning; reintroduce it if accuracy must be strictly maintained.
- Normalize advantages using group mean and standard deviation for gradient stability.
- Consider entropy regularization for further diversity, and systematically validate that increased diversity does not reduce downstream accuracy (Anschel et al., 16 Nov 2025).
7. Relationship to Prior Work and Extensions
GAPO builds upon and generalizes Group Relative Policy Optimization (GRPO) [see (Yu et al., 12 Sep 2025)], which uses group-relative advantages based on scalar per-sample rewards. In S-GRPO, for example, the reward is a weighted combination of binary compilation, security, and format tests, and the group-based structure supplies refined, prompt-localized learning signals for policy optimization (Yu et al., 12 Sep 2025).
Whereas S-GRPO (and other GRPO derivatives) aim to optimize multiple orthogonal constraints for tasks such as secure code synthesis, GAPO focuses on aggregate properties of the group, such as output diversity, by making reward assignment itself a group-level computation. This enables effective mitigation of mode collapse and provides a versatile template for interventions targeting properties emergent at the group level, including, but not limited to, diversity, coverage, and fairness (Anschel et al., 16 Nov 2025).
The framework is directly applicable to a variety of generative tasks requiring model output diversity, and generalizes to any reward that can be defined as a symmetric group-level function of output samples. Further extensions to settings with unbounded answer sets or to full-model RL fine-tuning remain areas for future investigation.