Amortized Group Relative Policy Optimization

Updated 2 July 2026

Amortized Group Relative Policy Optimization (AGRPO) is an RL framework that leverages group-based reward normalization to reduce variance across heterogeneous tasks.
It efficiently aligns policy optimization in domains like molecular design, combinatorial routing, and LLM preference alignment through unified group conditioning.
AGRPO provides stable, unbiased gradients and improved sample efficiency, outperforming traditional baseline-dependent methods on several benchmarks.

Amortized Group Relative Policy Optimization (AGRPO) is a class of baseline-free, group-normalized, on-policy reinforcement learning algorithms designed to efficiently amortize policy optimization across heterogeneous input conditions. AGRPO overcomes the instability and computational inefficiency encountered by conventional baseline-dependent policy gradient methods in tasks characterized by wide inter-instance diversity, including molecular optimization, neural combinatorial optimization, diffusion LLM post-training, and personalized preference alignment. By leveraging group-wise or amortized reward normalization, AGRPO achieves stable and sample-efficient learning while generalizing across instance classes, user groups, or conditional input spaces (Javaid et al., 12 Feb 2026, Sepúlveda et al., 9 Jun 2026, Wang et al., 17 Feb 2026, Zhan, 5 Oct 2025, Ichihara et al., 3 Feb 2026).

1. Algorithmic Foundation and Group-Normalized Gradient Estimators

AGRPO modifies the standard on-policy policy gradient, which gives an unbiased estimate of the gradient of the expected reward but suffers from high variance when applied across instances with heterogeneous base difficulty or reward scale. Instead of a global baseline or a learned value function, AGRPO centers the advantage estimate within each group, where a group consists of all trajectories sampled from the same conditioning context (e.g., scaffold, graph instance, user preference, or input prompt).

For a batch of $B$ instances (e.g., molecular scaffolds $S_i$ ) and $G$ sampled rollouts per instance, with rewards $r_{i,j}=R(O_{i,j})$ , AGRPO constructs the per-group mean

$\mu_i = \frac{1}{G} \sum_{j=1}^G r_{i,j}$

and the advantage for each trajectory as

$A_{i,j} = r_{i,j} - \mu_i$

In certain settings, an additional normalization by the group standard deviation $\sigma_i$ yields the z-score advantage: $\hat{A}_{i,j} = (r_{i,j} - \mu_i)/(\sigma_i+\epsilon)$ (Sepúlveda et al., 9 Jun 2026, Wang et al., 17 Feb 2026, Ichihara et al., 3 Feb 2026).
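The two advantage definitions above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the cited authors' code: the function name `group_advantages`, the default `eps`, and the use of the population standard deviation for $\sigma_i$ are assumptions made here.

```python
from statistics import fmean, pstdev

def group_advantages(rewards, normalize=False, eps=1e-8):
    """Return per-trajectory advantages for B groups of G rewards each.

    rewards[i][j] plays the role of r_{i,j}. With normalize=True the
    z-score variant is returned (population std is an assumption).
    """
    advantages = []
    for group in rewards:
        mu = fmean(group)                      # per-group mean mu_i
        centered = [r - mu for r in group]     # A_{i,j} = r_{i,j} - mu_i
        if normalize:
            sigma = pstdev(group)              # group standard deviation sigma_i
            centered = [a / (sigma + eps) for a in centered]
        advantages.append(centered)
    return advantages

# Two hypothetical instances with very different reward scales.
rewards = [[1.0, 2.0, 3.0, 6.0], [10.0, 10.0, 14.0, 14.0]]
adv = group_advantages(rewards)
z_adv = group_advantages(rewards, normalize=True)
```

After centering, each group's advantages sum to zero regardless of the reward scale of its instance, so the second group's much larger raw rewards no longer dominate the first.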

The resulting policy gradient estimator is

$\nabla_\theta J(\theta) \approx \frac{1}{B G}\sum_{i=1}^B\sum_{j=1}^G A_{i,j} \sum_{t=0}^{T_{i,j}-1} \nabla_\theta \log \pi_\theta(a_t|s_{<t})$

or its z-score-normalized version (Javaid et al., 12 Feb 2026, Sepúlveda et al., 9 Jun 2026). This approach eliminates the need for an external or delayed baseline policy, resulting in unbiased and stable gradients.
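To make the estimator concrete, the sketch below evaluates it for a toy tabular softmax policy with one-step trajectories, so the inner sum over $t$ has a single term. The shared logit vector, the helper names, and the example data are all illustrative assumptions; a real AGRPO policy would be a neural network conditioned on the instance.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def agrpo_gradient(logits, actions, rewards):
    """Group-centered policy gradient estimate for one-step trajectories.

    actions[i][j] and rewards[i][j] belong to rollout j of instance i.
    """
    B, G = len(actions), len(actions[0])
    grad = [0.0] * len(logits)
    for acts, rews in zip(actions, rewards):
        mu = sum(rews) / G                     # per-group mean mu_i
        for a, r in zip(acts, rews):
            adv = r - mu                       # A_{i,j}
            for k, g in enumerate(grad_log_prob(logits, a)):
                grad[k] += adv * g / (B * G)   # 1/(BG) averaging
    return grad

logits = [0.0, 0.0, 0.0]
actions = [[0, 1], [2, 2]]
rewards = [[1.0, 0.0], [5.0, 5.0]]
grad = agrpo_gradient(logits, actions, rewards)
```

In this example the second group has identical rewards, so its advantages are zero and it contributes nothing; the gradient raises the logit of action 0 and lowers that of action 1 based on the first group alone.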

2. Amortization Across Heterogeneous Tasks and Inputs

AGRPO enables the training of a single conditional policy (e.g., a graph transformer, attention-based pointer network, or LLM) that can generate appropriate outputs for widely varying problem instances in a one-pass, fully amortized manner. The core amortization strategy involves group-based reward normalization, which aligns the learning signal across instances of disparate inherent difficulty or reward scale (Javaid et al., 12 Feb 2026).

For example, in molecular optimization, the GRXForm framework leverages a pre-trained graph transformer to construct molecular elaborations conditioned on arbitrary scaffolds. AGRPO fine-tunes this model via group-centered reward advantages, producing an amortized policy that requires no per-instance search or inference-time oracle calls (Javaid et al., 12 Feb 2026). Similarly, in neural combinatorial optimization (NCO), AGRPO is used with autoregressive decoders on TSP and CVRP benchmarks to amortize policy learning across a population of graph instances (Sepúlveda et al., 9 Jun 2026).

In personalized preference alignment for LLMs, AGRPO is adapted to normalize advantages not only within an immediate group of samples but also across all historical samples from a given user or preference group. This history-based group normalization preserves minority or rare-group reward signals and yields equitable and balanced updates, addressing the bias that arises when normalization statistics are computed over a batch that merges user populations (Wang et al., 17 Feb 2026).

3. Theoretical Properties and Convergence

AGRPO preserves the unbiasedness of the policy gradient by ensuring that the group baseline or normalization term depends only on samples from the same underlying conditional distribution (e.g., $\pi_\theta(\cdot|S_i)$ for molecular scaffolds, or per-preference-group reward histories in personalized AGRPO). This property ensures that the expectation of the gradient remains aligned with the true objective, unlike methods that introduce off-policy baselines or merge heterogeneous reward supports (Javaid et al., 12 Feb 2026, Wang et al., 17 Feb 2026).
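A short worked identity may help make this argument concrete. It is the standard score-function identity, written here in the scaffold notation of Section 1, and is not a derivation taken from the cited papers. For any baseline $b_i$ that does not depend on the sampled output $O$:

$\mathbb{E}_{O \sim \pi_\theta(\cdot|S_i)}\left[ b_i \nabla_\theta \log \pi_\theta(O|S_i) \right] = b_i \nabla_\theta \sum_{O} \pi_\theta(O|S_i) = b_i \nabla_\theta 1 = 0$

When the baseline is the in-group mean $\mu_i$, which includes $r_{i,j}$ itself, and the $G$ rollouts are assumed independent and identically distributed, only the $k=j$ term of $\mu_i$ survives in expectation:

$\mathbb{E}\left[ (r_{i,j} - \mu_i) \nabla_\theta \log \pi_\theta(O_{i,j}|S_i) \right] = \left(1 - \frac{1}{G}\right) \mathbb{E}\left[ r_{i,j} \nabla_\theta \log \pi_\theta(O_{i,j}|S_i) \right]$

Under these assumptions the expected update points in the direction of the true gradient, scaled by the constant $(1 - 1/G)$, which can be absorbed into the learning rate. The argument covers mean-centering only; dividing by $\sigma_i$ additionally rescales each group's contribution.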

By removing between-group reward variance (i.e., instance-dependent baseline shifts), AGRPO yields substantial variance reduction in the advantage signal, facilitating monotonic and stable updates. For instance, in molecular optimization, variance plots show that group-centered advantages prevent the gradient collapse and oscillation seen with global baselines or standard REINFORCE (Javaid et al., 12 Feb 2026). Formal convergence guarantees for group-normalized objectives underpin the application of AGRPO to consensus-based RL objectives and diffusion LLMs (Zhan, 5 Oct 2025, Ichihara et al., 3 Feb 2026). Under standard smoothness and bounded-variance conditions, the standard convergence rate of SGD is maintained (Ichihara et al., 3 Feb 2026).
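The variance-reduction claim can be checked numerically on a small example. The sketch below uses made-up rewards for two instances with different reward scales and compares advantages centered on a single global mean with group-centered ones; the numbers are illustrative and not taken from any cited experiment.

```python
from statistics import fmean, pvariance

# Hypothetical rewards: an "easy" and a "hard" instance on different scales.
rewards = [[0.1, 0.2, 0.3], [10.0, 10.5, 11.0]]

flat = [r for group in rewards for r in group]
global_mean = fmean(flat)

# Advantages centered on one global baseline vs. on each group's own mean.
global_adv = [r - global_mean for r in flat]
group_adv = [r - fmean(group) for group in rewards for r in group]

var_global = pvariance(global_adv)
var_group = pvariance(group_adv)

# Variance of the group means (valid decomposition for equal-size groups).
between = pvariance([fmean(group) for group in rewards])
```

By the law of total variance, `var_global` equals `var_group + between` for equal-size groups, so group centering removes exactly the between-group term, which dominates here.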

4. Implementation in Core Domains

Molecular Optimization:

GRXForm, based on AGRPO, consists of a Graph Transformer backbone (e.g., 10 layers, 16 heads, 512d), teacher-forced on large chemical datasets for valence and syntax pretraining, then fine-tuned using AGRPO with a modest computational budget (50K oracle calls). At each step, a batch of scaffolds is expanded into groups of completions, which are evaluated by an oracle reward, centered on the group mean, and used to update the pretrained policy (Javaid et al., 12 Feb 2026).

Neural Combinatorial Optimization:

AGRPO is applied with a group of $G$ rollouts per instance, with policy gradients constructed per instance over z-score normalized returns. The amortized attention-based decoder uses multi-head Transformer blocks with autoregressive pointer decoding (Sepúlveda et al., 9 Jun 2026).

Preference Alignment in LLMs:

Personalized AGRPO (P-GRPO) maintains online running statistics (mean and variance) for each preference cluster using Welford's algorithm, and computes normalized advantages for all trajectories and tokens with respect to group-specific history, decoupling sample-wise normalization from global batch statistics. This yields improved convergence, group fidelity, and overall accuracy over standard GRPO (Wang et al., 17 Feb 2026).
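A minimal sketch of per-cluster running statistics with Welford's algorithm is given below. The class and function names, the `eps` value, and the choice to update the history before normalizing the incoming reward are assumptions made for illustration; the cited work may order these steps differently.

```python
import math
from collections import defaultdict

class RunningStats:
    """Welford's online algorithm for a running mean and variance."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Population standard deviation of everything seen so far.
        return math.sqrt(self.m2 / self.count) if self.count > 0 else 0.0

# One independent history per preference cluster (hypothetical cluster ids).
history = defaultdict(RunningStats)

def normalized_advantage(group_id, reward, eps=1e-8):
    """Update the cluster's history, then normalize the reward against it."""
    stats = history[group_id]
    stats.update(reward)
    return (reward - stats.mean) / (stats.std + eps)

adv_a = [normalized_advantage("a", r) for r in (1.0, 2.0, 3.0)]
adv_b = [normalized_advantage("b", r) for r in (100.0, 102.0)]
```

Because each cluster keeps its own history, the large rewards of cluster `"b"` do not shift the mean or scale used for cluster `"a"`, which is the decoupling from global batch statistics described above.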

Diffusion LLMs:

For dLLMs, the group-sampled reward normalization and advantage computation are extended to iterative unmasking actions, with Monte Carlo estimation over steps to maintain unbiased gradients. AGRPO outperforms previous heuristics and baseline-dependent methods while keeping the training of diffusion models for math and reasoning tasks tractable (Zhan, 5 Oct 2025).

Consensus Decoding Distillation:

AGRPO underpins Consensus-GRPO, where group-wise consensus utility (e.g., BLEURT or ROUGE-L against peer samples) substitutes for gold-reference rewards, enabling convergence to MBR-optimal policies and quality exceeding sample-and-rerank at a fraction of inference cost (Ichihara et al., 3 Feb 2026).
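The idea of scoring each sample against its peers can be sketched as follows. A whitespace-tokenized longest-common-subsequence F1 is used here as a simplified stand-in for ROUGE-L or BLEURT; it is not the metric implementation used in the cited work, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcs_f1(candidate, reference):
    """LCS-based F1 between two whitespace-tokenized strings."""
    a, b = candidate.split(), reference.split()
    lcs = lcs_length(a, b)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(b)
    return 2 * precision * recall / (precision + recall)

def consensus_rewards(samples):
    """Reward each sample by its mean utility against its peers.

    Requires at least two samples in the group.
    """
    rewards = []
    for i, s in enumerate(samples):
        peers = [p for k, p in enumerate(samples) if k != i]
        rewards.append(sum(lcs_f1(s, p) for p in peers) / len(peers))
    return rewards

samples = ["the cat sat", "the cat sat down", "a dog ran"]
rewards = consensus_rewards(samples)
```

The outlier sample receives zero consensus reward, while the two mutually similar samples score higher; these rewards can then be group-centered like any other reward signal, with no gold reference involved.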

5. Empirical Impact and Benchmark Results

AGRPO consistently delivers robust sample efficiency and generalization across application domains:

Kinase Scaffold Decoration: GRXForm-AGRPO attains top-1 multi-objective optimization scores (0.618±0.004, 17.8% strict success) greatly exceeding GraphXForm, LibINVENT, DrugEx v3, and Mol GA baselines with no test-time oracles (Javaid et al., 12 Feb 2026).
NCO (TSP/CVRP): On TSP-100, AGRPO maintains stable cost (~8.20) across seeds, avoiding catastrophic failure seen in REINFORCE with rollout baselines. Solution quality is within 2% of strong multi-start AM baselines like POMO, with no external baseline required (Sepúlveda et al., 9 Jun 2026).
Personalized Alignment: Top-1 movie recommendation accuracy and generation task rewards are consistently higher under P-GRPO than standard GRPO across LLM scales and datasets. Super-class clustering amplifies gains; random clusters ablate the effect (Wang et al., 17 Feb 2026).
Diffusion LLMs: On GSM8K, AGRPO achieves 87.3% accuracy (+7.6%), and on Countdown, 40% (≈3.8× baseline), outperforming both reference and non-reference RL post-training alternatives (Zhan, 5 Oct 2025).
Consensus Generation: On WMT 2024, Consensus-GRPO and Dr-GRPO surpass MBR decoding in COMET score while reducing inference cost by 1–2 orders of magnitude; similar gains are obtained for summarization (XSum, ROUGE-L_sum) (Ichihara et al., 3 Feb 2026).

6. Limitations, Practical Considerations, and Extensions

AGRPO relies on adequate trajectory diversity within sampled groups; for easy tasks or narrow policies, group variance may collapse, requiring adaptive temperature schedules or an additive $\epsilon$ for numerical stability (Sepúlveda et al., 9 Jun 2026). In molecular optimization, action spaces limited to atom/bond addition exclude challenging morphing or fragment-linking. History-based normalization in personalized AGRPO assumes accurate tracking of group assignments and sufficient per-group training data (Wang et al., 17 Feb 2026). Training overhead is higher than standard SFT, but inference is amortized to a single forward pass in nearly all applications (Ichihara et al., 3 Feb 2026).
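The variance-collapse issue can be illustrated with a small diagnostic. The helper below, whose name and tolerance are assumptions for this sketch, reports the fraction of groups whose rewards are effectively identical, and the z-score function shows how a small stabilizing constant in the denominator keeps such groups well defined.

```python
from statistics import pstdev

def collapsed_fraction(reward_groups, tol=1e-6):
    """Fraction of groups whose reward standard deviation is at most tol."""
    collapsed = sum(1 for g in reward_groups if pstdev(g) <= tol)
    return collapsed / len(reward_groups)

def zscore(group, eps=1e-8):
    """Z-score advantages with a stabilizing constant in the denominator."""
    mu = sum(group) / len(group)
    sigma = pstdev(group)
    return [(r - mu) / (sigma + eps) for r in group]

# Hypothetical batch: the first group has collapsed to a single reward value.
groups = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
frac = collapsed_fraction(groups)
collapsed_adv = zscore(groups[0])
```

A collapsed group yields all-zero advantages instead of a division by zero, so it is numerically safe but carries no learning signal; a rising collapsed fraction is one possible cue that sampling diversity needs attention.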

Extensions include hybridization with instance-level search/refinement, integration with surrogate or multi-fidelity offline oracles, dynamic group sizing, and broader application to unseen combinatorial or structured prediction tasks (Javaid et al., 12 Feb 2026, Sepúlveda et al., 9 Jun 2026). In personalized settings, adaptive clustering and amortized reward histories further improve representational equity (Wang et al., 17 Feb 2026).

7. Summary Table: AGRPO Application Scope

Domain	Policy Backbone	Group Definition	Key Empirical Gains
Molecular Design	Graph Transformer	Per-scaffold completions	SOTA OOD generalization, stable RL (Javaid et al., 12 Feb 2026)
Combinatorial Routing	Pointer-style Transformer	Per-instance rollouts	Baseline-free, matches POMO/AM (Sepúlveda et al., 9 Jun 2026)
LLM Preference Alignment	LM (Prompt+Preference group)	Per-group, amortized history	Group-fair convergence, generality (Wang et al., 17 Feb 2026)
Diffusion LLM	dLLM, iterative unmasking	Grouped MC sample of steps	Large accuracy boosts, tractable RL (Zhan, 5 Oct 2025)
Consensus Text Generation	LM + sample group (MBR surrogate)	Peer candidate set	Amortized MBR decoding, reduces cost (Ichihara et al., 3 Feb 2026)

In conclusion, AGRPO constitutes a robust baseline-free methodology for amortized policy optimization under structural, task-based, or user-group heterogeneity. Its consistent adoption across molecular design, combinatorial optimization, LLM alignment, non-autoregressive LMs, and consensus learning highlights its versatility and empirical efficacy (Javaid et al., 12 Feb 2026, Sepúlveda et al., 9 Jun 2026, Wang et al., 17 Feb 2026, Zhan, 5 Oct 2025, Ichihara et al., 3 Feb 2026).