Group Contrastive Policy Optimization (GCPO)

Updated 4 July 2026

Group Contrastive Policy Optimization (GCPO) is a family of grouped reinforcement learning methods that use contrasts between positive and negative rollouts to enhance policy updates.
It employs mechanisms like InfoNCE-style objectives, group reward shaping, and gold-answer substitution to improve credit assignment and address policy degeneracy.
Empirical studies show GCPO variants boost performance in areas such as geometry problem solving and mathematical reasoning, albeit with increased computational overhead.

Group Contrastive Policy Optimization (GCPO) denotes a class of group-based reinforcement learning procedures in which policy updates are driven by explicit contrasts between positive and negative rollouts, candidates, or grouped reward signals. The label is not fully standardized: some papers introduce GCPO as an explicit algorithm, while others use it as a natural interpretation or a concrete instantiation of group-wise contrastive optimization built on Group Relative Policy Optimization (GRPO), InfoNCE-style objectives, or contrastive reward shaping. This suggests that GCPO is best understood as a research lineage rather than a single canonical objective (Wang et al., 8 Jun 2025, Wu et al., 9 Oct 2025, Wu et al., 1 Oct 2025).

1. Terminological scope and nomenclature

The literature uses “GCPO” in both narrow and broad senses. In some works it is the official method name; in others it is an explanatory label for a GRPO-style algorithm with group-wise contrastive structure.

Paper	Status of the GCPO label	Defining mechanism
GeometryZero (Wang et al., 8 Jun 2025)	Explicit GCPO	Group Contrastive Masking over auxiliary construction, plus length reward
“When Contrast Fails, Go Gold” (Wu et al., 9 Oct 2025)	Explicit GCPO	Gold-answer substitution when all sampled responses fail; sequence-level importance sampling
“It Takes Two: Your GRPO Is Secretly DPO” (Wu et al., 1 Oct 2025)	GCPO defined as contrastive re-expression of GRPO	Linear positive-negative score difference and InfoNCE-style generalization
GRACE (Sun et al., 6 Oct 2025)	Not explicit GCPO	“A natural interpretation” of contrastive policy optimization with GRPO-style grouping
ConSPO (Zhang et al., 13 May 2026)	Not named GCPO, but presented as concrete instantiation	Group-wise InfoNCE over positive and negative rollouts with likelihood-aligned scores
Neighbor GRPO (He et al., 21 Nov 2025)	Not explicit GCPO	Grouped neighborhood candidates and distance-based softmax surrogate policy
MCPO (Yu et al., 25 May 2026)	Not explicit GCPO	Multi-domain contrastive alignment over grouped rollouts and prompt prototypes

The same acronym is also used for different algorithms that are not “Group Contrastive Policy Optimization,” including “Guidance Contrastive Policy Optimization,” “Group Critical-token Policy Optimization,” “Group Causal Policy Optimization,” and “Group Cooperative Policy Optimization.” This acronym collision is a recurrent source of confusion and should be resolved from the paper title and formal objective rather than from the acronym alone (Li et al., 28 May 2026, Zhang et al., 26 Sep 2025, Gu et al., 7 Aug 2025, Chen et al., 12 May 2026).

2. Core optimization pattern

Several formulations share a common structural backbone: a prompt-conditioned group of rollouts or candidates is sampled; rewards or scores are normalized within the group; positives are contrasted against negatives; and the policy is updated with a clipped surrogate or an advantage-weighted likelihood term. In the contrastive reframing of GRPO, the linear GCPO objective is written as

$L_{\mathrm{GCPO\text{-}lin}}(\theta)= - \mathbb{E}_q \left[ w(q)\left( \frac{1}{|P|}\sum_{o\in P}s_\theta(o\mid q)-\frac{1}{|N|}\sum_{o\in N}s_\theta(o\mid q)\right)\right],$

while an InfoNCE-style variant is

$L_{\mathrm{GCPO\text{-}NCE}}(\theta)= - \mathbb{E}_q \left[\log \sum_{o\in P}\exp(\beta s_\theta(o\mid q))-\log \sum_{o\in P\cup N}\exp(\beta s_\theta(o\mid q))\right].$

In that formulation, GRPO becomes a special case of a linear contrastive objective with a prompt weight $w(q)=\sqrt{\mathrm{Var}(q)}$ and rollout score $s_\theta(o\mid q)=\pi_\theta^{\mathrm{GRPO}}(o\mid q)$ (Wu et al., 1 Oct 2025).
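The two objectives above can be made concrete for a single prompt. The following is a minimal sketch, assuming the rollout scores $s_\theta(o\mid q)$ have already been computed and split into a positive set $P$ and a negative set $N$; the function names, the example scores, and the plain-Python implementation are illustrative assumptions and are not taken from any of the cited papers.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def gcpo_linear_loss(pos_scores, neg_scores, weight=1.0):
    """Per-prompt linear loss: -w(q) * (mean positive score - mean negative score)."""
    mean_pos = sum(pos_scores) / len(pos_scores)
    mean_neg = sum(neg_scores) / len(neg_scores)
    return -weight * (mean_pos - mean_neg)

def gcpo_nce_loss(pos_scores, neg_scores, beta=1.0):
    """Per-prompt InfoNCE-style loss:
    -(log sum_{o in P} exp(beta*s) - log sum_{o in P u N} exp(beta*s))."""
    pos = [beta * s for s in pos_scores]
    everything = pos + [beta * s for s in neg_scores]
    return -(logsumexp(pos) - logsumexp(everything))

# Hypothetical scores for one prompt (illustrative values only).
pos = [-0.5, -0.7]
neg = [-1.5, -2.0, -1.0]
linear = gcpo_linear_loss(pos, neg, weight=0.5)
nce = gcpo_nce_loss(pos, neg, beta=2.0)
```

Because $P \subseteq P\cup N$, the InfoNCE-style loss in this sketch is never negative and approaches zero as the negatives' scores fall far below the positives'. The linear loss has no such floor, which is one way to see why the two variants can behave differently once positives and negatives are already well separated.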

A closely related RLVR formulation appears in ConSPO, which replaces GRPO’s clipped-ratio rollout scores with length-normalized sequence log-probabilities,

$s_\theta(o\mid q)=\frac{1}{|o|}\sum_{t=1}^{|o|}\log \pi_\theta(o_t\mid q,o_{<t}),$

and then contrasts each positive rollout against negative distractors from the same group through a group-wise InfoNCE objective with temperature $\tau$ and a curriculum-scheduled margin. The paper identifies two GRPO limitations—“likelihood-misaligned scoring” and “score-insensitive credit assignment”—and uses contrastive softmax weighting to amplify updates for poorly separated positives while concentrating suppressive updates on high-scoring negatives (Zhang et al., 13 May 2026).
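As an illustration of the scoring rule and the softmax weighting described above, the sketch below computes a length-normalized score from per-token log-probabilities and a per-positive InfoNCE term at temperature $\tau$. It is a simplified sketch under stated assumptions: the curriculum-scheduled margin is omitted, the function names and numbers are hypothetical, and the exact ConSPO objective may differ from this form.

```python
import math

def length_normalized_score(token_logprobs):
    """Average per-token log-probability of one rollout."""
    return sum(token_logprobs) / len(token_logprobs)

def positive_infonce_loss(pos_score, neg_scores, tau=1.0):
    """-log softmax of one positive against same-group negatives.
    Assumption: no margin term; the positive occupies index 0 of the logits."""
    logits = [pos_score / tau] + [s / tau for s in neg_scores]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]

# Hypothetical per-token log-probabilities for one positive rollout.
score = length_normalized_score([-0.2, -0.4, -0.6])

# The same positive against a close negative and against a distant one.
loss_close = positive_infonce_loss(score, [-0.5], tau=0.5)
loss_far = positive_infonce_loss(score, [-2.0], tau=0.5)
```

In this sketch a poorly separated positive (negative score close to the positive's) yields a larger loss than a well separated one, which matches the described behavior of amplifying updates where separation is weak.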

GRACE supplies a representation-learning variant of the same underlying idea. There, the model is treated as a policy $\pi_\theta$ that generates explicit rationales, and rewards are constructed from query–positive similarity, negative similarity, positive-rollout consistency, and in-batch hard negatives. The paper does not introduce the explicit term “GCPO,” but it states that combining GRPO-style group baselines with group-aggregated contrastive rewards is a natural interpretation of group-wise contrastive policy optimization (Sun et al., 6 Oct 2025).

3. Principal algorithmic realizations

In GeometryZero, GCPO is an explicit RLVR algorithm for geometry problem solving with auxiliary construction. Its central mechanism is Group Contrastive Masking: for each question, the method samples a base group $O$, a forced-with-auxiliary group $O^{w}$, and a forced-without-auxiliary group $O^{wo}$. The auxiliary reward is then masked according to the relative accuracy of the two contrastive groups, so that auxiliary construction is rewarded, penalized, or zeroed depending on whether forcing it helps, hurts, or makes no clear difference on that question. The final reward incorporates this masked auxiliary reward alongside a length reward; the exact reward expressions and the hyperparameter settings used in the reported experiments are given in the paper (Wang et al., 8 Jun 2025).
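The masking idea can be sketched as follows. This is a hypothetical illustration, not the paper's formula: it assumes a three-valued mask (+1, 0, -1) obtained by comparing the accuracies of $O^{w}$ and $O^{wo}$ with a tolerance `eps`, and it assumes the masked reward applies only to rollouts that actually use auxiliary construction. The names `eps`, `scale`, and all function names are introduced here for illustration.

```python
def group_accuracy(correct_flags):
    """Fraction of correct rollouts in one group."""
    return sum(correct_flags) / len(correct_flags)

def contrastive_mask(acc_with, acc_without, eps=0.0):
    """+1 if forcing auxiliary construction helps, -1 if it hurts, 0 otherwise.
    Assumption: a symmetric tolerance eps decides what counts as a difference."""
    if acc_with > acc_without + eps:
        return 1
    if acc_with < acc_without - eps:
        return -1
    return 0

def masked_auxiliary_reward(used_auxiliary, mask, scale=1.0):
    """Assumption: only rollouts that used auxiliary construction receive the term."""
    return scale * mask if used_auxiliary else 0.0

# Hypothetical correctness flags for the two contrastive groups of one question.
acc_w = group_accuracy([1, 1, 1, 0])    # forced-with-auxiliary group
acc_wo = group_accuracy([0, 1, 0, 0])   # forced-without-auxiliary group
mask = contrastive_mask(acc_w, acc_wo, eps=0.1)
reward_for_aux_rollout = masked_auxiliary_reward(True, mask, scale=0.5)
```

Under these assumptions the same behavior (using auxiliary construction) is rewarded on questions where it raises group accuracy and penalized where it lowers it, which is what makes the signal question-dependent.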

A second explicit realization appears in “GCPO: When Contrast Fails, Go Gold.” This version targets a degeneracy of GRPO: when all sampled responses are wrong, every reward in the group is identical, so the intra-group advantages collapse to zero. GCPO resolves that case by substituting one rollout with a gold answer and recomputing the normalized advantage. If the reward vector is not all zero, standard group normalization is used; if all sampled responses fail, one response is replaced by the gold trajectory, yielding a non-degenerate reward vector and a nonzero update direction. The loss uses sequence-level importance sampling.

This algorithm also drops KL regularization and aligns the optimization granularity with the sequence-level verifier reward (Wu et al., 9 Oct 2025).
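The substitution rule can be illustrated with a small sketch. It assumes binary verifier rewards, mean and standard-deviation group normalization, and replacement of the first rollout; which rollout is replaced, the `eps` stabilizer, and all names are assumptions made here rather than details from the paper.

```python
import math

def group_normalized_advantages(rewards, eps=1e-8):
    """(r - mean) / (std + eps) computed within one group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def substitute_gold_if_all_fail(rollouts, rewards, gold_rollout, gold_reward=1.0):
    """If every reward is zero, replace one rollout with the gold trajectory.
    Assumption: slot 0 is replaced; the paper's choice of slot is not specified here."""
    new_rollouts, new_rewards = list(rollouts), list(rewards)
    if all(r == 0 for r in rewards):
        new_rollouts[0] = gold_rollout
        new_rewards[0] = gold_reward
    return new_rollouts, new_rewards

# A group in which every sampled response fails.
rollouts = ["wrong A", "wrong B", "wrong C", "wrong D"]
rewards = [0.0, 0.0, 0.0, 0.0]

before = group_normalized_advantages(rewards)  # all zeros: no learning signal
new_rollouts, new_rewards = substitute_gold_if_all_fail(rollouts, rewards, "gold answer")
after = group_normalized_advantages(new_rewards)  # gold positive, failures negative
```

In this sketch the all-fail group goes from a zero advantage vector to one in which the gold trajectory has a positive advantage and the failed rollouts have negative advantages, so the group again supplies a contrast.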

Neighbor GRPO provides a third realization, this time for deterministic ODE flow models. The method builds a neighborhood group by perturbing the initial noise and defines a distance-based surrogate group policy over the resulting candidates.

The paper explicitly presents this as a contrastive ODE policy optimization mechanism and maps it to a broader GCPO perspective in which grouped candidates are optimized through a softmax over negative distances, with advantages normalized inside the group and optionally reweighted by a quasi-norm (He et al., 21 Nov 2025).
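A softmax over negative distances can be sketched as below. This is an illustrative assumption about the general shape of such a surrogate, not the paper's definition: it uses squared Euclidean distance and a `temperature` parameter, both chosen here for concreteness, and the vectors are hypothetical.

```python
import math

def surrogate_group_policy(sample, candidates, temperature=1.0):
    """Softmax over negative squared Euclidean distances from `sample` to each candidate.
    Assumption: squared Euclidean distance and a scalar temperature."""
    logits = [
        -sum((a - b) ** 2 for a, b in zip(sample, c)) / temperature
        for c in candidates
    ]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical 2-D candidates obtained by perturbing an initial point.
candidates = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
probs = surrogate_group_policy([0.1, 0.0], candidates, temperature=0.5)
```

Candidates closer to the current sample receive more probability mass, so raising the probability of a high-advantage candidate corresponds to moving the sample toward it and away from the others.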

4. Credit assignment and the contrastive signal

One major axis of variation inside the GCPO literature is the level at which contrast is applied. GeometryZero applies contrast at the reward-shaping level: auxiliary construction is rewarded, penalized, or zeroed according to the relative benefit measured by the two contrastive rollout groups. The resulting signal is context-dependent rather than unconditional, which the paper identifies as the main difference from earlier tool-use rewards that positively reinforce auxiliary construction whenever it appears (Wang et al., 8 Jun 2025).

Other works sharpen the contrastive signal at the sequence or token level. CEPO, although not named GCPO, constructs a positive teacher from a correct rollout and a negative teacher from rejected rollouts in the same group. From the two teachers it derives a per-token contrastive evidence score, maps that score to a token weight, and uses the weight to form a token-level reweighted advantage.

The paper argues that this sharpens credit on decisive reasoning steps while leaving filler tokens near neutral (Heakl et al., 19 May 2026).

Guidance Contrastive Policy Optimization, which is not “Group Contrastive Policy Optimization” but shares the acronym, applies contrastive credit assignment through positive and negative prompts. It defines a per-token guidance score, normalizes it through a within-sequence empirical CDF, and uses the normalized score to scale the group-normalized sample advantage.

This adjacent line of work is relevant because it shows how group-based policy optimization can be combined with explicit contrast at token resolution, but it should not be conflated with the group-contrastive formulations described above (Li et al., 28 May 2026).

5. Empirical behavior and application domains

The empirical record of GCPO-like methods spans geometry, mathematical reasoning, multimodal reasoning, text and image generation, video generation, and representation learning. In geometry, GeometryZero reports that GCPO-equipped models “consistently outperform baselines (e.g. GRPO), achieving an average improvement of 4.29% across all benchmarks” on Geometry3K, MathVista, GeomVerse, and OlympiadBench (Wang et al., 8 Jun 2025).

In mathematical reasoning for small LLMs, the gold-answer version of GCPO reports substantial gains over both the baseline model and DAPO. The paper states that GCPO improves AIME 2024 mean@32 by 25% over DAPO, delivers roughly a 54% performance gain over the baseline on MathQA, and reaches an average of 36.95 in its six-benchmark ablation table, versus 30.37 for DAPO and 26.25 for the baseline (Wu et al., 9 Oct 2025). A different contrastive reformulation of GRPO shows that 2-GRPO achieves performance on par with 16-GRPO while using only $1/8$ of the rollouts and reducing training time by over 70%, reinforcing the claim that useful group contrast can survive even in the minimal two-rollout case (Wu et al., 1 Oct 2025).

ConSPO, presented as a concrete group-contrastive policy optimization framework for RLVR, consistently outperforms GRPO and several strong baselines on mathematical reasoning. Reported averages include 44.4 versus 40.0 for GRPO at 1.5B scale, 57.5 versus 55.1 at 7B, 51.8 versus 49.1 at 8B, and 34.0 versus 30.3 on Qwen3-4B-Base (Zhang et al., 13 May 2026).

In representation learning, GRACE reframes contrastive learning as policy optimization over generated rationales and reports “broad cross category gains” on MTEB: in the supervised setting, the overall score improves by 11.5% over instruction-tuned base models across four backbones, while the unsupervised variant adds 6.9%, with general-domain metrics largely preserved on average. The paper does not name this GCPO, but it explicitly identifies GRPO-style grouping plus contrastive reward shaping as the mechanism (Sun et al., 6 Oct 2025).

In multi-domain post-training, MCPO extends the contrastive idea across prompts and domains. It aligns intra-domain correct rollouts, treats compatible cross-domain correct trajectories as positives, and uses incorrect rollouts as negatives. The reported overall gains are +3.74 over GRPO and +3.36 over DAPO on Qwen3-4B, and +3.71 over GRPO and +4.19 over DAPO on Qwen3-8B (Yu et al., 25 May 2026).

In generative modeling, Neighbor GRPO reports that its contrastive ODE formulation outperforms SDE-based baselines in training cost, convergence speed, and generation quality; the user study records 72% preference against DanceGRPO and 61% against MixGRPO (He et al., 21 Nov 2025).

6. Limitations, open questions, and conceptual boundaries

The literature repeatedly emphasizes computational overhead. In GRACE, generation dominates latency, encoder-only methods are faster, and deployment depends on throughput optimizations such as fused kernels, KV-cache, speculative decoding, and low-precision inference (Sun et al., 6 Oct 2025). GeometryZero quantifies the cost of contrastive group sampling directly: on H100 GPUs, GCPO increases runtime from 7.13 h to 12.53 h at 1.5B, from 11.20 h to 17.26 h at 3B, and from 15.46 h to 22.67 h at 7B, corresponding to overheads of +75.73%, +54.10%, and +46.63% relative to GRPO (Wang et al., 8 Jun 2025).
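The overhead percentages follow from the runtimes as (GCPO hours − GRPO hours) / GRPO hours. The short check below recomputes them from the rounded hour figures quoted above; the variable and function names are illustrative.

```python
# (GRPO hours, GCPO hours) per model scale, as quoted in the text.
runtimes_h = {
    "1.5B": (7.13, 12.53),
    "3B": (11.20, 17.26),
    "7B": (15.46, 22.67),
}

# Overheads as reported, in percent.
reported_pct = {"1.5B": 75.73, "3B": 54.10, "7B": 46.63}

def relative_overhead_pct(grpo_hours, gcpo_hours):
    """Extra runtime of GCPO relative to GRPO, in percent."""
    return 100.0 * (gcpo_hours - grpo_hours) / grpo_hours

recomputed = {
    scale: relative_overhead_pct(grpo, gcpo)
    for scale, (grpo, gcpo) in runtimes_h.items()
}
```

Recomputing from the rounded runtimes agrees with the reported overheads to within about 0.01 percentage point at every scale. The check also makes the trend explicit: the relative overhead shrinks as model size grows, even though the absolute extra time increases.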

Many variants also depend on auxiliary signals that may not always be available or reliable. Gold-answer GCPO requires external reference answers, whether human-authored or produced by a stronger teacher model, and this creates an obvious cost and availability constraint (Wu et al., 9 Oct 2025). Guidance Contrastive Policy Optimization requires prompt-dependent rewards and can fail when the negative prompt is not informative or when very small models do not reliably distinguish correct from incorrect conditioning (Li et al., 28 May 2026). MCPO notes that prototype similarity can mis-weight cross-domain positives when correct rollouts are sparse or noisy, while binary rewards collapse nuanced quality differences (Yu et al., 25 May 2026).

A deeper conceptual limitation is terminological rather than algorithmic. “GCPO” does not designate a single universally accepted method. Some papers use it explicitly for group contrastive masking or gold-answer substitution; others apply it as an interpretive umbrella for group-wise InfoNCE, contrastive sequence optimization, or contrastive ODE policy optimization; still others use the same acronym for critical-token, causal, or cooperative policy optimization. The most defensible encyclopedic reading is therefore plural: GCPO refers to a family of grouped policy-optimization schemes in which contrast between positive and negative trajectories, candidates, or reward-conditioned explanations becomes the central update signal, but the precise score, grouping rule, and credit-assignment mechanism remain paper-specific (Zhang et al., 26 Sep 2025, Gu et al., 7 Aug 2025, Chen et al., 12 May 2026).