Neighbor GRPO: Group Relative Policy Optimization

Updated 28 November 2025

Neighbor GRPO is a framework that uses deterministic ODE sampling and neighbor-based policies to optimize models in multi-agent and distributed settings.
The approach employs a softmax distance-based surrogate policy and structured loss functions to improve convergence and reduce computational cost.
It demonstrates practical benefits in generative modeling, decentralized convex optimization, and reinforcement learning through enhanced sample efficiency and robust performance.

Neighbor GRPO refers to algorithms and frameworks that utilize group-relative or neighbor-based policies for optimization, particularly in settings involving distributed networks, reinforcement learning, and generative model alignment. This article covers three principal domains: (1) contrastive ODE policy optimization for flow-based generative models, (2) asynchronous gossip-based random projection for distributed convex optimization, and (3) group-relative policy optimization for neighbor groups in spatial multi-agent games.

1. Deterministic Contrastive Alignment: Neighbor GRPO for Flow Matching Models

Neighbor GRPO emerges as an advancement over prior SDE-based Group Relative Policy Optimization (GRPO) methods, addressing critical inefficiencies in aligning deterministic flow-based generative models with human-derived rewards. Classical approaches, such as Flow-GRPO and DanceGRPO, introduce stochasticity by mapping deterministic ODEs to SDEs, which:

Precludes use of high-order ODE solvers (e.g., DPM-Solver++), restricting training and inference to first-order integrators
Impairs credit assignment, as terminal rewards are diffused across a chain of noise-injection steps, leading to slow learning and unstable convergence

Neighbor GRPO addresses both problems by preserving deterministic ODE sampling at training and inference time, while still enabling reinforcement learning–style alignment that is sample-efficient and compatible with high-order solvers (He et al., 21 Nov 2025).

2. Mathematical Formulation and Surrogate Policy

ODE Neighborhood Construction

Given a shared base noise $\varepsilon^* \sim \mathcal{N}(0, I)$, Neighbor GRPO generates a group of $G$ neighbor initializations via

$\varepsilon^{(i)} = \sqrt{1-\sigma^2}\,\varepsilon^* + \sigma\,\delta^{(i)}, \qquad \delta^{(i)} \sim \mathcal{N}(0, I),\ i=1,\dots,G,\ \sigma\in(0,1)$

Each $\varepsilon^{(i)}$ is deterministically evolved by the ODE solver to yield trajectories $x_{1\to0}^{(i)}$.
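The neighborhood construction can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `neighbor_noises`, the dimension 4096, and the values $G=8$, $\sigma=0.3$ are assumptions chosen for the example, not settings from the paper. Because $\varepsilon^*$ and $\delta^{(i)}$ are independent standard normals, the mixing weights $\sqrt{1-\sigma^2}$ and $\sigma$ keep each $\varepsilon^{(i)}$ marginally $\mathcal{N}(0, I)$.

```python
import numpy as np

def neighbor_noises(base_noise, num_neighbors, sigma, rng):
    """Return `num_neighbors` perturbed copies of `base_noise`.

    Implements eps_i = sqrt(1 - sigma^2) * eps_star + sigma * delta_i with
    delta_i ~ N(0, I), so each eps_i has unit marginal variance.
    """
    if not 0.0 < sigma < 1.0:
        raise ValueError("sigma must lie in (0, 1)")
    deltas = rng.standard_normal((num_neighbors,) + base_noise.shape)
    return np.sqrt(1.0 - sigma**2) * base_noise + sigma * deltas

rng = np.random.default_rng(0)
eps_star = rng.standard_normal(4096)   # shared base noise (illustrative size)
eps = neighbor_noises(eps_star, num_neighbors=8, sigma=0.3, rng=rng)
print(eps.shape)          # one row per neighbor initialization
print(float(eps.var()))   # empirical variance, expected to be close to 1
```

Each row of `eps` would then be passed to the ODE solver as an initial condition.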

Softmax Distance-Based Leaping Policy

A surrogate policy is constructed by randomly selecting one anchor trajectory $x_t^{(\theta)}$ at time $t$ and defining a softmax over squared $L_2$ distances:

$\pi_\theta\left(x_t^{(i)} \mid \{s_t\}\right) = \frac{\exp\left(-\|x_t^{(i)} - x_t^{(\theta)}\|_2^2 / \alpha\right)}{\sum_{j=1}^G \exp\left(-\|x_t^{(j)} - x_t^{(\theta)}\|_2^2 / \alpha\right)}$

This "leaping" policy acts as a virtual jump among ODE trajectories, conferring training-time stochasticity with marginals preserved up to $O(\alpha^2)$.
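As a minimal sketch of the softmax above, the following NumPy function evaluates the policy probabilities for a small group of states. The function name `leaping_policy` and the toy two-dimensional states are assumptions made for illustration; subtracting the maximum logit is a standard numerical-stability step and does not change the result.

```python
import numpy as np

def leaping_policy(states, anchor, alpha):
    """Softmax over negative squared L2 distances to the anchor.

    states: array of shape (G, d), the group states x_t^(i)
    anchor: array of shape (d,), the anchor state x_t^(theta)
    alpha:  positive temperature
    """
    sq_dist = np.sum((states - anchor) ** 2, axis=1)
    logits = -sq_dist / alpha
    logits = logits - logits.max()     # stabilise the exponentials
    weights = np.exp(logits)
    return weights / weights.sum()

states = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
probs = leaping_policy(states, anchor=np.array([0.0, 0.0]), alpha=1.0)
print(probs)   # the state closest to the anchor gets the largest probability
```

A smaller `alpha` concentrates the probability mass on the states nearest the anchor, while a larger `alpha` flattens the distribution.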

Surrogate Loss

For terminal rewards $r_i$ and group-wise normalized advantages $A_i$ (optionally rescaled with a quasi-norm), the per-step surrogate loss is

$L_t(\theta) = -\sum_{i=1}^G A_i\;\log\frac{\exp\left(-\|x_t^{(i)}-x_t^{(\theta)}\|_2^2 / \alpha\right)}{\sum_{j=1}^G\exp\left(-\|x_t^{(j)}-x_t^{(\theta)}\|_2^2 / \alpha\right)}$

Averaging across $K$ time steps and $B$ anchors yields the final update objective (He et al., 21 Nov 2025).
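The per-step loss can be evaluated with a short NumPy sketch. Two assumptions are worth flagging: the advantages here use the standard GRPO normalization (zero mean, unit standard deviation within the group) rather than the quasi-norm variant, and the code only computes the loss value, whereas training would differentiate it with respect to the model parameters through the anchor state. The function names and the toy data are hypothetical.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Standard GRPO-style normalisation: zero mean, unit std within the group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def surrogate_loss(states, anchor, advantages, alpha):
    """L_t = -sum_i A_i * log softmax_i(-||x_i - anchor||^2 / alpha)."""
    logits = -np.sum((states - anchor) ** 2, axis=1) / alpha
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-np.sum(advantages * log_probs))

# Two neighbor states; the first one received the higher reward.
states = np.array([[1.0, 0.0], [-1.0, 0.0]])
adv = group_advantages([1.0, 0.0])

loss_center = surrogate_loss(states, np.array([0.0, 0.0]), adv, alpha=1.0)
loss_near = surrogate_loss(states, np.array([0.5, 0.0]), adv, alpha=1.0)
print(loss_center, loss_near)   # the loss drops as the anchor nears the high-reward state
```

In this toy case, moving the anchor toward the higher-reward state lowers the loss, which matches the "pull toward high reward, push away from low reward" behaviour described for the method.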

Neighbor GRPO’s surrogate policy is shown to be a valid trust-region policy gradient. The underlying geometry matches that of SDE-based GRPO: deterministic ODE samples are incrementally "pulled" toward high-reward perturbations and "pushed" away from low-reward ones, but without the need for SDE noise injection.

Key refinements include:

Symmetric anchor sampling: By the Johnson–Lindenstrauss lemma, neighbor initializations are nearly equidistant, enabling computation reduction by updating only on $B \ll G$ sampled anchors per iteration
Group-wise quasi-norm reweighting: For groups whose rewards are nearly flat, a quasi-norm ($\ell_p$, $p<2$) normalization is applied to the advantages, increasing update selectivity in uninformative settings
High-order ODE solver usage: The procedure is compatible with DPM-Solver++ and similar solvers, significantly reducing the number of function evaluations per step
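The near-equidistance property behind symmetric anchor sampling can be checked numerically. Under the neighborhood construction, $\varepsilon^{(i)} - \varepsilon^{(j)} = \sigma(\delta^{(i)} - \delta^{(j)})$, so the expected squared distance between two neighbors is $2\sigma^2 d$ in dimension $d$. The sketch below compares empirical pairwise distances with that value; the dimension, group size, and $\sigma$ are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, G, sigma = 16384, 12, 0.3            # illustrative values only

eps_star = rng.standard_normal(d)
eps = np.sqrt(1 - sigma**2) * eps_star + sigma * rng.standard_normal((G, d))

# Pairwise squared distances between all neighbor initializations.
diffs = eps[:, None, :] - eps[None, :, :]
sq_dist = np.sum(diffs**2, axis=-1)
off_diag = sq_dist[~np.eye(G, dtype=bool)]

# Ratio to the expected value 2 * sigma^2 * d; values near 1 mean near-equidistant.
ratio = off_diag / (2 * sigma**2 * d)
print(float(ratio.min()), float(ratio.max()))
```

In high dimensions the ratios cluster tightly around 1, which is the sense in which a small subset of $B$ anchors is representative of the whole group.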

4. Empirical Results

Efficiency and Convergence

Experiments on FLUX.1-dev (latent flow-matching) with 25-step DDIM and 8/16-step DPM-Solver++ demonstrate:

Method	$\mathrm{NFE}_\theta$	Time/Iter (s)	Iterations to Target
SDE-based GRPO (G=12)	14	238	$\gtrsim$ 100
Neighbor GRPO (25-step DDIM)	4	142	$\lesssim$ 50
Neighbor GRPO (8-step DPM++)	1.33	45	$\lesssim$ 50

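In the table above, Neighbor GRPO reaches the target in roughly half as many iterations as SDE-based GRPO ($\lesssim$ 50 versus $\gtrsim$ 100) and lowers the time per iteration from 238 s to 142 s with 25-step DDIM and to 45 s with 8-step DPM-Solver++ (He et al., 21 Nov 2025).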
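To make the per-iteration savings concrete, the short script below computes speedup ratios from the reported timings. The numbers are copied from the table; only the dictionary layout and variable names are introduced here.

```python
# Per-iteration wall-clock times in seconds, taken from the table above.
time_per_iter = {
    "SDE-based GRPO (G=12)": 238,
    "Neighbor GRPO (25-step DDIM)": 142,
    "Neighbor GRPO (8-step DPM++)": 45,
}

baseline = time_per_iter["SDE-based GRPO (G=12)"]
speedup = {name: baseline / t for name, t in time_per_iter.items()}

for name, s in speedup.items():
    print(f"{name}: {s:.2f}x per-iteration speed relative to the SDE baseline")
```

This gives roughly 1.68x for 25-step DDIM and 5.29x for 8-step DPM-Solver++. Since the iteration counts in the table are bounds rather than exact figures, any estimate of total time to target derived from them should be read as approximate.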

Generation Quality

Under multi-reward training (composite of HPSv2.1, Pick Score, ImageReward):

Neighbor GRPO matches or exceeds baseline models in both in-domain and out-of-domain metrics (CLIP, UnifiedReward, Aesthetic)
Maintains strong performance even under aggressive NFE reduction (e.g., HPSv2.1 $=0.366$, CLIP $=0.391$ at 8-step DPM-Solver++)

Human Preference Study

Blind A/B tests over 3,200 prompts show Neighbor GRPO images are preferred over DanceGRPO/MixGRPO by 61–72% of users (He et al., 21 Nov 2025).

5. Neighbor-Based Gossip Random Projection for Decentralized Optimization

Separately, the notion of "neighbor GRPO" appears in asynchronous gossip-based random projection algorithms for distributed convex optimization over networks (Lee et al., 2013). In this context:

Each agent maintains a local variable and communicates only with its immediate neighbors in a connected undirected graph
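At each iteration, a randomly chosen agent wakes up, selects one of its neighbors at random, and the pair average their variables; each of the two agents then takes a gradient step and projects the result onto a randomly selected component of its local constraint set
The method is fully asynchronous and requires no central coordination. It converges almost surely under diminishing stepsizes and attains an $O(\alpha)$ mean-square error bound under a constant stepsize $\alpha$ (a different quantity from the softmax temperature $\alpha$ of the surrogate policy)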

The spectral gap $1-\lambda$ of the expected gossip matrix is critical for convergence rates, and the scheme is particularly suited for settings where projection onto the full constraint set is either expensive or infeasible.
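The gossip update described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm exactly as specified by Lee et al.: each agent holds a quadratic objective $\tfrac{1}{2}\|x - c_i\|^2$, each local constraint set is an intersection of halfspaces, and the stepsize is the reciprocal of the agent's own update count. All names (`project_halfspace`, `gossip_random_projection`, the ring graph, the box constraints) are hypothetical.

```python
import numpy as np

def project_halfspace(x, a, b):
    """Euclidean projection of x onto the halfspace {z : a @ z <= b}."""
    violation = a @ x - b
    return x if violation <= 0 else x - violation * a / (a @ a)

def gossip_random_projection(neighbors, targets, constraints, steps, rng):
    """Asynchronous gossip with random projections (illustrative sketch).

    neighbors:   dict agent -> list of adjacent agents (connected, undirected)
    targets:     array (n, d); agent i minimises 0.5 * ||x - targets[i]||^2
    constraints: per-agent lists of (a, b) pairs encoding a @ x <= b
    """
    n, d = targets.shape
    x = np.zeros((n, d))
    updates = np.zeros(n, dtype=int)           # per-agent update counts
    for _ in range(steps):
        i = int(rng.integers(n))               # agent i wakes up at random
        j = int(rng.choice(neighbors[i]))      # and contacts a random neighbor
        avg = 0.5 * (x[i] + x[j])              # pairwise averaging
        for k in (i, j):
            updates[k] += 1
            step = 1.0 / updates[k]            # diminishing stepsize
            grad = avg - targets[k]            # gradient of agent k's objective
            a, b = constraints[k][int(rng.integers(len(constraints[k])))]
            x[k] = project_halfspace(avg - step * grad, a, b)
    return x

rng = np.random.default_rng(0)
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
targets = np.array([[3.0, 3.0], [3.0, 1.0], [1.0, 3.0], [1.0, 1.0]])
e0, e1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# Each agent knows two faces of the box [-1, 1]^2; no agent knows all four.
box = [[(e0, 1.0), (e1, 1.0)], [(e1, 1.0), (-e0, 1.0)],
       [(-e0, 1.0), (-e1, 1.0)], [(-e1, 1.0), (e0, 1.0)]]
x = gossip_random_projection(ring, targets, box, steps=20000, rng=rng)
print(x.round(2))   # each row is one agent's final iterate
```

In this toy problem the sum of the objectives is minimised over the box at the projection of the mean target $(2, 2)$, that is at $(1, 1)$, so all agents should end up close to that point even though no single agent ever projects onto the full constraint set.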

6. Neighbor GRPO in Multi-Agent Reinforcement Learning

GRPO and its variants extend to spatial public goods games (SPGGs) on lattices, where each agent interacts with overlapping neighbor groups (Yang et al., 7 Oct 2025). Group-wise normalization of advantage estimates across $G$ sampled rollouts per agent ensures stable policy adaptation. In settings using a global cooperation constraint (GCC), incentives for cooperation are maximized at moderate global cooperation and dampened at extremes, leading to:

Accelerated and robust onset of cooperation (≥80% cooperation at $r\geq3.6$ and full cooperation at $r\geq5.0$, where $r$ is the enhancement factor of the public goods game)
Suppression of absorbing states (all-defect or all-cooperate)
Reduced run-to-run variance, and robustness across initializations and parameter regimes

The GCC mechanism is implemented by modulating payoffs according to $g_t(1-g_t)$, where $g_t$ is the global cooperation fraction at iteration $t$; this hump-shaped modifier ensures interior stability (Yang et al., 7 Oct 2025).
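The shape of the modifier $g_t(1-g_t)$ is easy to tabulate. The sketch below only evaluates the factor itself; how it is combined with the game payoffs is not specified in this article, so no payoff formula is assumed. The function name `gcc_modifier` is hypothetical.

```python
def gcc_modifier(g):
    """Hump-shaped factor g * (1 - g) for a cooperation fraction g in [0, 1]."""
    if not 0.0 <= g <= 1.0:
        raise ValueError("g must be a fraction in [0, 1]")
    return g * (1.0 - g)

for g in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(g, gcc_modifier(g))
```

The factor vanishes at $g_t = 0$ and $g_t = 1$ and peaks at $0.25$ when $g_t = 0.5$, which is the sense in which the incentive is strongest at moderate cooperation levels and dampened at the extremes.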

7. Significance and Applications

The methods grouped under Neighbor GRPO target model alignment, distributed optimization, and policy learning in settings where determinism, decentralization, or interpretability is essential. By introducing structured group-wise surrogate objectives, initial-noise neighborhoods, and neighbor-based synchronization mechanisms, Neighbor GRPO algorithms achieve:

Full exploitation of deterministic ODE sampling for generative modeling, compatible with high-order solvers
Communication efficiency and scalability in distributed optimization via gossip-based random projections
Rapid, self-regulating policy emergence in multi-agent environments, avoiding pathological equilibria

The unifying theme is leveraging local neighbor information—whether in variable averaging, candidate sampling, or group normalization—to enable scalable, robust optimization under limited communication or noise-injection constraints (He et al., 21 Nov 2025, Lee et al., 2013, Yang et al., 7 Oct 2025).

References (3)

Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models (2025)

Asynchronous Gossip-Based Random Projection Algorithms Over Networks (2013)

GRPO-GCC: Enhancing Cooperation in Spatial Public Goods Games via Group Relative Policy Optimization with Global Cooperation Constraint (2025)
