Neighbor GRPO: Group Relative Policy Optimization

Updated 28 November 2025
  • Neighbor GRPO is a framework that uses deterministic ODE sampling and neighbor-based policies to optimize models in multi-agent and distributed settings.
  • The approach employs a softmax distance-based surrogate policy and structured loss functions to improve convergence and reduce computational cost.
  • It demonstrates practical benefits in generative modeling, decentralized convex optimization, and reinforcement learning through enhanced sample efficiency and robust performance.

Neighbor GRPO refers to algorithms and frameworks that utilize group-relative or neighbor-based policies for optimization, particularly in settings involving distributed networks, reinforcement learning, and generative model alignment. This article covers three principal domains: (1) contrastive ODE policy optimization for flow-based generative models, (2) asynchronous gossip-based random projection for distributed convex optimization, and (3) group-relative policy optimization for neighbor groups in spatial multi-agent games.

1. Deterministic Contrastive Alignment: Neighbor GRPO for Flow Matching Models

Neighbor GRPO builds on prior SDE-based Group Relative Policy Optimization (GRPO) methods and addresses two inefficiencies in aligning deterministic flow-based generative models with human-derived rewards. Earlier approaches, such as Flow-GRPO and DanceGRPO, introduce stochasticity by converting the deterministic ODE into an SDE, which:

  • Precludes use of high-order ODE solvers (e.g., DPM-Solver++), restricting training and inference to first-order integrators
  • Impairs credit assignment, as terminal rewards are diffused across a chain of noise-injection steps, leading to slow learning and unstable convergence

Neighbor GRPO addresses both issues by preserving deterministic ODE sampling at both training and inference time, while still enabling reinforcement-learning-style alignment that is sample-efficient and compatible with high-order solvers (He et al., 21 Nov 2025).

2. Mathematical Formulation and Surrogate Policy

ODE Neighborhood Construction

Given a shared base noise $\varepsilon^* \sim \mathcal{N}(0, I)$, Neighbor GRPO generates a group of $G$ neighbor initializations via

$$\varepsilon^{(i)} = \sqrt{1-\sigma^2}\,\varepsilon^* + \sigma\,\delta^{(i)}, \qquad \delta^{(i)} \sim \mathcal{N}(0, I),\ i=1,\dots,G,\ \sigma\in(0,1)$$

Each $\varepsilon^{(i)}$ is deterministically evolved by the ODE solver to yield a trajectory $x_{1\to0}^{(i)}$.
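The neighborhood construction can be illustrated with a minimal pure-Python sketch. The function name `make_neighbors`, the latent dimension, the group size, and the value of `sigma` are illustrative assumptions, not values taken from the paper; a real pipeline would operate on tensors in the model's latent space.

```python
import math
import random

def make_neighbors(base_noise, num_neighbors, sigma, rng):
    """Return G perturbed copies: sqrt(1 - sigma^2) * eps_star + sigma * delta."""
    if not 0.0 < sigma < 1.0:
        raise ValueError("sigma must lie in (0, 1)")
    keep = math.sqrt(1.0 - sigma ** 2)
    return [
        # Each neighbor draws its own delta ~ N(0, I), one coordinate at a time.
        [keep * e + sigma * rng.gauss(0.0, 1.0) for e in base_noise]
        for _ in range(num_neighbors)
    ]

rng = random.Random(0)
dim = 2000  # hypothetical latent dimension
base_noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # shared eps_star
neighbors = make_neighbors(base_noise, num_neighbors=8, sigma=0.3, rng=rng)
```

Because the two coefficients satisfy $(1-\sigma^2) + \sigma^2 = 1$, each neighbor is itself marginally distributed as $\mathcal{N}(0, I)$, so it remains a valid starting point for the ODE solver.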

Softmax Distance-Based Leaping Policy

A surrogate policy is constructed by randomly selecting one anchor trajectory $x_t^{(\theta)}$ at time $t$ and defining a softmax over squared $L_2$ distances:

$$\pi_\theta\left(x_t^{(i)} \mid \{s_t\}\right) = \frac{\exp\left(-\|x_t^{(i)} - x_t^{(\theta)}\|_2^2 / \alpha\right)}{\sum_{j=1}^G \exp\left(-\|x_t^{(j)} - x_t^{(\theta)}\|_2^2 / \alpha\right)}$$

This "leaping" policy acts as a virtual jump among ODE trajectories, conferring training-time stochasticity with marginals preserved up to $O(\alpha^2)$.
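As a rough illustration of the softmax over squared distances, the following sketch evaluates the policy on toy two-dimensional states. The function name `leaping_policy` and the toy values are assumptions made for the example; the temperature `alpha` plays the role of $\alpha$ in the formula.

```python
import math

def leaping_policy(states, anchor, alpha):
    """Softmax over negative squared L2 distances to the anchor, temperature alpha."""
    logits = [-sum((a - b) ** 2 for a, b in zip(x, anchor)) / alpha for x in states]
    top = max(logits)  # subtract the max logit for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

states = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]  # toy stand-ins for x_t^{(i)}
anchor = [0.1, 0.0]                            # toy stand-in for x_t^{(theta)}
probs = leaping_policy(states, anchor, alpha=1.0)
```

States closer to the anchor receive higher probability, and a smaller `alpha` concentrates the distribution more sharply on the nearest trajectory.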

Surrogate Loss

For terminal rewards $r_i$ and group-wise normalized advantages $A_i$ (optionally reweighted with a quasi-norm), the per-step surrogate loss is

$$L_t(\theta) = -\sum_{i=1}^G A_i\;\log\frac{\exp\left(-\|x_t^{(i)}-x_t^{(\theta)}\|_2^2 / \alpha\right)}{\sum_{j=1}^G\exp\left(-\|x_t^{(j)}-x_t^{(\theta)}\|_2^2 / \alpha\right)}$$

Averaging across $K$ time steps and $B$ anchors yields the final update objective (He et al., 21 Nov 2025).
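The per-step loss can be evaluated numerically as in the sketch below. It assumes the standard GRPO mean/standard-deviation normalization for the advantages, which the text above does not spell out, and it only computes the loss value. In training, the gradient with respect to the anchor would come from an autodiff framework. The helper names are hypothetical.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group (assumed mean/std normalization)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def surrogate_loss(states, anchor, advantages, alpha):
    """L_t = -sum_i A_i * log softmax_i(-||x_i - anchor||^2 / alpha)."""
    logits = [-sum((a - b) ** 2 for a, b in zip(x, anchor)) / alpha for x in states]
    top = max(logits)
    # Log-sum-exp of the logits, computed stably.
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return -sum(adv * (l - log_z) for adv, l in zip(advantages, logits))

states = [[0.0, 0.0], [2.0, 0.0]]           # toy neighbor states
advantages = group_advantages([1.0, 0.0])   # first neighbor has the higher reward
loss_near_good = surrogate_loss(states, [0.1, 0.0], advantages, alpha=1.0)
loss_near_bad = surrogate_loss(states, [1.9, 0.0], advantages, alpha=1.0)
```

In this toy case the loss is lower when the anchor sits near the high-reward neighbor, which matches the intuition that minimizing the loss moves the anchor toward high-advantage samples and away from low-advantage ones.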

3. Theoretical Guarantees and Algorithmic Refinements

Neighbor GRPO’s surrogate policy is shown to be a valid trust-region policy gradient. The underlying geometry matches that of SDE-based GRPO: deterministic ODE samples are incrementally "pulled" toward high-reward perturbations and "pushed" away from low-reward ones, but without the need for SDE noise injection.

Key refinements include:

  • Symmetric anchor sampling: By the Johnson–Lindenstrauss lemma, neighbor initializations are nearly equidistant, enabling computation reduction by updating only on $B \ll G$ sampled anchors per iteration
  • Group-wise quasi-norm reweighting: For reward-flattening groups, a quasi-norm ($\ell_p$, $p<2$) normalization is applied to advantages, increasing update selectivity in uninformative settings
  • High-order ODE solver usage: The procedure is compatible with DPM-Solver++ and similar solvers, significantly reducing the number of function evaluations per step
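The near-equidistance of neighbor initializations can be checked numerically. Under the construction $\varepsilon^{(i)} = \sqrt{1-\sigma^2}\,\varepsilon^* + \sigma\,\delta^{(i)}$, the difference between two neighbors is $\sigma(\delta^{(i)} - \delta^{(j)})$, so in dimension $d$ the pairwise distance concentrates around $\sigma\sqrt{2d}$. The sketch below is only a sanity check with assumed values for the dimension, group size, and $\sigma$; it is not part of the published procedure.

```python
import itertools
import math
import random

rng = random.Random(1)
dim, group_size, sigma = 4000, 6, 0.3  # illustrative values only

base = [rng.gauss(0.0, 1.0) for _ in range(dim)]
keep = math.sqrt(1.0 - sigma ** 2)
neighbors = [
    [keep * e + sigma * rng.gauss(0.0, 1.0) for e in base]
    for _ in range(group_size)
]

# Euclidean distance for every unordered pair of neighbor initializations.
dists = [math.dist(a, b) for a, b in itertools.combinations(neighbors, 2)]
expected = sigma * math.sqrt(2 * dim)
relative_spread = (max(dists) - min(dists)) / expected
```

In high dimension `relative_spread` is small, which is the property that makes a randomly sampled subset of anchors roughly interchangeable with the full group.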

4. Empirical Results

Efficiency and Convergence

Experiments on FLUX.1-dev (latent flow-matching) with 25-step DDIM and 8/16-step DPM-Solver++ demonstrate:

| Method | NFE$_\theta$ | Time/Iter (s) | Iterations to Target |
|---|---|---|---|
| SDE-based GRPO (G=12) | 14 | 238 | $\gtrsim 100$ |
| Neighbor GRPO (25-step DDIM) | 4 | 142 | $\lesssim 50$ |
| Neighbor GRPO (8-step DPM++) | 1.33 | 45 | $\lesssim 50$ |

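Neighbor GRPO roughly halves the number of iterations needed to reach the target (from $\gtrsim 100$ to $\lesssim 50$) and reduces time per iteration from 238 s to 142 s with 25-step DDIM, or to 45 s with 8-step DPM-Solver++ (He et al., 21 Nov 2025).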

Generation Quality

Under multi-reward training (composite of HPSv2.1, Pick Score, ImageReward):

  • Neighbor GRPO matches or exceeds baseline models in both in-domain and out-of-domain metrics (CLIP, UnifiedReward, Aesthetic)
  • Maintains strong performance even under aggressive NFE reduction (e.g., HPSv2.1 $= 0.366$, CLIP $= 0.391$ at 8-step DPM-Solver++)

Human Preference Study

Blind A/B tests over 3,200 prompts show Neighbor GRPO images are preferred over DanceGRPO/MixGRPO by 61–72% of users (He et al., 21 Nov 2025).

5. Neighbor-Based Gossip Random Projection for Decentralized Optimization

Orthogonally, the notion of "neighbor GRPO" appears in asynchronous gossip-based random projection algorithms for distributed convex optimization over networks (Lee et al., 2013). In this context:

  • Each agent maintains a local variable and communicates only with its immediate neighbors in a connected undirected graph
  • At each iteration, an agent wakes up at random, selects a random neighbor, averages their variables, and both agents perform projected gradient descent onto a random local constraint
  • The method is fully asynchronous, requiring no central coordination, and provides almost-sure convergence under diminishing stepsizes and an $O(\alpha)$ mean-square error under a constant stepsize $\alpha$

The spectral gap $1-\lambda$ of the expected gossip matrix is critical for convergence rates, and the scheme is particularly suited for settings where projection onto the full constraint set is either expensive or infeasible.
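The wake, average, step, and project cycle described above can be sketched on a toy scalar problem. Everything specific here is an assumption for illustration: four agents on a ring, quadratic local objectives $f_i(x) = \tfrac{1}{2}(x - c_i)^2$, interval constraints whose intersection is $[2, 5]$, and a stepsize of one over each agent's own update count. With these choices the constrained minimizer is $x = 2$.

```python
import random

def project(x, interval):
    """Project a scalar onto a closed interval [lo, hi]."""
    lo, hi = interval
    return min(max(x, lo), hi)

def gossip_random_projection(targets, constraints, neighbors, num_iters, rng):
    """Asynchronous gossip with random projections on a scalar toy problem."""
    x = list(targets)              # each agent starts at its own local minimizer
    counts = [0] * len(targets)    # per-agent update counters for the stepsize
    for _ in range(num_iters):
        i = rng.randrange(len(x))          # a random agent wakes up
        j = rng.choice(neighbors[i])       # and contacts a random neighbor
        avg = 0.5 * (x[i] + x[j])          # pairwise averaging
        for agent in (i, j):
            counts[agent] += 1
            step = 1.0 / counts[agent]     # diminishing stepsize
            grad = avg - targets[agent]    # gradient of 0.5 * (x - c)^2 at avg
            # Project onto ONE randomly chosen local constraint, not the full set.
            x[agent] = project(avg - step * grad, rng.choice(constraints[agent]))
    return x

targets = [0.0, 1.0, 2.0, 3.0]
constraints = [
    [(2.0, 9.0), (-9.0, 6.0)],
    [(1.0, 5.0), (0.0, 7.0)],
    [(2.0, 8.0), (-1.0, 5.0)],
    [(0.0, 6.0), (1.5, 9.0)],
]
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
estimates = gossip_random_projection(targets, constraints, ring, 20000, random.Random(0))
```

No agent ever projects onto the full intersection $[2, 5]$; feasibility and consensus emerge only through repeated random local projections and pairwise averaging.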

6. Neighbor GRPO in Multi-Agent Reinforcement Learning

GRPO and its variants extend to spatial public goods games (SPGGs) on lattices, where each agent interacts with overlapping neighbor groups (Yang et al., 7 Oct 2025). Group-wise normalization of advantage estimates across $G$ sampled rollouts per agent ensures stable policy adaptation. In settings using a global cooperation constraint (GCC), incentives for cooperation are maximized at moderate global cooperation and dampened at extremes, leading to:

  • Accelerated and robust onset of cooperation (≥80% at $r \geq 3.6$ and full cooperation at $r \geq 5.0$)
  • Suppression of absorbing states (all-defect or all-cooperate)
  • Reduced run-to-run variance, and robustness across initializations and parameter regimes

The GCC mechanism is implemented by modulating payoffs according to $g_t(1-g_t)$, where $g_t$ is the global cooperation fraction at iteration $t$; this hump-shaped modifier ensures interior stability (Yang et al., 7 Oct 2025).
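The shape of the modifier $g_t(1-g_t)$ is easy to tabulate. The sketch below only evaluates the modifier itself; how it enters the payoff (for example, as a multiplicative or additive term) is not specified here, so that part is deliberately left out. The function name `gcc_modifier` is hypothetical.

```python
def gcc_modifier(coop_fraction):
    """Hump-shaped factor g * (1 - g) for a global cooperation fraction g in [0, 1]."""
    if not 0.0 <= coop_fraction <= 1.0:
        raise ValueError("cooperation fraction must lie in [0, 1]")
    return coop_fraction * (1.0 - coop_fraction)

# Evaluate the modifier at a few cooperation levels.
values = {g: gcc_modifier(g) for g in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

The modifier vanishes at all-defect ($g_t = 0$) and all-cooperate ($g_t = 1$) and peaks at $0.25$ when $g_t = 0.5$, which is the sense in which incentives are strongest at moderate cooperation levels.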

7. Significance and Applications

Neighbor GRPO defines a new paradigm for model alignment, distributed optimization, and policy learning in settings where determinism, decentralization, or interpretability is essential. By introducing structured group-wise surrogate objectives, initial-noise neighborhoods, and neighbor-based synchronization mechanisms, Neighbor GRPO algorithms achieve:

  • Full exploitation of deterministic ODE sampling for generative modeling, compatible with high-order solvers
  • Communication efficiency and scalability in distributed optimization via gossip-based random projections
  • Rapid, self-regulating policy emergence in multi-agent environments, avoiding pathological equilibria

The unifying theme is leveraging local neighbor information—whether in variable averaging, candidate sampling, or group normalization—to enable scalable, robust optimization under limited communication or noise-injection constraints (He et al., 21 Nov 2025, Lee et al., 2013, Yang et al., 7 Oct 2025).
