Sync-GRPO: Synchronized Policy Optimization
- Sync-GRPO is a group-relative policy optimization framework that samples multiple candidate trajectories per prompt and applies PPO-like updates.
- It computes normalized, group-relative advantages from verifiable rewards to stabilize training and enhance policy success.
- Advanced implementations integrate speculative decoding and straggler-aware group control to balance computational efficiency with training performance.
Sync-GRPO is a synchronized implementation of Group Relative Policy Optimization in which, for each input prompt, a group of candidate trajectories or completions is sampled, scored by a reward model, converted into relative advantages within the group, and optimized with a PPO-like clipped policy gradient synchronized across many workers (Xu et al., 19 Nov 2025). In DeepSeek-R1-style RL with verifiable rewards, the same GRPO mechanism can be analyzed as a KL-regularized contrastive loss over synthetic data sampled from the old policy, with policy dynamics that admit a closed-form non-parametric optimum and a fixed-point analysis of success probability (Mroueh, 9 Mar 2025). In the synchronous setting, all completions in a group are generated under the same on-policy parameters and must finish before reward computation, group-relative normalization, and parameter update, which gives Sync-GRPO its stability and reproducibility but also makes it sensitive to rollout-length heterogeneity and systems bottlenecks (Khan et al., 1 Jun 2026).
1. Definition, scope, and nomenclature
“Sync-GRPO” in DeepSeek-R1 is one concrete implementation of GRPO for LLMs: for each input prompt it samples a group of candidate trajectories, computes group-relative advantages, and applies a PPO-like clipped policy gradient, synchronized across many workers (Xu et al., 19 Nov 2025). In synchronous training, all completions for the group must finish before rewards are computed, group-relative baselines are formed, and a single policy update is applied; all rollouts in the batch are strictly on-policy and generated by the same parameters (Khan et al., 1 Jun 2026).
The method sits inside the broader GRPO family introduced in DeepSeekMath and later used to train the DeepSeek-R1 models, where it promotes the reasoning capabilities of LLMs using verifiable or binary rewards (Mroueh, 9 Mar 2025). A common use case is RL with verifiable rewards for math, reasoning, or code generation, where correctness can be checked by exact answer matching, execution success, unit tests, or format constraints (Mroueh, 9 Mar 2025).
The name is adjacent to, but distinct from, “Syn-GRPO (Synthesis-GRPO),” which denotes a multimodal framework that augments GRPO with an online image-level data synthesis loop and a diversity-aware reward for MLLM perception reasoning (Huang et al., 24 Nov 2025). This suggests that “Sync-GRPO” should be reserved for the synchronized training regime of GRPO, whereas “Syn-GRPO” refers to a different method whose defining component is self-evolving data synthesis rather than synchronization.
2. Core optimization mechanics
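For a prompt $q$, GRPO samples a group of $G$ outputs
$\{o_1,\dots,o_G\}\sim\pi_{\theta_{\text{old}}}(\cdot\mid q),$
computes rewards $r_1,\dots,r_G$, and defines the group-relative advantage by shift-and-scale normalization,
$A_i=\frac{r_i-\operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}.$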
In DeepSeek-R1 / Sync-GRPO style, this normalized advantage replaces a learned critic and supplies a PPO-like policy-gradient signal within each group (Vojnovic et al., 25 Feb 2025).
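The shift-and-scale normalization can be illustrated with a minimal sketch. The function name, the small `eps` guard against a zero standard deviation, and the choice of the population (rather than sample) standard deviation are assumptions made here for illustration, not details taken from the cited papers.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Shift-and-scale normalize the rewards of one prompt's group."""
    g = len(rewards)
    mean = sum(rewards) / g
    # Population standard deviation over the group (assumption: some
    # implementations use the sample standard deviation instead).
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    # eps (hypothetical) keeps the division finite when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]

# Two successes and two failures in a group of four.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

In this sketch a group whose completions all receive the same reward yields zero advantages, so that prompt contributes nothing to the policy gradient.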
A standard clipped formulation uses the policy ratio
$\frac{\pi_\theta(o\mid q)}{\pi_{\theta_{\text{old}}}(o\mid q)}$
inside a clipped PPO-style surrogate together with a KL penalty to a reference policy (Mroueh, 9 Mar 2025). In the notation used for GRPO-RM’s recap of the LLM setting, the core objective is
$\mathcal{J}_{\text{GRPO}}(\theta)= \mathbb{E}\Big[ \text{clip}\Big( \frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\text{old}}}(o_i\mid q)},1-\varepsilon,1+\varepsilon \Big)A_i \Big] -\beta \,\mathbb{D}_{\mathrm{KL}}(\pi_\theta\|\pi_{\text{ref}}),$
with the group generated from the old policy (Xu et al., 19 Nov 2025).
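As a rough sketch of how this objective could be evaluated for one group, the function below follows the formula as written in the text: a clipped ratio times the advantage, averaged over the group, minus a KL penalty. The function name, the default values of `eps` and `beta`, and passing the KL term in as a precomputed number are assumptions for illustration. The widely used PPO surrogate additionally takes the minimum of the clipped and unclipped terms, which this sketch does not do.

```python
import math

def grpo_objective(logp_new, logp_old, advantages, kl_to_ref, eps=0.2, beta=0.04):
    """Clipped-ratio surrogate averaged over one group, minus a KL penalty.

    logp_new / logp_old: sequence log-probabilities under pi_theta and pi_theta_old.
    kl_to_ref: a precomputed estimate of KL(pi_theta || pi_ref) (assumption).
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                # pi_theta / pi_theta_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)  # clip(ratio, 1-eps, 1+eps)
        terms.append(clipped * adv)
    return sum(terms) / len(terms) - beta * kl_to_ref

# On the first gradient step the two policies coincide, so every ratio is 1.
on_policy = grpo_objective([-1.0, -2.0], [-1.0, -2.0], [1.0, -1.0], kl_to_ref=0.0)
# A ratio of e (about 2.72) is clipped to 1 + eps = 1.2.
clipped_case = grpo_objective([0.0], [-1.0], [1.0], kl_to_ref=0.0)
```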
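For verifiable binary rewards $r(q,o)\in\{0,1\}$, the stabilized GRPO analysis writes
$A(q,o)=\frac{r(q,o)-p}{\sqrt{p(1-p)+\delta}},$
where $p$ is the old policy’s success probability for prompt $q$ and $\delta>0$ is a small stabilizing constant. This yields adaptive positive and negative weights
$\omega^{+}(p)=\frac{1-p}{\sqrt{p(1-p)+\delta}},\qquad \omega^{-}(p)=\frac{p}{\sqrt{p(1-p)+\delta}},$
so that a success receives advantage $+\omega^{+}(p)$ and a failure receives $-\omega^{-}(p)$, and successes and failures are up-weighted and down-weighted asymmetrically as a function of current competence (Mroueh, 9 Mar 2025).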
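The asymmetry can be checked numerically with a small sketch. It assumes the stabilized binary advantage has the form $(r-p)/\sqrt{p(1-p)+\delta}$, so a success is weighted by $(1-p)/\sqrt{p(1-p)+\delta}$ and a failure by $p/\sqrt{p(1-p)+\delta}$. The function name and the value of the stabilizer `delta` are hypothetical.

```python
import math

def adaptive_weights(p, delta=1e-4):
    """Return (weight on successes, weight on failures) for success probability p.

    Assumes the stabilized whitening (r - p) / sqrt(p * (1 - p) + delta);
    delta is an illustrative stabilizer value, not one from the paper.
    """
    denom = math.sqrt(p * (1.0 - p) + delta)
    return (1.0 - p) / denom, p / denom

hard = adaptive_weights(0.1)      # rarely solved prompt: successes weigh more
balanced = adaptive_weights(0.5)  # both weights coincide
easy = adaptive_weights(0.9)      # mostly solved prompt: failures weigh more
```

Under these assumptions a rare success on a hard prompt is rewarded strongly, while a rare failure on an easy prompt is penalized strongly.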
In synchronous operation, these statistics are inherently group-local. Correctness of the whitening statistics depends on all $G$ samples for a prompt, so a distributed implementation must either place all $G$ samples for one prompt on a designated worker or reduce the group statistics across workers before computing the policy loss (Mroueh, 9 Mar 2025).
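The second option, reducing group statistics across workers, might look like the sketch below. Each worker contributes the count, sum, and sum of squares of its local rewards for a prompt, and the group mean and standard deviation are recovered from the reduced moments. The function `all_reduce_sum` is a plain-Python stand-in for a collective operation, and all names are hypothetical.

```python
import math

def local_moments(rewards):
    """Per-worker sufficient statistics for one prompt: (count, sum, sum of squares)."""
    return (len(rewards), sum(rewards), sum(r * r for r in rewards))

def all_reduce_sum(per_worker_moments):
    """Stand-in for a sum all-reduce: element-wise sum across workers."""
    return tuple(sum(values) for values in zip(*per_worker_moments))

def group_stats(moments):
    """Recover the group mean and population standard deviation."""
    n, total, total_sq = moments
    mean = total / n
    variance = max(total_sq / n - mean * mean, 0.0)  # clamp rounding error
    return mean, math.sqrt(variance)

# One prompt whose G = 6 completions are spread over three workers.
shards = [[1.0, 0.0], [0.0, 0.0, 1.0], [1.0]]
mean, std = group_stats(all_reduce_sum([local_moments(s) for s in shards]))
```

Every worker can then whiten its local rewards with the same `mean` and `std`, which matches what a single worker holding the whole group would compute.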
3. Theoretical interpretations and policy dynamics
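One analytical view shows that GRPO with verifiable rewards can be written as a KL-regularized contrastive loss, where the contrastive samples are synthetic data sampled from the old policy (Mroueh, 9 Mar 2025). For a fixed prompt $q$, the no-clipping objective becomes
$\max_{\pi_\theta}\;\omega^{+}(p)\,\mathbb{E}_{o\sim\pi_{\theta_{\text{old}}}(\cdot\mid q)}\Big[\frac{\pi_\theta(o\mid q)}{\pi_{\theta_{\text{old}}}(o\mid q)}\,\mathbf{1}_{r(q,o)=1}\Big]-\omega^{-}(p)\,\mathbb{E}_{o\sim\pi_{\theta_{\text{old}}}(\cdot\mid q)}\Big[\frac{\pi_\theta(o\mid q)}{\pi_{\theta_{\text{old}}}(o\mid q)}\,\mathbf{1}_{r(q,o)=0}\Big]-\beta\,\mathbb{D}_{\mathrm{KL}}(\pi_\theta\|\pi_{\text{ref}}),$
where $p$ is the old policy’s success probability and $\omega^{+}(p)$, $\omega^{-}(p)$ are the adaptive positive and negative weights, so positive samples are outputs with $r(q,o)=1$, negative samples are outputs with $r(q,o)=0$, and the KL term is a soft trust region to the reference policy. In the non-parametric setting, the optimal policy at iteration $n$ is an exponential tilt of the reference policy,
$\pi_n(o\mid q)\propto\pi_{\text{ref}}(o\mid q)\exp\Big(\frac{1}{\beta}\big(\omega^{+}(p_{n-1})\,\mathbf{1}_{r(q,o)=1}-\omega^{-}(p_{n-1})\,\mathbf{1}_{r(q,o)=0}\big)\Big),$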
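and the success probability $p_n(q)$ evolves through a one-dimensional recurrence $p_n(q)=h(p_{n-1}(q))$. The paper shows that fixed points $p^{*}$ satisfy $p^{*}>p_{\text{ref}}$, where $p_{\text{ref}}$ is the success probability of the reference policy, under mild conditions on $\beta$, thereby demonstrating that GRPO effectively amplifies the probability of success of the policy (Mroueh, 9 Mar 2025).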
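The recurrence can be explored numerically with a short sketch. It assumes the non-parametric optimum is $\pi_n\propto\pi_{\text{ref}}\exp(A/\beta)$ with the stabilized binary advantage $A=(r-p_{n-1})/\sqrt{p_{n-1}(1-p_{n-1})+\delta}$. Summing that tilt over successful outputs gives the closed form used in the code, which is derived here for illustration rather than quoted from the paper. The function names and the values of `beta` and `delta` are hypothetical.

```python
import math

def next_success_prob(p_prev, p_ref, beta, delta=1e-4):
    """One step of the success-probability recurrence under the stated assumptions.

    Tilting pi_ref by exp(A / beta) multiplies the odds of success by
    exp((w_pos + w_neg) / beta), and w_pos + w_neg = 1 / sqrt(p(1-p) + delta).
    """
    gap = 1.0 / math.sqrt(p_prev * (1.0 - p_prev) + delta)  # w_pos + w_neg
    odds_against = (1.0 - p_ref) / p_ref                    # requires 0 < p_ref < 1
    return 1.0 / (1.0 + odds_against * math.exp(-gap / beta))

def iterate(p_ref, beta, steps=50):
    """Iterate the recurrence starting from the reference policy."""
    trajectory = [p_ref]
    for _ in range(steps):
        trajectory.append(next_success_prob(trajectory[-1], p_ref, beta))
    return trajectory

trajectory = iterate(p_ref=0.2, beta=1.0)
```

In this form the odds of success are always multiplied by a factor greater than one, so every iterate, and therefore any fixed point, lies above $p_{\text{ref}}$. Larger `beta` shrinks the factor and keeps the policy closer to the reference.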
A second analytical view studies the stationary policies of GRPO through the reward-preference model induced by group-relative normalization (Vojnovic et al., 25 Feb 2025). In that account, the penalty term used in GRPO behaves at stationarity like a reverse KL regularizer, $\mathbb{D}_{\mathrm{KL}}(\pi_{\text{ref}}\|\pi_\theta)$, rather than the direct KL $\mathbb{D}_{\mathrm{KL}}(\pi_\theta\|\pi_{\text{ref}})$ commonly associated with RLHF. The resulting aggregation of preferences differs fundamentally from standard logarithmic pooling. For groups of size two, the reward preference model corresponds to pairwise comparison preferences, and in the large-group limit the reward term approaches a mean–standard-deviation normalized expected reward. This suggests that Sync-GRPO is not merely PPO with a batch baseline; it implements a specific group-relative preference aggregation whose behavior depends on group size, normalization, and the direction of KL regularization.
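The group-of-two case can be worked out directly. The derivation below assumes two distinct rewards $r_1\neq r_2$, the population standard deviation, and no stabilizing constant. With the sample standard deviation the magnitude changes to $1/\sqrt{2}$, but the conclusion is the same.

```latex
\bar r = \tfrac{1}{2}(r_1 + r_2), \qquad
\sigma = \sqrt{\tfrac{1}{2}\big[(r_1-\bar r)^2 + (r_2-\bar r)^2\big]}
       = \tfrac{1}{2}\,\lvert r_1 - r_2 \rvert ,
\\[4pt]
A_1 = \frac{r_1 - \bar r}{\sigma}
    = \frac{\tfrac{1}{2}(r_1 - r_2)}{\tfrac{1}{2}\lvert r_1 - r_2 \rvert}
    = \operatorname{sign}(r_1 - r_2), \qquad A_2 = -A_1 .
```

The advantages therefore retain only which of the two outputs scored higher, not by how much, which is why the size-two case reduces to a pairwise comparison.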
4. Synchronization, systems design, and acceleration
The synchronized implementation is organized around a fixed old policy $\pi_{\theta_{\text{old}}}$, a fixed reference policy $\pi_{\text{ref}}$, and batched group generation. A typical outer iteration samples a batch of prompts, sets $\theta_{\text{old}}\leftarrow\theta$, samples $G$ outputs per prompt, computes rewards and group-wise statistics (the per-group reward mean and standard deviation), then runs a fixed number of gradient-ascent steps on the GRPO objective (Mroueh, 9 Mar 2025). In a synchronized implementation, the sampling, reward, and statistics steps can run on multiple workers in parallel: each worker receives a shard of the prompt batch and computes local statistics and advantages, and gradients are then aggregated synchronously, for example via all-reduce, before updating $\theta$ (Mroueh, 9 Mar 2025).
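The outer iteration can be sketched with a toy tabular policy. Here each prompt has its own logit vector over a few discrete outputs, the reward is 1 when the sampled output equals a target, and workers are simulated sequentially. All names, the sharding scheme, the learning rate, and the single on-policy gradient step (ratio equal to 1, no clipping or KL term) are simplifying assumptions, not the training setup of any cited paper.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def worker_gradient(logits_by_prompt, shard, targets, group_size, rng):
    """One worker: sample a full group per prompt, whiten rewards, return gradients."""
    grads = {}
    for q in shard:
        probs = softmax(logits_by_prompt[q])
        outputs = rng.choices(range(len(probs)), weights=probs, k=group_size)
        rewards = [1.0 if o == targets[q] else 0.0 for o in outputs]
        mean = sum(rewards) / group_size
        std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / group_size)
        grad = [0.0] * len(probs)
        for o, r in zip(outputs, rewards):
            adv = (r - mean) / (std + 1e-8)
            # Gradient of log softmax with respect to the logits: one_hot(o) - probs.
            for k, p in enumerate(probs):
                grad[k] += adv * ((1.0 if k == o else 0.0) - p) / group_size
        grads[q] = grad
    return grads

def sync_grpo_step(logits_by_prompt, targets, num_workers=2, group_size=8, lr=0.5, seed=0):
    prompts = sorted(logits_by_prompt)
    shards = [prompts[w::num_workers] for w in range(num_workers)]
    # Barrier: every worker finishes all of its groups under the same (old)
    # parameters before any update is applied.
    per_worker = [
        worker_gradient(logits_by_prompt, shard, targets, group_size, random.Random(seed + w))
        for w, shard in enumerate(shards)
    ]
    # Stand-in for an all-reduce: combine worker gradients, then update once.
    for grads in per_worker:
        for q, grad in grads.items():
            logits_by_prompt[q] = [x + lr * g for x, g in zip(logits_by_prompt[q], grad)]
    return logits_by_prompt

logits = {"q0": [0.0, 0.0, 0.0], "q1": [0.0, 0.0, 0.0, 0.0]}
targets = {"q0": 1, "q1": 2}
before = {q: softmax(logits[q])[targets[q]] for q in logits}
for step in range(5):
    sync_grpo_step(logits, targets, seed=step)
after = {q: softmax(logits[q])[targets[q]] for q in logits}
```

Because the whitened advantages in a group sum to zero, the logit of the target output can only move up and the others only down in this toy, so the success probability never decreases. Groups with identical rewards leave the parameters unchanged.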
This synchronization is statistically convenient but computationally expensive because rollout generation dominates wall-clock cost. In practical GRPO training, the generation phase accounts for 91–98% of overall training time, and long, variable-length reasoning traces create a pronounced tail in which effective concurrency collapses as shorter sequences finish (Zhang et al., 26 Sep 2025). FastGRPO addresses this bottleneck with concurrency-aware speculative decoding and online draft learning. Using the current number of active sequences and the measured concurrency threshold at which the target model becomes compute-bound, FastGRPO sets its speculation parameters and then adapts draft branching and depth accordingly, so speculation is shallow under high concurrency and deep in the low-concurrency tail. Online draft learning updates the draft model every GRPO iteration using cached target-model states, which keeps average accepted length from degrading as the target policy drifts. Across several mathematical reasoning datasets and models, the method is reported to achieve end-to-end speedups (Zhang et al., 26 Sep 2025).
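The qualitative behavior, shallow speculation when many sequences are active and deeper speculation in the low-concurrency tail, can be illustrated with a deliberately simple rule. This is not FastGRPO's formula: the function name, the ratio-based rule, and the `max_depth` cap are hypothetical choices made only to show the direction of the adaptation.

```python
def speculation_depth(active_sequences, concurrency_threshold, max_depth=8):
    """Illustrative rule only: spend spare compute headroom on deeper drafts.

    When the number of active sequences is at or above the threshold, the
    target model is assumed to be compute-bound and speculation stays at
    depth 1. Below the threshold, depth grows with the headroom ratio.
    """
    if active_sequences < 1:
        raise ValueError("need at least one active sequence")
    headroom = concurrency_threshold / active_sequences
    return max(1, min(max_depth, int(headroom)))

start_of_rollout = speculation_depth(256, concurrency_threshold=64)
mid_rollout = speculation_depth(32, concurrency_threshold=64)
long_tail = speculation_depth(4, concurrency_threshold=64)
```

A real system would presumably measure the threshold on the target hardware and adapt branching as well as depth, as the text describes.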
A second systems issue is the straggler problem in synchronous GRPO and DAPO, which is quantified by a straggler ratio computed from the completion lengths within a group. Because all completions in the group must finish before rewards and updates, one unusually long completion can stall reward computation and parameter updates for the entire group (Khan et al., 1 Jun 2026). Straggler-Aware Group Control (SAGC) treats group-size selection as an online constrained optimization problem with a utility objective and a target long-run straggler rate. It maintains discounted Beta posteriors over straggler probability for each candidate group size, updates a dual variable for the straggler-rate constraint, and chooses the next group size from a local neighborhood by maximizing sampled utility minus straggler penalty. Across synchronous GRPO and DAPO, and on top of vanilla and strong engineered baselines, SAGC consistently reduces straggler incidence and improves wall-clock efficiency while achieving competitive or better training reward, with gains that transfer to downstream reasoning benchmarks (Khan et al., 1 Jun 2026).
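The control loop described above can be sketched as follows. The class name, the discounting scheme, the projected dual-ascent step, the Thompson-style sampling of the straggler probability, and the example utility are all assumptions for illustration and should not be read as SAGC's actual update rules.

```python
import random

class StragglerAwareController:
    """Toy group-size controller with discounted Beta posteriors and a dual variable."""

    def __init__(self, candidates, target_rate=0.1, discount=0.95, dual_lr=0.5, seed=0):
        self.candidates = sorted(candidates)
        self.target_rate = target_rate
        self.discount = discount
        self.dual_lr = dual_lr
        self.dual = 0.0
        # Beta(alpha, beta) over straggler probability, one per candidate size.
        self.posterior = {g: [1.0, 1.0] for g in self.candidates}
        self.rng = random.Random(seed)

    def update(self, group_size, straggled):
        hit = 1.0 if straggled else 0.0
        alpha, beta = self.posterior[group_size]
        # Discount accumulated evidence but keep the Beta(1, 1) prior intact.
        alpha = 1.0 + self.discount * (alpha - 1.0) + hit
        beta = 1.0 + self.discount * (beta - 1.0) + (1.0 - hit)
        self.posterior[group_size] = [alpha, beta]
        # Projected dual ascent on the constraint "straggler rate <= target".
        self.dual = max(0.0, self.dual + self.dual_lr * (hit - self.target_rate))

    def choose(self, current, utility):
        i = self.candidates.index(current)
        neighborhood = self.candidates[max(0, i - 1): i + 2]

        def score(g):
            alpha, beta = self.posterior[g]
            sampled_rate = self.rng.betavariate(alpha, beta)
            return utility(g) - self.dual * sampled_rate

        return max(neighborhood, key=score)

controller = StragglerAwareController([4, 8, 16, 32])
for _ in range(5):
    controller.update(16, straggled=True)
next_size = controller.choose(16, utility=lambda g: g ** 0.5)
```

Restricting the choice to neighboring sizes keeps consecutive groups comparable, and the dual variable grows only while the observed straggler rate exceeds the target.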
5. Related variants and neighboring formulations
The central Sync-GRPO pattern, group-relative optimization over multiple candidates for the same input, has already been generalized beyond autoregressive text generation. GRPO-RM asks whether the same group-relative RL idea can be made to work for representation models that do not generate token sequences or trajectories. Its answer is to replace sampled sequences with a predefined output set, treat the probability distribution over candidate labels as a policy, compute group-relative advantages over the label set, and optimize with a PPO-style clipped objective that omits the KL-to-reference term (Xu et al., 19 Nov 2025). This suggests that Sync-GRPO is best understood as a general group-relative policy-optimization template rather than a mechanism tied exclusively to LLM decoding.
A different extension appears in text-to-speech. Multi-Reward GRPO for single-codebook TTS LLMs applies the GRPO framework to codec-token generation with a scalarized reward combining intelligibility, speaker similarity, duration consistency, entropy regularization, and LLM-annotated prosody alignment. The method retains group-wise advantage normalization and PPO-style clipping, but the novelty is in the reward construction rather than the optimization mechanics (Zhong et al., 26 Nov 2025). This places Sync-GRPO inside a broader class of synchronized group-based RL methods in which heterogeneous reward modules are reduced to a single trajectory-level scalar before group-relative normalization.
The neighboring term “Syn-GRPO” denotes yet another branch of the literature. Syn-GRPO, or Synthesis-GRPO, is a reinforcement-learning framework for multimodal LLMs that combines a GRPO workflow with a data server that synthesizes new images, uses a diversity reward to supervise predicted image descriptions, and runs the generation loop in a decoupled and asynchronous manner (Huang et al., 24 Nov 2025). The similarity in naming can obscure a substantive difference: Sync-GRPO concerns synchronized rollout-and-update semantics, whereas Syn-GRPO concerns self-evolving data synthesis.
6. Limitations, misconceptions, and open technical issues
Several limitations arise directly from the current theoretical analyses. The cleanest closed-form dynamics assume binary rewards, stabilized whitening, and often the non-parametric optimization over distributions rather than a neural network parameterization (Mroueh, 9 Mar 2025). The parametric training of real LLMs only approximates these dynamics, and the paper’s convergence result requires that parametric policies stay close to the ideal non-parametric iterates in total variation. The per-prompt analysis also treats prompts independently, whereas actual training averages over the prompt distribution, so global behavior is a mixture of heterogeneous prompt-specific recurrences (Mroueh, 9 Mar 2025).
At the systems level, a larger group size improves the quality of group-relative baselines but increases the probability of straggler events, creating a persistent tension between statistical efficiency and wall-clock efficiency in synchronous on-policy RL (Khan et al., 1 Jun 2026). FastGRPO adds that speculative decoding tuned for low-concurrency inference may become slower than vanilla decoding under high-concurrency GRPO if drafting and verification are not adapted to current load (Zhang et al., 26 Sep 2025). A plausible implication is that “Sync-GRPO” is best regarded as a joint algorithm-and-systems object: group size, rollout length distribution, concurrency profile, and synchronization policy are part of the method’s effective behavior, not merely implementation detail.
A common misconception is that GRPO is purely an outcome-reward method with no process-level credit assignment. Under within-group overlap of token prefixes, GRPO induces a non-trivial process reward model: the implicit step-level reward for a shared prefix is the mean outcome reward of all completions that share that prefix (Sullivan, 25 Sep 2025). The same work identifies a flaw in the standard objective: non-uniformly distributed process steps are weighted in proportion to their multiplicity, that is, the size of the process set they belong to, which can hinder exploration when large process sets have positive step-level advantage and hinder exploitation when a necessary shared prefix is dragged down by low-reward continuations. Its proposed fix, $\lambda$-GRPO, rescales each token by the inverse size of its process set and yields higher validation accuracy, higher downstream reasoning performance, and faster attainment of peak performance than standard GRPO in that study (Sullivan, 25 Sep 2025).
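The implicit process reward and the inverse-multiplicity rescaling can be made concrete with a token-level sketch. It treats each token's process set as the set of completions in the group that share the prefix up to and including that token, which is a simplification assumed here. The function names and the toy group are hypothetical.

```python
from collections import defaultdict

def prefix_members(completions):
    """Map every token prefix to the indices of the completions that share it."""
    members = defaultdict(list)
    for i, tokens in enumerate(completions):
        for t in range(1, len(tokens) + 1):
            members[tuple(tokens[:t])].append(i)
    return members

def step_rewards(completions, rewards):
    """Implicit step-level reward: mean outcome reward over completions sharing the prefix."""
    return {
        prefix: sum(rewards[i] for i in idx) / len(idx)
        for prefix, idx in prefix_members(completions).items()
    }

def inverse_multiplicity_weights(completions):
    """Per-token weight 1 / |process set|, so a shared prefix is not over-counted."""
    members = prefix_members(completions)
    return [
        [1.0 / len(members[tuple(tokens[:t])]) for t in range(1, len(tokens) + 1)]
        for tokens in completions
    ]

# Three completions for one prompt, all sharing the first token.
group = [["a", "b", "c"], ["a", "b", "d"], ["a", "e"]]
outcome = [1.0, 0.0, 0.0]
process_reward = step_rewards(group, outcome)
weights = inverse_multiplicity_weights(group)
```

In this toy group the prefix `("a",)` appears in all three completions, so without rescaling its tokens would be counted three times in the loss. The weights reduce each occurrence to one third.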
Taken together, these results depict Sync-GRPO as a synchronized, group-relative, on-policy RL framework whose defining properties are not exhausted by its PPO ancestry. Its behavior is shaped by verifiable-reward structure, KL direction, group size, within-group overlap, speculative generation policy, and synchronization stalls. This suggests that future work on Sync-GRPO will proceed along two coupled fronts: sharper analyses of the induced objective and stronger systems mechanisms for making synchronous on-policy training efficient at scale.