Single-stream Policy Optimization (2509.13232v1)

Published 16 Sep 2025 in cs.LG, cs.AI, and stat.ML

Abstract: We revisit policy-gradient optimization for LLMs from a single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted on degenerate groups. Ablation studies confirm that SPO's gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. Across five hard math benchmarks with Qwen3 8B, SPO improves the average maj@32 by +3.4 percentage points (pp) over GRPO, driven by substantial absolute point gains on challenging datasets, including +7.3 pp on BRUMO 25, +4.4 pp on AIME 25, +3.3 pp on HMMT 25, and achieves consistent relative gain in pass@$k$ across the evaluated $k$ values. SPO's success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning.

Summary

  • The paper presents a principled single-stream RL method that replaces group-based normalization with a persistent, KL-adaptive value tracker.
  • It introduces global advantage normalization and prioritized prompt sampling, reducing variance and improving throughput in policy optimization.
  • Empirical results on math reasoning benchmarks show SPO’s superior sample efficiency, stability, and scalability in agentic training scenarios.

Single-stream Policy Optimization: A Principled, Scalable Alternative to Group-based RL for LLMs

Introduction and Motivation

Single-stream Policy Optimization (SPO) addresses fundamental inefficiencies in group-based policy gradient methods for LLMs, such as Group Relative Policy Optimization (GRPO). While group-based approaches have been widely adopted for their variance reduction properties, they introduce critical drawbacks: (1) degenerate groups with uniform outcomes yield zero learning signal, wasting compute; (2) synchronization barriers in distributed settings severely limit scalability, especially in agentic or tool-integrated tasks with variable generation times. SPO proposes a return to the classic single-stream paradigm, leveraging a persistent, KL-adaptive value tracker and global advantage normalization to provide stable, low-variance learning signals for every sample, while enabling higher throughput and adaptive curriculum learning.

Figure 1: Illustrations of GRPO and the proposed SPO.

Methodology

KL-Adaptive Value Tracker

SPO replaces per-group, on-the-fly baselines with a persistent Bayesian value tracker for each prompt. For binary rewards, the tracker models the success probability using a Beta distribution, updating its parameters $(\alpha, \beta)$ with each new observation and discounting past evidence according to the KL divergence between the current and previous policies. This adaptive forgetting mechanism ensures the baseline remains relevant as the policy evolves. The tracker’s posterior mean serves as the baseline for advantage estimation, providing a low-variance, temporally-informed estimate of $V_\pi(x)$.
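A minimal sketch of such a tracker is shown below. The Beta$(1,1)$ prior, the exponential discount $\exp(-\eta \cdot \mathrm{KL})$ applied to the accumulated pseudo-counts, and the hyperparameter $\eta$ are illustrative assumptions; the paper's exact forgetting rule is not reproduced here.

```python
import math

class KLAdaptiveValueTracker:
    """Per-prompt Beta tracker of the success probability, with KL-based forgetting.

    Illustrative sketch: the exponential discount exp(-eta * KL) toward the prior
    and the Beta(1, 1) prior are assumptions, not the paper's exact update rule.
    """

    def __init__(self, prior_alpha: float = 1.0, prior_beta: float = 1.0, eta: float = 1.0):
        self.prior_alpha = prior_alpha
        self.prior_beta = prior_beta
        self.eta = eta  # strength of KL-based forgetting
        self.alpha = prior_alpha
        self.beta = prior_beta

    def discount(self, kl: float) -> None:
        """Shrink accumulated evidence toward the prior as the policy drifts (larger KL -> more forgetting)."""
        gamma = math.exp(-self.eta * kl)
        self.alpha = self.prior_alpha + gamma * (self.alpha - self.prior_alpha)
        self.beta = self.prior_beta + gamma * (self.beta - self.prior_beta)

    def update(self, reward: float) -> None:
        """Incorporate one binary outcome (1.0 = success, 0.0 = failure)."""
        self.alpha += reward
        self.beta += 1.0 - reward

    def value(self) -> float:
        """Posterior mean of the success probability, used as the baseline for advantage estimation."""
        return self.alpha / (self.alpha + self.beta)
```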

Global Advantage Normalization

Instead of normalizing advantages within small, per-prompt groups, SPO normalizes them globally across the entire batch. This approach leverages larger sample sizes for more stable statistics and avoids the instability and high variance associated with small group-based normalization. The normalized advantage is then used in a standard PPO-Clip policy loss, ensuring compatibility with established RL optimization techniques.
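The sketch below shows one natural reading of this step, assuming the per-sample advantage is the reward minus the tracker baseline, standardized with batch-level mean and standard deviation before entering a standard PPO-Clip loss; the exact normalization constants are not taken from the paper.

```python
import torch

def global_normalized_advantages(rewards: torch.Tensor, baselines: torch.Tensor,
                                 eps: float = 1e-8) -> torch.Tensor:
    """Center each reward by its persistent tracker baseline, then standardize over the whole batch."""
    adv = rewards - baselines                      # per-sample advantage vs. tracker baseline
    return (adv - adv.mean()) / (adv.std() + eps)  # batch-level statistics, not per-group

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """Standard PPO-Clip objective consuming the globally normalized advantages."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```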

Prioritized Prompt Sampling

SPO introduces a curriculum learning mechanism by prioritizing prompts with high learning potential, as measured by the estimated standard deviation of the value tracker. Sampling weights are proportional to $\sqrt{\hat{v}(x)(1-\hat{v}(x))} + \epsilon$, focusing training on prompts that are neither trivial nor impossible, while maintaining exploration. This adaptive curriculum further improves data efficiency and accelerates convergence.
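A short sketch of this sampler follows, using the weight formula above; the $\epsilon$ value and sampling with replacement are illustrative choices rather than details from the paper.

```python
import math
import random

def sampling_weights(v_hat: list[float], eps: float = 0.05) -> list[float]:
    """Weight each prompt by its estimated outcome spread sqrt(v*(1-v)) plus an exploration floor eps."""
    return [math.sqrt(v * (1.0 - v)) + eps for v in v_hat]

def sample_prompts(prompts: list[str], v_hat: list[float], k: int, eps: float = 0.05) -> list[str]:
    """Draw k prompts with probability proportional to their weights (with replacement, for simplicity)."""
    return random.choices(prompts, weights=sampling_weights(v_hat, eps), k=k)

# Prompts whose tracked success probability is near 0.5 are sampled most often;
# near-solved (v ~ 1) and near-impossible (v ~ 0) prompts still get the eps floor.
```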

Empirical Results

Performance on Math Reasoning Benchmarks

SPO was evaluated on five challenging math competition benchmarks (AIME 24, AIME 25, BeyondAIME, BRUMO 25, HMMT 25) using Qwen3-8B. Across all datasets, SPO consistently outperformed GRPO on the maj@32 metric, with an average improvement of +3.4 percentage points. Notably, on BRUMO 25, SPO achieved a +7.3 pp gain, and on AIME 25 and HMMT 25, gains of +4.4 pp and +3.3 pp, respectively. The pass@$k$ curves for SPO were consistently above those of GRPO for all $k$ values, indicating robust improvements in sample efficiency and reliability.

Figure 2: Pass@$k$ plots comparing GRPO and SPO across five math competition benchmarks.

Signal Efficiency and Stability

A detailed analysis of the learning signal revealed that the majority of GRPO samples fall into degenerate groups, yielding zero advantage and no gradient. In contrast, SPO maintains a low rate of near-zero advantages, which increases only as the value tracker becomes more accurate, reflecting successful learning rather than wasted computation. Furthermore, SPO’s advantage variance is substantially lower and more stable than GRPO’s, which suffers from high volatility due to noisy, small-sample baselines and normalization.
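To make the degenerate-group effect concrete, the sketch below computes the commonly used group-relative advantage (reward minus group mean, scaled by the group standard deviation); when every rollout in a group receives the same reward, every advantage is exactly zero and the group contributes no gradient.

```python
import statistics

def group_relative_advantages(group_rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: center by the group mean, scale by the group standard deviation."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# A degenerate group (all failures or all successes) yields zero advantage for every member.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))  # mixed outcomes: nonzero advantages
```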

Figure 3: The majority of GRPO samples fall into degenerate groups, yielding zero advantage and no learning signal, while SPO maintains a low rate of near-zero advantages.

Scalability in Agentic Training

SPO’s group-free architecture eliminates the synchronization bottleneck inherent in group-based methods. In agentic training scenarios with high-variance generation times, group-based approaches are bottlenecked by the slowest trajectory in each group, leading to significant compute waste. Simulations demonstrate that SPO can achieve a 4.35× speedup in training throughput by simply collecting the first $N$ completed samples from a larger pool, naturally filtering out stragglers.
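The toy simulation below illustrates the idea (it is not the paper's simulation setup): with heavy-tailed generation times, waiting for every rollout in a fixed batch is dominated by stragglers, whereas launching a larger pool and keeping the first $N$ completions finishes much earlier. The lognormal latency model and the 1.5× oversampling factor are illustrative assumptions.

```python
import random

def wait_for_all(gen_times: list[float]) -> float:
    """Synchronous batch: done only when the slowest rollout completes."""
    return max(gen_times)

def first_n(gen_times: list[float], n: int) -> float:
    """Group-free batch: launch a larger pool and stop once n rollouts have finished."""
    return sorted(gen_times)[n - 1]

random.seed(0)
batch = 128
pool = [random.lognormvariate(0.0, 1.5) for _ in range(int(1.5 * batch))]  # heavy-tailed rollout times
print(wait_for_all(pool[:batch]))  # wall-clock time dominated by the slowest rollout
print(first_n(pool, batch))        # oversample and drop stragglers: much lower wall-clock time
```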

Figure 4: In a low-variance environment, group synchronization cost is minimal, but in a high-variance agentic environment, slow trajectories create severe bottlenecks.

Figure 5: The group-free approach achieves a $4.35\times$ speedup over group-based sampling by avoiding synchronization bottlenecks.

Theoretical and Practical Implications

SPO’s design is grounded in foundational RL principles, eschewing the incidental complexity of group-based methods. By decoupling baseline estimation from the current batch and leveraging global normalization, SPO achieves lower variance and higher sample efficiency. The persistent value tracker enables adaptive curriculum learning, further improving convergence rates. The group-free architecture is inherently more scalable, particularly in distributed and agentic settings, and is compatible with a wide range of policy optimization algorithms and advanced RL techniques.

The empirical results challenge the prevailing trend of increasing algorithmic complexity in RL for LLMs, demonstrating that principled, single-stream approaches can yield superior performance and scalability. The analysis of variance reduction and information loss provides a rigorous foundation for understanding the limitations of group-based methods and the advantages of SPO.

Future Directions

Potential avenues for future research include extending SPO to non-binary reward settings, integrating more sophisticated value tracking mechanisms, and exploring its application to even larger models and more complex agentic tasks. Further investigation into optimal curriculum strategies and batching schemes could yield additional efficiency gains. The robust, scalable foundation provided by SPO positions it as a strong candidate for powering the next generation of reasoning and agentic LLMs.

Conclusion

Single-stream Policy Optimization offers a principled, efficient, and scalable alternative to group-based RL methods for LLMs. By leveraging a KL-adaptive value tracker, global advantage normalization, and prioritized sampling, SPO eliminates the critical inefficiencies of group-based approaches and achieves superior empirical performance on challenging reasoning tasks. Its design highlights the enduring value of foundational RL principles and provides a robust platform for future advances in LLM optimization and agentic training.
