Single-stream Policy Optimization (2509.13232v1)
Abstract: We revisit policy-gradient optimization for LLMs from a single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted on degenerate groups. Ablation studies confirm that SPO's gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. Across five hard math benchmarks with Qwen3 8B, SPO improves the average maj@32 by +3.4 percentage points (pp) over GRPO, driven by substantial absolute point gains on challenging datasets, including +7.3 pp on BRUMO 25, +4.4 pp on AIME 25, +3.3 pp on HMMT 25, and achieves consistent relative gain in pass@$k$ across the evaluated $k$ values. SPO's success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning.
Explain it Like I'm 14
What is this paper about?
This paper is about teaching LLMs to reason better using reinforcement learning (RL). The authors point out problems with a popular training style called “group-based” learning (like GRPO), and introduce a simpler, faster, and more reliable method called Single-stream Policy Optimization (SPO). SPO helps the model learn from every example without wasting time waiting for slow or unhelpful cases.
What questions are the authors trying to answer?
In simple terms, the paper asks:
- How can we train LLMs to reason better without wasting compute or getting unstable learning signals?
- Can we avoid the problems of group-based methods (like when all answers in a group are the same, so there’s nothing to learn)?
- Can we design a method that scales well when tasks take different amounts of time (for example, when the model uses tools or does long, multi-step reasoning)?
How does the method work?
First, a quick idea of RL here: the model sees a question (a prompt), generates an answer, and gets a reward (for example, 1 if correct, 0 if wrong). Training adjusts the model to make good answers more likely.
Why group-based methods struggle
Group-based methods (like GRPO) generate several answers for the same prompt at once and compare them to create a learning signal. But they have two big issues:
- Degenerate groups: If all answers are correct or all are wrong, the “difference” between them is zero (see the toy example after this list). That means the model gets no learning signal, and all that compute is wasted.
- Waiting on the slowest: In distributed systems, the whole group must finish before training can proceed. If one answer takes a long time (for example, a long tool-use chain), everything waits. This slows training a lot.
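To make the degenerate-group problem concrete, here is a toy numeric sketch (our illustration, not the paper's code) of a group-relative advantage like GRPO's: subtract the group's mean reward and divide by its standard deviation. When every answer in a group earns the same reward, every advantage is exactly zero.

```python
# Toy illustration of group-relative advantages (GRPO-style baseline).
import numpy as np

def group_advantages(rewards, eps=1e-6):
    # Advantage of each answer = its reward minus the group mean, scaled by the group std.
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

mixed_group = [1, 0, 1, 0]   # some right, some wrong
all_correct = [1, 1, 1, 1]   # degenerate group
all_wrong   = [0, 0, 0, 0]   # degenerate group

print(group_advantages(mixed_group))  # ≈ [ 1. -1.  1. -1.] -> useful learning signal
print(group_advantages(all_correct))  # [0. 0. 0. 0.] -> no learning signal, compute wasted
print(group_advantages(all_wrong))    # [0. 0. 0. 0.] -> no learning signal, compute wasted
```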
The new idea: Single-stream Policy Optimization (SPO)
Instead of using groups, SPO learns from one answer per prompt and still keeps the learning signal stable and strong. You can think of it like grading each answer against a fair, running estimate of how hard that question is.
SPO has three main parts (a code sketch of all three follows this list):
- A persistent value tracker (the “fair score”)
- Imagine keeping a small record for each question: how often the model has gotten it right in the past. That record is the “value tracker.”
- It updates over time using a simple, statistics-friendly rule (like an adaptive moving average). If the model changes a lot (measured by the KL divergence between how it behaves now and how it behaved before), the tracker “forgets” older data faster so it stays current.
- This tracker acts as a baseline or expected score. The learning signal becomes “how much better or worse than expected was this answer?”
- Batch-wide normalization
- Instead of normalizing within a small group (which is noisy), SPO normalizes the learning signal across the whole batch of different prompts. This makes the signal smoother and more reliable.
- Prioritized sampling (an adaptive curriculum)
- SPO picks prompts that are most informative to train on next: not too easy (almost always correct) and not too hard (almost always wrong), but in the middle where the model can learn the most. It still keeps some randomness so it doesn’t ignore other prompts.
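Here is a minimal sketch of those three parts, assuming binary (0/1) rewards. All names and constants (ValueTracker, D_HALF, the priority floor, and so on) are our own illustration; the paper's exact update rules and hyperparameters may differ.

```python
# Minimal sketch of SPO's three ingredients, assuming binary (0/1) rewards.
# All names and constants here are illustrative, not the paper's implementation.
import numpy as np

D_HALF = 0.1  # assumed KL "half-life": once the policy has moved this far, old evidence counts half

class ValueTracker:
    """Persistent per-prompt estimate of the success probability (the baseline)."""

    def __init__(self):
        self.value = 0.5   # prior guess before any observation
        self.weight = 0.0  # effective amount of past evidence behind the current value

    def advantage(self, reward):
        # Score the answer against the tracker BEFORE this reward is folded in,
        # so the baseline does not depend on the action it is judging.
        return reward - self.value

    def update(self, reward, kl_to_last_policy):
        # KL-adaptive forgetting: the more the policy has changed since this prompt
        # was last visited, the less the old evidence should count.
        decay = 0.5 ** (kl_to_last_policy / D_HALF)
        self.weight *= decay
        self.value = (self.weight * self.value + reward) / (self.weight + 1.0)
        self.weight += 1.0

def normalize_batch(advantages, eps=1e-6):
    # Batch-wide normalization: standardize across ALL prompts in the batch,
    # instead of within a small, noisy per-prompt group.
    a = np.asarray(advantages, dtype=float)
    return (a - a.mean()) / (a.std() + eps)

def sampling_priorities(tracked_values, floor=0.05):
    # Adaptive curriculum: favor prompts the model currently solves about half the
    # time (highest uncertainty), with a floor so no prompt is ignored entirely.
    # These weights would be renormalized into a sampling distribution.
    v = np.asarray(tracked_values, dtype=float)
    return np.maximum(v * (1.0 - v), floor)
```

In a training step you would compute each answer's advantage with the pre-update tracker value, normalize the whole batch, use the normalized advantages in the policy update, and only then fold the new rewards into the tracker, so the baseline never depends on the answer it is scoring.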
One more detail: SPO uses a careful update rule (similar to PPO-Clip) to avoid making giant, risky changes to the model all at once.
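That update rule is the standard clipped surrogate objective from PPO. A hedged sketch of how it might look here, with the single sequence-level advantage broadcast to every token of the answer (the paper may additionally use entropy-preserving variants such as Clip-Higher):

```python
# Sketch of a PPO-style clipped policy loss with one sequence-level advantage
# broadcast to every token of the answer (our illustration).
import torch

def clipped_policy_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """
    logp_new, logp_old: per-token log-probabilities of the sampled answer under the
                        current policy and the policy that generated it, shape [T].
    advantage:          one scalar for the whole answer (reward minus the tracker
                        baseline, then batch-normalized).
    """
    ratio = torch.exp(logp_new - logp_old)  # per-token importance ratio
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Pessimistic (minimum) objective, averaged over tokens; negate to get a loss.
    return -torch.minimum(unclipped, clipped).mean()
```

The only SPO-specific ingredient here is where the advantage comes from: the tracker baseline plus batch-wide normalization, instead of a per-group mean.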
What did the experiments show?
Here’s what the authors found when training a Qwen3-8B model on hard math problems (with tool use like a Python interpreter) and testing on tough benchmarks (AIME 24, AIME 25, BeyondAIME, BRUMO 25, HMMT 25):
- Better accuracy and smoother training:
- On average, SPO beats GRPO by +3.4 percentage points on a key metric (maj@32).
- Notable gains include: +7.3 points on BRUMO 25, +4.4 on AIME 25, and +3.3 on HMMT 25.
- What is “maj@32”? The model samples 32 answers for each problem and takes a majority vote over their final answers; the problem counts as solved if the vote lands on the correct answer. This measures consistency, not just luck.
- “pass@k” (another metric) also improves across the evaluated values of k, meaning the probability of getting at least one correct answer within k tries is higher with SPO. (A small code sketch of both metrics follows this list.)
- Less wasted compute:
- GRPO often creates “degenerate groups” (all right or all wrong), producing zero learning signal. SPO avoids that. In SPO, even small or near-zero signals typically mean “the tracker predicted well,” not “no learning possible.”
- More stable learning signals:
- SPO’s baseline (the value tracker) makes the signal less noisy than GRPO’s on-the-fly, per-group measurements. Less noise means more reliable training.
- Much better speed in “agentic” settings:
- In simulations where some tasks take much longer than others (like when the model does multi-step tool use), SPO’s group-free design avoids waiting for stragglers.
- Result: a 4.35× speedup in training throughput in a realistic long-tail timing scenario.
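For readers who want those two metrics pinned down, here is a small sketch (ours, assuming the standard definitions) of majority voting and the usual unbiased pass@k estimator; the paper's exact evaluation protocol may differ.

```python
# Computing maj@k and pass@k from n sampled answers per problem (illustrative sketch).
from collections import Counter
from math import comb

def maj_at_k(sampled_answers, correct_answer):
    # maj@k: take the most frequent final answer among k samples and
    # check whether that majority-vote answer is the correct one.
    voted_answer, _count = Counter(sampled_answers).most_common(1)[0]
    return voted_answer == correct_answer

def pass_at_k(n, c, k):
    # pass@k: probability that at least one of k samples is correct, using the
    # standard unbiased estimator from n samples of which c were correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: majority voting over 4 samples, and pass@8 estimated from 32 samples with 12 correct.
print(maj_at_k(["17", "42", "42", "42"], "42"))  # True: the vote picks "42"
print(pass_at_k(n=32, c=12, k=8))                # chance that a batch of 8 contains a correct answer
```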
Why does this matter?
- Simpler and stronger: SPO is simpler than group-based methods (no need to generate multiple answers per prompt), yet it works better. It shows that solid fundamentals can beat complicated workarounds.
- Scales to real-world use: Because it doesn’t wait on groups, SPO is well-suited for advanced tasks where the model uses tools, browses, or plans over many steps—settings where some attempts naturally take longer.
- Efficient and principled: SPO keeps a persistent, adaptive estimate of “how hard this prompt is for the current model,” normalizes signals in a stable way, and focuses training where it helps most.
In short, SPO helps LLMs learn reasoning more efficiently, wastes less compute, trains faster in complex scenarios, and improves accuracy—without adding a lot of extra complexity.
Knowledge Gaps
Below is a concise, actionable list of knowledge gaps, limitations, and open questions left unresolved by the paper:
- Unbiasedness under prioritized prompt sampling: the method reweights the prompt distribution via w_i(x) without an importance-sampling correction; how biased is the learned policy relative to the target data distribution, and when (if ever) are IS corrections necessary?
- Generality beyond binary rewards: the paper claims easy generalization to non-binary rewards via EMA but provides no derivation, stability analysis, or empirical validation for continuous or multi-objective reward settings (e.g., outcome + format + cost).
- Credit assignment in multi-step/agentic settings: SPO treats an entire multi-turn/tool-using interaction as a single action with a sequence-level advantage; how to extend SPO to step-wise/turn-level returns and advantages with partial/intermediate rewards?
- Value-tracker KL computation: the KL-adaptive forgetting requires D(x) between the current policy and “the last policy that acted on x,” but the paper does not specify how D(x) is computed efficiently and accurately (token-level vs sequence-level, reference policy vs stored logits, cost at scale).
- Memory and scalability of the tabular tracker: storing Beta/EMA state per prompt may be feasible for 16k items but becomes memory-heavy for millions of prompts; what are memory/latency costs and data-structure strategies at web-scale?
- Cold-start cost and practicality: the required n0=8 offline samples per prompt can be substantial; the paper does not quantify the net compute vs GRPO (including dynamic sampling) or offer principled ways to amortize/reuse initial estimates across policies/datasets.
- Staleness for infrequently sampled prompts: the tracker’s forgetting relies on per-prompt KL with the last-acting policy; how is D(x) computed when a prompt has not been visited for many steps, and does staleness produce biased or high-variance advantages?
- Sensitivity to D_half, ρ_min/ρ_max, and ε (priority floor): there is insufficient ablation or guidance on hyperparameter sensitivity, interactions, and default-setting robustness across tasks and model sizes.
- Global advantage normalization under biased batches: batch-wise standardization mixes prompts sampled via a non-uniform curriculum; what are the stability and bias implications relative to per-prompt or robust normalization schemes?
- Length bias and token-level credit: SPO applies a single sequence-level advantage to all tokens, which can introduce length bias; how does it compare to length-aware baselines (e.g., OPO) or per-token credit assignment in long generations?
- Robustness to reward noise and verifier errors: the tracker assumes reliable binary feedback; how sensitive is SPO to mislabeled outcomes or flaky verifiers, and can the tracker incorporate uncertainty/robust updates?
- Empirical scope limited to math reasoning (Qwen3-8B): generalization to other domains (coding, instruction following, RLHF with preference models), larger models, and multilingual settings is untested.
- Limited baselines: comparisons are solely against GRPO; empirical evaluations against RLOO, OPO, PPO-with-critic, A*-PO, and strong dynamic-sampling variants are missing.
- Real-system throughput evidence: the 4.35× speedup is simulation-based; no real cluster, wall-clock, or cost-per-quality measurements are provided under realistic tool latencies and straggler behavior.
- Interaction with KL regularization and entropy control: while compatible with PPO-Clip and entropy-preserving variants (Clip-Higher, KL-Cov), the paper does not analyze how the tracker and batch normalization behave under varying KL penalties/entropy targets.
- Handling unseen/streaming prompts: the tabular tracker does not generalize across prompts; how to initialize and adapt values for unseen items in streaming or expanding datasets, and would a learned value approximator improve cold-start/generalization?
- Curriculum side effects: prioritized sampling may narrow coverage and overfit to “borderline” prompts; beyond a fixed ε, what mechanisms ensure distributional coverage and prevent curriculum collapse?
- Safety/behavioral drift: absent an explicit KL-to-reference penalty, how does SPO control policy drift and ensure safe/benign outputs in broader RLHF settings?
- Advantage variance theory and guarantees: while empirical variance plots are shown, a formal analysis of variance properties under batch normalization and prioritized sampling (with/without IS) is missing.
- Tool-integrated credit and environment variability: the method abstracts away tool latencies and partial progress signals; how to incorporate environment feedback (e.g., tool success rates, intermediate checks) into the value tracker and policy updates?
- Tracker update ordering and bias: updating the tracker immediately after observing r(x,y) while computing A(x,y) with v_{-1}(x) preserves action-independence; however, multi-epoch PPO updates reuse samples—does tracker evolution across epochs introduce subtle biases?
- Robust normalization under heavy-tailed advantages: the paper uses mean/std; alternatives (e.g., median/IQR, clipping) for heavy-tailed or multimodal advantage distributions are not explored.
- Fairness and stability of repeated sampling: focusing on high-uncertainty prompts may repeatedly sample a subset of items; what are the impacts on fairness across data subgroups and on catastrophic forgetting of already-mastered prompts?
- Implementation details for D(x) and storage: the required bookkeeping (last-acting policy snapshot per prompt, KL computation, tracker state) is not concretely specified; practical recipes and systems guidance are needed.
- Limits of SPO in sparse-success regimes: while the tracker aims to stabilize sparse rewards, the paper does not characterize the regime (success probability p) where SPO’s advantages over GRPO or dynamic sampling are largest/smallest.
These gaps suggest concrete directions: add IS-corrected prioritized sampling baselines; extend to step-wise returns and non-binary rewards with theory and experiments; provide real-system throughput studies; compare against broader baselines; analyze hyperparameter sensitivity; and develop scalable, generalizing value estimators and robust normalization techniques.