Single-Rollout PPO for Language Models

Updated 4 July 2026
  • Single-Rollout Proximal Policy Optimization (SR-PPO) is a reinforcement learning method designed for long-horizon language-model reasoning using a single sampled trajectory per prompt.
  • It leverages a Monte Carlo Pass@k critic to transform a single Pass@1 rollout into dense token-level advantages, enabling precise credit assignment from sparse terminal feedback.
  • SR-PPO employs a PPO-style policy update with token-level value differences, offering enhanced stability and efficiency compared to methods that rely on multiple rollouts or temporal-difference bootstrapping.

Single-Rollout Proximal Policy Optimization (SR-PPO) is a PPO-derived reinforcement-learning method for LLMs in which each prompt contributes exactly one sampled trajectory and only a final binary correctness signal is observed. Its central objective is token-level credit assignment under outcome-only supervision: instead of relying on repeated completions per prompt or temporal-difference bootstrapping, SR-PPO trains a prefix critic from Monte Carlo outcomes and converts a single Pass@1 rollout into dense token-level advantages. The method was introduced in “Learning with a Single Rollout via Monte Carlo Pass@k Critic” (Che et al., 24 Jun 2026).

1. Problem setting and algorithmic scope

SR-PPO is formulated for long-horizon language-model reasoning, where a prompt $x$ induces a generated token sequence $y_{1:T}$, and training observes only a terminal binary label

$$Y \in \{0,1\},$$

with $Y=1$ indicating a correct final answer. The paper treats each prefix as a state,

$$s_t=(x,y_{1:t}), \qquad t=0,\dots,T,$$

and frames the main difficulty as token-level credit assignment from sparse terminal supervision (Che et al., 24 Jun 2026).

The method is motivated by two limitations of repeated-sampling approaches in language-model RL. First, collecting many completions per prompt is expensive when reasoning traces are long or agentic. Second, multiple sampled traces can diverge after a short common prefix, so cross-rollout comparison becomes difficult when reasoning procedures are heterogeneous. In that setting, group-relative methods such as GRPO are described as limited for two reasons: an outcome reward is too sparse to be attributed to specific actions such as intermediate steps, and comparing heterogeneous sampled traces is non-trivial (Che et al., 24 Jun 2026).

Within this setup, “single-rollout” means one sampled trajectory per prompt during training. It does not denote a single token, a single optimization step in the abstract, or a single demonstration in the imitation-learning sense. This distinction matters because the term could be confused with single-demonstration PPO variants such as PPO+D, which instead replay one human demonstration trajectory and later self-generated trajectories in sparse-reward control (Libardi et al., 2020).

2. Prefix value modeling and the Monte Carlo Pass@k critic

The foundational value quantity in SR-PPO is the prefix success probability under policy $\pi$,

$$q^\pi(s_t)=\Pr_\pi(Y=1\mid s_t),$$

which the paper identifies as the Pass@1 success probability of continuing from prefix $s_t$ (Che et al., 24 Jun 2026). Every sampled prefix on a trajectory receives the same Monte Carlo target $Y$, and the critic is trained by a calibration-oriented loss

$$L_{\text{credit}}=\frac{1}{T+1}\sum_{t=0}^{T} l_t(\phi) + \lambda_{\text{prompt}}\, l_0(\phi),$$

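where the per-prefix loss $l_t(\phi)$ combines a binary cross-entropy (BCE) term with a Brier term; its exact definition is not reproduced here. The BCE term is used for discrimination, the Brier term for probability calibration, and the extra prompt-level term anchors the estimate at the initial state (Che et al., 24 Jun 2026).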
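As a rough illustration of the loss structure described above, the sketch below averages a per-prefix loss over all prefixes of one trajectory and adds the prompt-level anchor term. The per-prefix loss is assumed here to be BCE plus a weighted Brier term; that additive form, the `brier_weight` and `lambda_prompt` defaults, and the function names are illustrative assumptions, not the paper's exact definition.

```python
import math

def prefix_loss(p, y, brier_weight=1.0, eps=1e-7):
    """Per-prefix loss: BCE plus a weighted Brier term (assumed form)."""
    p = min(max(p, eps), 1.0 - eps)  # clamp so the logarithms stay finite
    bce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    brier = (p - y) ** 2
    return bce + brier_weight * brier

def credit_loss(prefix_probs, y, lambda_prompt=0.5, brier_weight=1.0):
    """Mean per-prefix loss over t = 0..T plus an extra prompt-level term.

    prefix_probs[t] is the critic's predicted success probability at prefix s_t.
    Every prefix shares the same Monte Carlo target y in {0, 1}.
    """
    losses = [prefix_loss(p, y, brier_weight) for p in prefix_probs]
    return sum(losses) / len(losses) + lambda_prompt * losses[0]
```

Because every prefix is scored against the same terminal label, a single rollout already yields T + 1 training targets for the critic.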

The paper’s main innovation is to parameterize the critic in terms of prefix Pass@$k$ rather than only Pass@1. For a prefix $s_t$,

$$p_k^\pi(s_t)=\Pr_\pi\!\Big(\max_{1\le j\le k} Y^{(j)}=1 \;\Big|\; s_t\Big)=1-\big(1-q^\pi(s_t)\big)^k,$$

where $Y^{(1)},\dots,Y^{(k)}$ are conditionally independent continuation outcomes. Although training observes only one Pass@1 rollout per prompt, the critic is parameterized in Pass@$k$ space and mapped back to an induced Pass@1 estimate via

$$q^\pi(s_t)=1-\big(1-p_k^\pi(s_t)\big)^{1/k}.$$

This makes the Pass@$k$ critic a reparameterized family trained from single-rollout Monte Carlo supervision rather than a direct empirical estimator based on $k$ sampled continuations per prefix (Che et al., 24 Jun 2026).
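The reparameterization can be checked numerically. The sketch below assumes only what the text states, namely that the k continuation outcomes are conditionally independent given the prefix, so Pass@k is the probability that at least one of k continuations succeeds. The function names are illustrative.

```python
def pass_at_k_from_pass_at_1(q, k):
    """Probability that at least one of k independent continuations succeeds."""
    return 1.0 - (1.0 - q) ** k

def pass_at_1_from_pass_at_k(p_k, k):
    """Invert the map above to recover the induced Pass@1 estimate."""
    return 1.0 - (1.0 - p_k) ** (1.0 / k)

q = 0.2
p4 = pass_at_k_from_pass_at_1(q, 4)        # 1 - 0.8**4 = 0.5904
q_back = pass_at_1_from_pass_at_k(p4, 4)   # recovers 0.2 up to rounding
```

The two maps are mutual inverses on [0, 1], which is why a critic trained in Pass@k space still induces a Pass@1 estimate.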

The paper argues that Pass@$k$ is more selective than Pass@1. Since, writing $q=q^\pi(s_t)$,

$$\frac{\partial}{\partial q}\Big[1-(1-q)^k\Big]=k\,(1-q)^{k-1},$$

the sensitivity vanishes as $q\to 1$ for $k>1$. This attenuates gradients on prefixes that are already easy to solve while preserving sensitivity on marginal prefixes whose success probability remains low but nonzero. In the paper’s interpretation, Pass@$k$ therefore discounts easily solved prefixes and prioritizes hard ones whose success probability remains marginal (Che et al., 24 Jun 2026).
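A small numerical sketch makes the selectivity argument concrete. It evaluates the derivative of 1 - (1 - q)^k with respect to q at an easy, a medium, and a hard prefix; the probe values and the function name are chosen only for illustration.

```python
def pass_at_k_sensitivity(q, k):
    """Derivative of 1 - (1 - q)**k with respect to q."""
    return k * (1.0 - q) ** (k - 1)

# Compare k = 1 (constant sensitivity) with k = 4 at three success levels.
for q in (0.05, 0.5, 0.95):
    print(q, pass_at_k_sensitivity(q, 1), pass_at_k_sensitivity(q, 4))
```

For k = 1 the sensitivity is 1 everywhere, while for k = 4 it is amplified on hard prefixes and nearly zero on prefixes that are already solved most of the time.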

3. Token-level credit assignment and the PPO-style update

SR-PPO constructs token-level learning signals from successive changes in prefix value. For Pass@1, the local value difference for token $y_t$ is the change in the critic's Pass@1 estimate between the consecutive prefixes $s_{t-1}$ and $s_t$. The token-level advantage adds a terminal correction term to this difference. The displayed formulas, including the specific setting the paper adopts, are not reproduced here.

The first term is a local progress estimate; the second is a terminal correction that aligns dense token credit with the observed final outcome (Che et al., 24 Jun 2026).

For Pass@$k$, the same construction is used, with the critic's Pass@$k$ estimate taking the place of the Pass@1 estimate. When $k=1$, this reduces to the Pass@1 form. The paper notes that in qualitative visualizations the terminal correction often dominates the resulting advantage signal, especially on failed generations, which implies that the local-difference term and the calibration term are not equally weighted in practice (Che et al., 24 Jun 2026).
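The following sketch shows how dense token-level credit can be read off a single trajectory of critic values. It is built on an assumed form: the local difference of critic values across each token plus a shared terminal correction `beta * (y - values[-1])`. The form of the correction, the coefficient `beta`, and the function name are hypothetical and are not taken from the paper.

```python
def token_advantages(values, y, beta=1.0):
    """Dense per-token credit from one trajectory of critic values.

    values[t] is the critic value (Pass@1 or Pass@k) at prefix s_t, t = 0..T.
    y is the observed final label in {0, 1}.
    Returns one advantage per generated token (T entries).
    """
    # Assumed terminal correction, identical for every token of the trajectory.
    terminal_correction = beta * (y - values[-1])
    return [
        (values[t] - values[t - 1]) + terminal_correction
        for t in range(1, len(values))
    ]

values = [0.30, 0.35, 0.20, 0.60]   # critic estimates at s_0 .. s_3
adv = token_advantages(values, y=1, beta=0.5)
```

In this toy trajectory the second token lowers the critic's estimate and receives the smallest credit, while the third token raises it sharply and receives the largest.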

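The policy update uses a PPO-style importance ratio. For a batch of $B$ prompts with one sampled trajectory per prompt, the objective is a PPO-style surrogate over the sampled tokens, built on the per-token ratio between the current policy and the policy that generated the rollout,

$$r_{i,t}(\theta)=\frac{\pi_\theta(y_{i,t}\mid s_{i,t-1})}{\pi_{\theta_{\text{old}}}(y_{i,t}\mid s_{i,t-1})},$$

where $i$ indexes prompts and $t$ indexes tokens; the full displayed objective is not reproduced here.

The token advantages are detached before policy optimization (Che et al., 24 Jun 2026).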

In the reported training schedule, PPO clipping is effectively inactive. The paper states that each batch is freshly sampled, a single PPO gradient computation is used, and the training batch size equals the PPO minibatch size; thus before the optimizer step,

$$\theta=\theta_{\text{old}},$$

so the importance ratio satisfies

$$r_{i,t}(\theta)=1$$

for every sampled token. Under that schedule, SR-PPO is effectively an on-policy single-step policy-gradient update with dense token-level credit rather than a multi-epoch clipped-ratio optimizer in the standard PPO sense (Che et al., 24 Jun 2026).
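The effect of a unit ratio on the clipped surrogate can be verified directly. The sketch below implements the standard per-token PPO clipped term (Schulman et al., 2017); the `clip_eps` default of 0.2 is an assumed placeholder, not the paper's setting.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for a single token."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

# Fresh on-policy batch with a single gradient computation: new and old
# log-probabilities coincide, so the ratio is exactly 1 and clipping is inert.
logp = math.log(0.3)
for adv in (-2.0, 0.5, 3.0):
    assert clipped_surrogate(logp, logp, adv) == adv
```

At a ratio of 1 the surrogate equals the advantage itself, and its gradient with respect to the policy parameters is the advantage times the gradient of the token log-probability, which is the ordinary policy-gradient term.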

4. Pass@k as a reachability surrogate

A distinctive theoretical contribution of SR-PPO is its interpretation of Pass@$k$ as an interpolation between success probability and reachability. Let

$$R^\pi(s)=\mathbb{1}\big[q^\pi(s)>0\big]$$

indicate whether success is reachable from state $s$. The appendix proves that for every state $s$,

$$\lim_{k\to\infty}\Big[1-\big(1-q^\pi(s)\big)^k\Big]=R^\pi(s).$$

Moreover, under an additional condition stated in the paper, the convergence is uniform and comes with an explicit bound; the condition and the bound are not reproduced here.

Thus, large-$k$ Pass@$k$ ceases to represent “how likely success is under the current policy” and instead approaches “whether success is reachable at all from this prefix” (Che et al., 24 Jun 2026).
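The limiting behavior is easy to observe numerically. The sketch below assumes the independence-based form 1 - (1 - q)^k for Pass@k and prints its value for growing k at an unreachable prefix (q = 0), a barely reachable one, and a moderately easy one; the probe values are illustrative.

```python
def pass_at_k(q, k):
    """Pass@k under k independent continuations with success probability q."""
    return 1.0 - (1.0 - q) ** k

# As k grows, each row tends to 0 when q == 0 and to 1 when q > 0.
for q in (0.0, 0.01, 0.3):
    print(q, [round(pass_at_k(q, k), 4) for k in (1, 4, 64, 1024)])
```

The gap to the limit on a reachable prefix is (1 - q)^k, so convergence is slow exactly where the success probability is small.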

The paper also establishes a function-approximation result, stated in terms of the best approximation error of a given function class; the displayed definitions and bounds are not reproduced here.

This ties the learnability of Pass@$k$ to the learnability of reachability, and it is the formal basis for the claim that larger $k$ can provide a more structured surrogate for credit assignment (Che et al., 24 Jun 2026).

The paper additionally formulates reasoning as a finite-horizon state graph over prefixes, with vertices $V$, edges $E$, and goal set $G \subseteq V$. In that graph, the reachability limit $R$ can be computed by backward dynamic programming: $R(v)=1$ for $v\in G$, and for every other vertex

$$R(v)=\max_{(v,u)\in E} R(u),$$

with $R(v)=0$ when $v$ has no outgoing edges. Because each node and edge is visited a constant number of times, the runtime is

$$O(|V|+|E|).$$

This does not provide an operational algorithm for real language-model state spaces, but it gives a precise theoretical interpretation of what large-$k$ credit modeling is approximating (Che et al., 24 Jun 2026).
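On a small explicit graph, the reachability indicator can be computed with a backward search from the goal set, which yields the same result as the backward recursion and touches each vertex and edge a constant number of times. The toy graph, vertex labels, and function name below are illustrative assumptions.

```python
from collections import deque

def reachable_states(edges, goals):
    """Return the set of vertices from which some goal vertex is reachable.

    edges: iterable of (u, v) pairs meaning prefix u can be extended to prefix v.
    goals: iterable of goal vertices (for example, completed correct responses).
    """
    # Build the reversed adjacency list so the search can walk backward.
    reverse = {}
    for u, v in edges:
        reverse.setdefault(v, []).append(u)

    reached = set(goals)
    queue = deque(reached)
    while queue:
        v = queue.popleft()
        for u in reverse.get(v, []):   # each edge is scanned once
            if u not in reached:
                reached.add(u)
                queue.append(u)
    return reached

edges = [("x", "a"), ("x", "b"), ("a", "ok"), ("b", "dead")]
print(reachable_states(edges, goals=["ok"]))
```

Here the prefix `b` leads only to a dead end, so it is excluded, while `x` and `a` are marked reachable.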

5. Relation to PPO and neighboring methods

Standard PPO was introduced as an on-policy method that alternates between collecting data and optimizing a surrogate objective using multiple epochs of minibatch updates on the same batch (Schulman et al., 2017). SR-PPO inherits PPO’s ratio-based policy-gradient form but changes the principal source of token-level learning signal. Its defining modification is not the clipped objective itself; it is the replacement of group-based or TD-style credit estimation with a learned prefix critic trained from Monte Carlo outcomes (Che et al., 24 Jun 2026).

The paper positions SR-PPO directly against three baselines. Relative to REINFORCE-style terminal-reward training, SR-PPO avoids assigning the same return to all tokens in the trajectory. Relative to GRPO, SR-PPO uses one rollout per prompt and does not depend on within-group normalization across multiple completions; instead it computes credit within the realized trajectory itself. Relative to GAE-PPO, it avoids temporal-difference bootstrapping, which the paper argues is a poor fit when supervision is only a delayed binary outcome and trajectories are long (Che et al., 24 Jun 2026).
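To make the contrast with group-relative credit concrete, the sketch below computes group-normalized returns in the mean-centered, standard-deviation-scaled style commonly associated with GRPO. The function name and the `eps` stabilizer are assumptions for illustration. With a single rollout per prompt the group has one member, so the normalized signal is identically zero.

```python
import statistics

def group_normalized_advantages(rewards, eps=1e-6):
    """Mean-center and scale the terminal rewards of one prompt's rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

print(group_normalized_advantages([1, 0, 0, 1, 0, 0, 0, 0]))  # 8 rollouts
print(group_normalized_advantages([1]))                        # 1 rollout: [0.0]
```

This is one way to see why a single-rollout regime needs a different credit source: group normalization carries no information when there is nothing to compare against.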

A concise comparison is useful:

| Method | Rollouts per prompt | Credit signal |
| --- | --- | --- |
| PPO | On-policy batch reuse | Advantage estimate, often GAE |
| GRPO | Multiple completions | Group-normalized episodic returns |
| SR-PPO | One rollout | Learned prefix Pass@$k$ critic and token-level value differences |

This positioning also clarifies what SR-PPO is not. It is not primarily a surrogate-redesign method of the kind explored by PPO-RPE, Truly PPO, PPG, or COPG, all of which modify PPO’s regularization geometry or update objective rather than the rollout-count assumption or token-level credit source (Kobayashi, 2020, Wang et al., 2019, Byun et al., 2020, Markowitz et al., 2023). Nor is it a demonstration-guided PPO variant of the PPO+D type, where one human trajectory seeds replay buffers and exploration (Libardi et al., 2020).

A common misconception is therefore to treat SR-PPO as merely “PPO with fewer samples.” More precisely, it is PPO in a single-rollout-per-prompt regime with a different critic architecture and a different notion of advantage. This suggests that its main novelty lies in credit assignment rather than in trust-region mechanics. A plausible implication is that SR-PPO and surrogate-regularization variants are orthogonal design axes rather than mutually exclusive alternatives.

6. Empirical validation, limitations, and significance

The empirical study is explicitly presented as an initial validation rather than a broad benchmark campaign. Training uses Qwen3-1.7B in thinking mode as the policy model, with a separate Qwen3-1.7B-based binary classification network as the prefix critic. Reported hyperparameters include the maximum response length, PPO learning rate, critic learning rate, PPO clip ratio, KL coefficient, entropy coefficient, and sampling temperature; their values are not reproduced here. SR-PPO uses batches of 256 prompts with one rollout each; the GRPO baseline uses 128 prompts with 8 rollouts per prompt (Che et al., 24 Jun 2026).

Training data come from the DeepScaleR mathematical reasoning dataset. Evaluation is conducted on AIME24, AIME25, and HMMT26, with 128 independent completions per problem and Pass@$k$ metrics computed from those samples. The abstract reports stable learning dynamics and consistent gains in Pass@128 success rates on benchmarks such as HMMT26 and AIME24 (Che et al., 24 Jun 2026).
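One common way to compute Pass@k from a fixed pool of samples is the unbiased combinatorial estimator popularized by code-generation evaluations: with n samples of which c are correct, the estimate is 1 - C(n - c, k) / C(n, k). Whether the paper uses this exact estimator is an assumption; the sketch only illustrates how Pass@k can be derived from 128 completions per problem.

```python
from math import comb

def pass_at_k_estimate(n, c, k):
    """Unbiased estimate of Pass@k from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 correct completions out of 128, evaluated at k = 8.
print(pass_at_k_estimate(n=128, c=3, k=8))
```

At k equal to n the estimate reduces to an indicator of whether any of the n samples was correct, which is how a Pass@128 figure is obtained from 128 completions.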

The paper emphasizes Pass@4 as the main operating point. It reports that Pass@4 SR-PPO is more stable than Pass@1 SR-PPO and that it achieves competitive Pass@8 validation performance while using only one rollout per prompt, whereas GRPO uses many more rollouts per update. Figure-based comparisons also show that, under the tested hyperparameters, GAE-PPO essentially fails to learn effectively in the single-rollout regime (Che et al., 24 Jun 2026).

An important quantitative diagnostic concerns sparsity of token advantages. Table 1 reports the fraction of tokens satisfying a condition on the token advantage for three settings; the condition and the reported values are not reproduced here:

  • Pass@4 SR-PPO at step 90
  • Pass@4 SR-PPO at step 180
  • Pass@1 SR-PPO at step 90

The paper interprets this as evidence that Pass@4 induces a much sparser and more selective training signal than Pass@1. This suggests that stability is linked not only to the existence of dense token-level credit, but also to its selectivity (Che et al., 24 Jun 2026).
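A sparsity diagnostic of this kind can be computed in a few lines. The sketch below counts the fraction of tokens whose advantage magnitude exceeds a threshold; both the magnitude criterion and the `threshold` default are hypothetical stand-ins for the condition used in the paper's Table 1.

```python
def active_fraction(advantages, threshold=1e-3):
    """Fraction of tokens whose |advantage| exceeds the threshold."""
    if not advantages:
        return 0.0
    active = sum(abs(a) > threshold for a in advantages)
    return active / len(advantages)

# Two of the four tokens carry a non-negligible signal.
print(active_fraction([0.0, 0.5, -0.2, 0.0005]))
```

A lower fraction means that fewer tokens receive a meaningful update, which is the sense in which a training signal is called sparse or selective.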

The limitations are substantial and explicitly acknowledged. The experiments are restricted to mathematical reasoning, one base model family, one seed, and a narrow hyperparameter range. The reachability theory is interpretive rather than operational for real language-model prefix graphs. The “Monte Carlo Pass@$k$ critic” is trained from single Pass@1 outcomes via a reparameterization rather than direct empirical Pass@$k$ measurements at prefixes. Finally, the appendix notes that the terminal correction term often dominates token credit, leaving open how much of the observed benefit comes from local prefix differences as opposed to global calibration (Che et al., 24 Jun 2026).

In the current literature, SR-PPO therefore occupies a specific position. It is neither a generic replacement for PPO nor a general theorem about one-shot policy optimization. It is a language-model RL method for the single-rollout-per-prompt regime, built around a prefix critic whose target family is indexed by Pass@$k$. Its significance lies in showing that dense token-level advantages can be recovered from one sampled trajectory and one final correctness label, and that Pass@$k$ provides a principled bridge from success probability to reachability in that setting (Che et al., 24 Jun 2026).
