
Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR

Published 25 Mar 2026 in cs.CL | (2603.24840v1)

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of LLMs. However, methods such as GRPO and DAPO suffer from substantial computational cost, since they rely on sampling many rollouts for each prompt. Moreover, in RLVR the relative advantage is often sparse: many samples become nearly all-correct or all-incorrect, yielding low within-group reward variance and thus weak learning signals. In this paper, we introduce ARRoL (Accelerating RLVR via online Rollout Pruning), an online rollout pruning method that prunes rollouts during generation while explicitly steering the surviving ones toward a correctness-balanced mix to strengthen learning signals. Specifically, ARRoL trains a lightweight quality head on the fly to predict the success probability of partial rollouts and uses it to make early pruning decisions. The learned quality head can further weigh candidates to improve inference accuracy during test-time scaling. To improve efficiency, we present a system design that prunes rollouts inside the inference engine and re-batches the remaining ones for log-probability computation and policy updates. Across GRPO and DAPO on Qwen-3 and LLaMA-3.2 models (1B-8B), ARRoL improves average accuracy by +2.30 to +2.99 while achieving up to 1.7x training speedup, and yields up to +8.33 additional gains in average accuracy in test-time scaling. The code is available at https://github.com/Hsu1023/ARRoL.

Summary

  • The paper introduces ARRoL, an online rollout pruning method that optimizes reward balance during RLVR.
  • It employs a lightweight quality head to predict rollout success early, significantly reducing computational overhead.
  • Experimental results demonstrate notable accuracy improvements and faster training, enhancing RL for LLM reasoning.

Online Rollout Pruning for Efficient and Balanced RLVR

Introduction

This work analyzes and addresses two core problems in RLVR (Reinforcement Learning with Verifiable Rewards) for LLM-based reasoning: computational inefficiency and reward-signal sparsity. RLVR methods such as GRPO and DAPO improve LLM reasoning via verifiable (e.g., correctness-based) rewards, but they sample multiple rollouts per prompt, incurring high computational cost and suffering from low within-group reward variance when groups are dominated by all-correct or all-incorrect traces. The proposed method, ARRoL (Accelerating RLVR via online Rollout Pruning), prunes rollouts during generation using a learned, lightweight quality head that scores partial rollouts. This early, quality-aware pruning serves both computational efficiency and explicit reward-balance control, substantially improving model performance and resource usage at training and test time.

Figure 1: Overview of ARRoL, illustrating early pruning with a learned quality head, training acceleration, and accuracy improvements versus GRPO.

Methodology

Reward-Balanced Online Pruning

ARRoL's central innovation is accelerating policy-gradient training by pruning uninformative or unbalanced rollouts during generation, rather than only after full sequences have been sampled. Formally, GRPO normalizes reward signals within prompt-specific groups; when all rewards in a group are identical (all 0 or all 1), the advantage, and hence the gradient signal, collapses to zero. ARRoL introduces an online mechanism that monitors the evolving reward distribution and prunes rollouts to achieve a target positive-reward ratio (typically $\rho = 0.5$), maximizing within-group variance and learning signal. This is supported by high-probability bounds showing that posterior-guided pruning yields empirical reward distributions $O(\epsilon)$-close to the optimal balance.
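The advantage collapse, and the balance-aware pruning that counters it, can be sketched as follows. The `prune_to_ratio` helper is an illustrative simplification, not the paper's exact selection rule:

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize rewards within one prompt's rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def prune_to_ratio(pred_success, keep, rho=0.5):
    """Toy balance-aware pruning: keep `keep` rollouts such that roughly a
    fraction `rho` of the survivors are predicted to be correct.
    `pred_success` stands in for the quality head's probability estimates."""
    idx = np.argsort(pred_success)   # ascending predicted success
    n_pos = int(round(rho * keep))   # slots for likely-correct rollouts
    n_neg = keep - n_pos             # slots for likely-incorrect rollouts
    return sorted(list(idx[-n_pos:]) + list(idx[:n_neg]))

# An all-correct group has zero reward variance: advantages vanish, no gradient.
print(group_advantages([1, 1, 1, 1]))   # all zeros

# A balanced group, by contrast, produces nonzero advantages.
print(group_advantages([1, 0, 1, 0]))
```

Pruning toward a half-correct group thus keeps the normalized advantages, and with them the policy-gradient signal, away from the degenerate all-equal case.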

Lightweight Quality Head

Since the true reward for a rollout is known only at the end of generation, ARRoL trains a small MLP quality head to predict success probability from partial rollout representations. Compared with prior heuristic confidence signals (e.g., token-level entropy, DeepConf, self-certainty), the quality head better separates correct from incorrect rollouts, as evidenced by both improved distribution separability and higher rank correlation with actual correctness.

Figure 2: (a) Failure case for trace-confidence metrics; (b) quality-head scores distinctly separate correct/incorrect rollouts; (c) higher predictive correlation from the quality head across data; (d) early detection (length $\leq 512$) is reliable and computationally preferred.
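A minimal stand-in for such a quality head, assuming a pooled feature vector per partial rollout and online SGD on binary correctness labels that arrive once rollouts finish and are verified (the architecture, dimensions, and training recipe here are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

class QualityHead:
    """One-hidden-layer MLP mapping a pooled partial-rollout
    representation to an estimated P(success)."""
    def __init__(self, d_in, d_hid=32, lr=0.1):
        self.W1 = rng.normal(0.0, 0.1, (d_in, d_hid))
        self.b1 = np.zeros(d_hid)
        self.w2 = rng.normal(0.0, 0.1, d_hid)
        self.b2 = 0.0
        self.lr = lr

    def forward(self, x):
        h = np.tanh(x @ self.W1 + self.b1)
        p = 1.0 / (1.0 + np.exp(-(h @ self.w2 + self.b2)))
        return p, h

    def update(self, x, label):
        """One on-the-fly SGD step on a verified binary outcome."""
        p, h = self.forward(x)
        g = p - label                       # d(BCE loss)/d(logit)
        dh = g * self.w2 * (1.0 - h ** 2)   # backprop through tanh
        self.w2 -= self.lr * g * h
        self.b2 -= self.lr * g
        self.W1 -= self.lr * np.outer(x, dh)
        self.b1 -= self.lr * dh
        return p

# Toy demo: features of eventually-correct vs. incorrect rollouts are drawn
# from two Gaussians; labels arrive as rollouts finish.
head = QualityHead(d_in=4)
for _ in range(300):
    head.update(rng.normal(1.0, 0.3, 4), 1.0)    # finished, verified correct
    head.update(rng.normal(-1.0, 0.3, 4), 0.0)   # finished, verified incorrect
```

After a few hundred online updates on this toy data, the head assigns clearly higher scores to correct-looking partial rollouts than to incorrect-looking ones.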

A sliding-window, binned posterior estimator calibrates the quality head's outputs against empirical outcomes. At a detection length (typically 512 tokens), the backend computes a survival probability for each rollout and dynamically discards low-potential ones to rebalance the group. This enables efficient GPU scheduling and yields the sharper within-group reward variance needed for effective RL updates.
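The calibration step can be sketched as a sliding window of (score bin, outcome) pairs mapping raw scores to empirical success rates; the bin count, window size, and smoothing are illustrative choices, not the paper's exact settings:

```python
from collections import deque

class BinnedPosterior:
    """Sliding-window, binned calibration of quality-head scores
    into empirical success probabilities."""
    def __init__(self, n_bins=10, window=1000):
        self.n_bins = n_bins
        self.hist = deque(maxlen=window)   # recent (bin_index, outcome) pairs

    def _bin(self, score):
        return min(int(score * self.n_bins), self.n_bins - 1)

    def observe(self, score, correct):
        """Record a finished rollout's raw score and verified outcome."""
        self.hist.append((self._bin(score), int(correct)))

    def posterior(self, score):
        """Empirical P(correct) for this score's bin over the window."""
        b = self._bin(score)
        outcomes = [c for (bb, c) in self.hist if bb == b]
        if not outcomes:          # no evidence yet: fall back to the raw score
            return score
        # Laplace-smoothed empirical success rate in this bin
        return (sum(outcomes) + 1) / (len(outcomes) + 2)
```

The bounded deque keeps the estimate tracking the current policy as it improves, rather than averaging over stale rollouts.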

System Architecture

ARRoL is integrated into a two-part pipeline: a backend (built on vLLM) handles generation, online pruning, and quality prediction, while a frontend (built on verl) computes log-probabilities, rewards, and policy updates. Rollouts that survive pruning are re-batched, saving both compute and memory throughout the RL pipeline.

Figure 3: The ARRoL system: Backend interleaves online quality evaluation and pruning with generation; frontend processes only surviving rollouts for gradient and optimization steps.
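A minimal sketch of the re-batching hand-off, assuming survivors arrive as variable-length token-id lists (the `pad_id` and list-of-lists layout are illustrative):

```python
def rebatch(survivors, pad_id=0):
    """Pad surviving rollouts to a common length so the frontend computes
    log-probabilities only over sequences that passed pruning."""
    max_len = max(len(seq) for seq in survivors)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in survivors]

# Only the survivors reach the expensive log-prob and gradient steps.
batch = rebatch([[11, 12, 13], [21, 22]])
```

Because pruned rollouts never leave the backend, the frontend's batch is smaller and its padding shorter, which is where the compute and memory savings come from.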

Test-Time Scaling

ARRoL’s quality head further enables improved test-time scaling (TTS): when sampling multiple candidates at inference, quality scores are used as voting weights instead of a naive majority or empirical trace confidence. This score-weighted aggregation boosts accuracy, especially in settings where standard confidence proxies fail to align with actual solution correctness.
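The score-weighted aggregation described above can be sketched as follows (a toy illustration of the voting rule, not the paper's exact implementation):

```python
from collections import defaultdict

def weighted_vote(answers, scores):
    """Quality-weighted test-time aggregation: each sampled candidate
    votes with its quality-head score rather than a unit weight."""
    tally = defaultdict(float)
    for answer, score in zip(answers, scores):
        tally[answer] += score
    return max(tally, key=tally.get)

# A single high-quality trace can outvote a low-quality majority.
print(weighted_vote(["42", "42", "41"], [0.2, 0.1, 0.9]))
```

With unit weights this example would return "42" by simple majority; the quality scores flip the outcome toward the trace the head trusts most.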

Experimental Results

Experiments span Qwen-3 and LLaMA-3.2 models (1B-8B) and math-reasoning benchmarks including Math500, MinervaMath, and AMC'23, evaluating both accuracy and wall-clock efficiency:

  • Accuracy gains: On Math500/MinervaMath and challenge sets, ARRoL outperforms vanilla GRPO by +2.30 to +2.99 points, with much larger boosts (+7.50 to +10.00) on certain high-difficulty math problems.
  • Wall-clock speedup: End-to-end training achieves up to a 1.6-1.7x speedup across both the rollout-generation and optimization phases. Major reductions come from dropping redundant or low-value rollouts before the expensive log-probability and gradient steps.
  • Test-time scaling: Using the quality head as a voting mechanism yields test-accuracy increases of up to +8.33 over previous TTS methods based on log-likelihood heuristics.

ARRoL also robustly outperforms random pruning, increasing within-group reward variance and thus the potential for meaningful policy-gradient updates. Ablations over the keep ratio confirm a strong effectiveness/efficiency trade-off, with an intermediate keep ratio performing best.

Practical and Theoretical Implications

ARRoL delivers a general, deployable mechanism to unify efficiency and reward-signal control in RLVR. Practically, the system can be adopted in any transformer RLVR pipeline with minimal architectural change. Its learnable confidence signal is task-adaptive and robust against failure modes intrinsic to token-entropy-based confidence proxies. Theoretically, ARRoL's pruning strategy is proven to optimize reward balance in finite groups, making it suitable for domains where binary reward collapse and sparse gradients are major challenges, e.g., tool-use, verification, or complex program synthesis.

Potential extensions include:

  • Adapting online quality-head-guided pruning to imperfectly verifiable or continuous reward spaces.
  • Integration with advanced speculative decoding and parallelized TTS.
  • Cross-domain adaptation where reward function calibration may itself be uncertain.

Conclusion

ARRoL establishes online, reward-balance-driven rollout pruning as an effective and efficient paradigm for RLVR-based LLM training. By combining a lightweight, high-quality prediction head with dynamic pruning and robust system integration, ARRoL achieves both substantial accuracy gains and strong computational savings. The framework further extends to test-time aggregation, suggesting future directions for confidence-weighted sampling in both training and inference. This advances the state of efficient RL for reasoning-centric LLMs and provides theoretical foundations for reward variance maximization through online sampling interventions (2603.24840).
