
APRIL: Active Partial Rollouts in RL

Updated 7 January 2026
  • APRIL is a reinforcement learning scheduling method that over-provisions rollouts and recycles partial trajectories to mitigate long-tail latency.
  • It addresses computational inefficiencies by terminating extra rollouts early and reusing incomplete responses, thus optimizing GPU utilization.
  • Empirical results show 20–44% throughput gains and up to 8% accuracy improvements, with seamless integration into existing RL pipelines.

Active Partial Rollouts in Reinforcement Learning (APRIL) is a combined system-level and algorithmic technique aimed at improving the efficiency of reinforcement learning (RL) for LLMs, specifically by addressing the computational bottleneck caused by the long-tail distribution of rollout response lengths. APRIL leverages over-provisioned, preemptable rollout generation and systematic recycling of incomplete responses to maximize GPU utilization, reduce rollout-phase wall-clock time, and maintain strict data efficiency. Experiments demonstrate 20–44% throughput gains and up to 8% higher final accuracy across multiple RL objectives and model sizes, all within a hardware- and framework-agnostic design that allows seamless integration into existing RL pipelines for LLM training (Zhou et al., 23 Sep 2025).

1. Long-Tail Effects in RL Training for LLMs

In standard on-policy RL for LLMs, a batch of $N$ prompts is processed synchronously by an inference engine, producing auto-regressive trajectories whose lengths follow a heavy-tailed distribution. Empirically, most sequences terminate quickly, but a minority—the "long tail"—approach the maximal generation length, causing overall batch completion to be gated by the slowest trajectories.

Formally, if $X_1, \dots, X_N \overset{\text{i.i.d.}}{\sim} F_X$ are rollout generation times, the standard batch completion time is

$$T_{\text{std}} = \max_{i=1,\dots,N} X_i = X_{(N)},$$

and the expected per-batch idle time is

$$\Delta_{\text{std}} = \mathbb{E}[X_{(N)}] - \mathbb{E}[X].$$

For long-tailed $F_X$ (e.g., Pareto or log-normal), $\mathbb{E}[X_{(N)}]$ scales rapidly with $N$, leading to $\Delta_{\text{std}}$ values that exceed 30–40% of wall-clock time. This results in severe underutilization of GPU resources during synchronous batch rollout (Zhou et al., 23 Sep 2025).
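To make the scale of this idle time concrete, the following minimal Monte Carlo sketch estimates the fraction of batch wall-clock time spent idle under synchronous rollout. It assumes log-normal rollout times with illustrative parameters; none of these values are taken from the paper.

```python
import numpy as np

def expected_idle_fraction(batch_size=256, mu=4.0, sigma=1.0,
                           n_trials=2000, seed=0):
    """Monte Carlo estimate of Delta_std / E[X_(N)], the fraction of batch
    wall-clock time spent idle, assuming log-normal rollout times.
    batch_size, mu, sigma, and n_trials are illustrative, not values
    reported in the APRIL paper."""
    rng = np.random.default_rng(seed)
    # Each row is one synchronous batch of rollout generation times.
    times = rng.lognormal(mean=mu, sigma=sigma, size=(n_trials, batch_size))
    batch_wall_clock = times.max(axis=1)                # T_std = X_(N) per batch
    delta_std = batch_wall_clock.mean() - times.mean()  # E[X_(N)] - E[X]
    return delta_std / batch_wall_clock.mean()

if __name__ == "__main__":
    print(f"estimated idle fraction per batch: {expected_idle_fraction():.1%}")
```

With heavier tails (larger sigma) or larger batches, the estimated idle fraction grows further; this is the effect behind the 30–40% figure cited above.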

2. APRIL Algorithm: Over-Provisioned and Partial Rollout Scheduling

APRIL reduces tail-induced inefficiency by over-provisioning rollout requests, terminating trajectories as soon as the batch quota is met, and recycling partial rollouts for continuation. The methodology is captured by the following workflow:

  1. Over-provisioned rollouts: Initiate $N'$ trajectories, where $N' = rN$ ($r > 1$).
  2. Active collection: Accept the first $N$ full completions as the batch, abort all remaining rollouts upon reaching this quota.
  3. Partial recycling: Store all incomplete (aborted) trajectories in a FIFO buffer $\mathcal{B}$.
  4. Continuation: In subsequent RL steps, resume rollout of partials from $\mathcal{B}$ under the current policy.
  5. Policy update: Use only completed rollouts for policy optimization.

The paper formalizes these steps and the buffer management in pseudocode. The design ensures that no rollout computation is wasted: partial rollouts are deterministically resumed, and the system guarantees strict data utilization without discarding tokens (Zhou et al., 23 Sep 2025).
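That pseudocode is not reproduced here; the Python sketch below illustrates the same scheduling loop under an assumed, hypothetical inference-engine interface (`submit`, `poll_finished`, `abort`). These are placeholder calls, not APIs from the paper or from any specific engine.

```python
from collections import deque

class AprilScheduler:
    """Sketch of APRIL-style over-provisioned, preemptable rollout scheduling.

    `engine` is a hypothetical wrapper exposing:
      submit(prompt, prefix) -> handle
      poll_finished()        -> list of finished rollouts (each keeps its handle)
      abort(handle)          -> partial (incomplete) rollout
    None of these are real APIs; they stand in for whatever the underlying
    inference engine provides."""

    def __init__(self, engine, batch_size, oversample_ratio=2.0):
        self.engine = engine
        self.N = batch_size
        self.r = oversample_ratio
        self.buffer = deque()  # FIFO buffer B of partial rollouts

    def rollout_step(self, prompts):
        # 1. Over-provision: launch r*N requests, preferring recycled partials.
        handles = []
        for i in range(int(self.r * self.N)):
            if self.buffer:
                partial = self.buffer.popleft()
                handles.append(self.engine.submit(partial.prompt,
                                                  prefix=partial.tokens))
            else:
                handles.append(self.engine.submit(prompts[i % len(prompts)],
                                                  prefix=None))

        # 2. Active collection: accept the first N completed rollouts.
        completed = []
        while len(completed) < self.N:
            completed.extend(self.engine.poll_finished())

        # 3. Partial recycling: abort everything still running and push the
        #    aborted (incomplete) rollouts onto the FIFO buffer.
        finished_ids = {rollout.handle for rollout in completed}
        for h in handles:
            if h not in finished_ids:
                self.buffer.append(self.engine.abort(h))

        # 4. Only fully completed rollouts are used for the policy update.
        return completed[: self.N]
```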

3. Theoretical and Empirical Throughput Gains

Let $X_{(k:n)}$ denote the $k$-th order statistic of $n$ i.i.d. samples from $F_X$. In APRIL, batch processing terminates at $T_{\text{APRIL}} = X_{(N:rN)}$. The expected speedup is thus

$$S(r) = \frac{\mathbb{E}[X_{(N:N)}]}{\mathbb{E}[X_{(N:rN)}]},$$

which, for large $N$, asymptotically approximates $F_X^{-1}(1)/F_X^{-1}(1/r)$ under a continuous model. For heavy-tailed $F_X$, even moderate $r$ (e.g., $r=2$) delivers significant speedup: experimental results show 20–44% improvement in rollout throughput across algorithms (GRPO, DAPO, GSPO) and tasks (Zhou et al., 23 Sep 2025).
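The speedup formula can be checked numerically. The sketch below estimates $S(r)$ by Monte Carlo for an assumed log-normal distribution with illustrative parameters; it measures only the rollout-phase wall clock of a single step and ignores the later cost of finishing recycled partials.

```python
import numpy as np

def estimate_speedup(r, batch_size=256, mu=4.0, sigma=1.0,
                     n_trials=2000, seed=0):
    """Monte Carlo estimate of S(r) = E[X_(N:N)] / E[X_(N:rN)] assuming
    log-normal rollout times; all parameters here are illustrative."""
    rng = np.random.default_rng(seed)
    N = batch_size
    # Baseline: synchronous batch time is the maximum of N samples.
    baseline = rng.lognormal(mu, sigma, size=(n_trials, N)).max(axis=1)
    # APRIL: launch r*N rollouts and stop at the N-th completion,
    # i.e. the N-th order statistic of r*N samples.
    over = np.sort(rng.lognormal(mu, sigma, size=(n_trials, int(r * N))), axis=1)
    april = over[:, N - 1]
    return baseline.mean() / april.mean()

if __name__ == "__main__":
    for r in (1.5, 2.0, 3.0):
        print(f"r = {r}: estimated rollout speedup S(r) ~ {estimate_speedup(r):.2f}x")
```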

Empirical evaluation provides the following observed throughput and accuracy gains:

Model | Algorithm | Dataset | Throughput Gain | Accuracy Gain
Qwen3-8B | GRPO | DeepMath-103K | +44% | +7.5%
Qwen3-8B | DAPO | DeepMath-103K | +10% | +3.0%
Qwen3-4B | GRPO | DeepMath-103K | +35% | +7.2%

APRIL also accelerates convergence (fewer RL steps to target reward) and achieves up to 8% higher final accuracy, attributed in part to the diversity introduced by partial trajectory continuation (Zhou et al., 23 Sep 2025).

4. Compatibility and Integration in RL Infrastructure

APRIL operates as a scheduler enhancement within "two-engine" RL pipelines, which couple inference engines (e.g., vLLM, SGLang) for rollout generation and training engines (e.g., FSDP, Megatron-LM) for policy optimization. APRIL modifies only the inference scheduling layer:

  • Over-provisioned batch requests to the inference engine.
  • Early termination (abort) and recycling of slow rollouts.
  • All protocol, model, and hardware interfaces remain unchanged.

This design has been validated on both NVIDIA (H100/H200) and AMD (MI300) clusters, and integrated into the slime RL framework. No framework-specific or hardware-specialized code changes are necessary, ensuring deployment compatibility with existing RLHF and open RL pipelines (Zhou et al., 23 Sep 2025).
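Conceptually, the integration surface is a single call site in the training loop. The sketch below uses placeholder engine and trainer objects (not real vLLM, SGLang, FSDP, or Megatron-LM APIs) to show that only the rollout-scheduling call changes, reusing the AprilScheduler sketch from Section 2.

```python
# Hypothetical two-engine training step; the engine, trainer, and method
# names are placeholders, not real vLLM/SGLang/FSDP/Megatron-LM APIs.

def train_step_standard(prompts, inference_engine, trainer, N):
    # Baseline: synchronous rollout, gated by the slowest of N trajectories.
    rollouts = inference_engine.generate(prompts, num_rollouts=N)
    trainer.update_policy(rollouts)

def train_step_april(prompts, april_scheduler, trainer):
    # APRIL: only the rollout-scheduling call changes; the training engine,
    # weight-sync protocol, and hardware interfaces are untouched.
    rollouts = april_scheduler.rollout_step(prompts)
    trainer.update_policy(rollouts)
```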

5. Relation to Selective Rollout Filtering

A complementary strategy is GRESO (GRPO with Efficient Selective Rollout), which realizes a different form of "active partial rollout" by predicting and omitting uninformative prompts before rollout. GRESO employs a lightweight, online filtering mechanism using reward-dynamics traces:

  • Each prompt maintains a reward-variance trace.
  • If a prompt is repeatedly zero-variance (produces identical rewards across all responses), it is skipped with a probability increasing in the length of the zero-variance streak.
  • The skip rate is controlled adaptively to maintain a target proportion of effective prompts.

GRESO yields up to 2.4× rollout-phase speedup and 2.0× total-training speedup on large math reasoning benchmarks with no significant accuracy loss (Zheng et al., 2 Jun 2025). This approach is orthogonal to APRIL: APRIL targets system-level batch scheduling, while GRESO filters uninformative data at the algorithmic level. A plausible implication is that their combination could provide cumulative gains by reducing both computation and idle time.
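As an illustration of the filtering idea (not the exact probability schedule or adaptive controller used in GRESO), a minimal zero-variance-streak filter might look as follows; the constants are illustrative assumptions.

```python
import random
from collections import defaultdict

class SelectiveRolloutFilter:
    """Illustrative sketch of a GRESO-style prompt filter.

    Skips prompts whose recent rollouts all produced identical rewards
    (zero variance), with a skip probability that grows with the length
    of the zero-variance streak. Constants are illustrative, not the
    values used in the GRESO paper."""

    def __init__(self, base_skip=0.2, max_skip=0.9):
        self.zero_var_streak = defaultdict(int)
        self.base_skip = base_skip
        self.max_skip = max_skip

    def should_skip(self, prompt_id):
        streak = self.zero_var_streak[prompt_id]
        if streak == 0:
            return False
        # Skip probability rises with the streak, capped at max_skip.
        return random.random() < min(self.max_skip, self.base_skip * streak)

    def record_rewards(self, prompt_id, rewards):
        # A zero-variance group (all responses scored identically) provides
        # no learning signal for group-relative objectives such as GRPO.
        if len(set(rewards)) <= 1:
            self.zero_var_streak[prompt_id] += 1
        else:
            self.zero_var_streak[prompt_id] = 0
```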

6. Hyperparameters, Limitations, and Future Work

APRIL's principal parameter is the oversampling ratio $r$; $r=2$ was found to be robust across tasks. Higher $r$ increases buffer pressure and the staleness of partials, with diminishing returns and possible risks to the on-policy nature of training if partials are resumed too many steps later. APRIL requires storing approximately 40% of tokens from the previous iteration for recycling, but this fits within existing rollout KV caches without additional GPU RAM overhead.

Potential directions for future development include:

  • Adaptive staleness bounds or importance weighting for resumed partials to maintain on-policy guarantees (see the sketch after this list).
  • Dynamic $r_t$ selection or reinforcement-learned scheduling responsive to current rollout statistics.
  • Integration with inference-level optimizations such as continuous batching or speculative decoding for compound efficiency improvements.
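For the first direction, one generic form such a correction could take (a standard off-policy device, not a mechanism proposed in the APRIL paper, and assuming per-token log-probabilities under both the generating and current policies are retained) is a clipped per-token importance ratio applied to the stale prefix of a resumed partial:

$$w_t = \min\!\left(c,\; \frac{\pi_{\theta_{\text{curr}}}(a_t \mid s_t)}{\pi_{\theta_{\text{gen}}}(a_t \mid s_t)}\right), \qquad t \le t_{\text{resume}},$$

where $\pi_{\theta_{\text{gen}}}$ is the policy that generated the prefix, $\pi_{\theta_{\text{curr}}}$ is the current policy, and $c$ is a clipping constant.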

A plausible implication is that further system-algorithm co-design, building on ideas from both APRIL and active filtering approaches like GRESO, will be central to achieving scalable RL training for the next generation of LLMs (Zhou et al., 23 Sep 2025, Zheng et al., 2 Jun 2025).
