
APRIL: Active Partial Rollouts in RL

Updated 24 September 2025
  • APRIL addresses long-tail inefficiencies in RL rollout generation by over-provisioning extra instances and buffering incomplete trajectories, achieving up to 44% throughput gains.
  • The system integrates seamlessly into existing RL frameworks and supports various GPU hardware, ensuring efficient use of computational resources without altering inference kernels.
  • Empirical results demonstrate that APRIL’s approach yields improved training stability and convergence, with up to 8% higher end-task accuracy and balanced on-/off-policy data usage.

Active Partial Rollouts in Reinforcement Learning (APRIL) refers to a system-level scheduling paradigm for RL training that targets the long-tail inefficiency in rollout generation, an issue that becomes increasingly critical as response lengths and model sizes scale in LLMs. In synchronous RL workflows, a small minority of overly long trajectories can stall entire batches, leaving parallel hardware resources, predominantly GPUs, underutilized. APRIL introduces over-provisioned rollout scheduling and buffer-driven continuation of partial trajectories, yielding higher throughput and overall efficiency, with empirical results confirming substantial speedups and accuracy improvements across standard RL algorithms, model sizes, and hardware backends (Zhou et al., 23 Sep 2025).

1. Motivation and Overview

In RL training for LLMs, response rollout often accounts for over 90% of the total computational runtime (Zhou et al., 23 Sep 2025). A batch of rollout requests exhibits a long-tail distribution in output length—most requests complete quickly, while a few straggle for thousands of additional tokens. Conventional synchronous frameworks force the entire batch to wait for these long-tail instances, causing hardware resources (GPUs) to idle.
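
As a toy illustration (using made-up numbers, not measurements from the paper), the wall-clock time of a fully synchronous rollout phase is set by the single longest response rather than the typical one:

```python
import random

random.seed(0)

# Toy model: per-request rollout cost is proportional to output length, and
# output lengths follow a heavy-tailed (Pareto) distribution.
batch_size = 256
lengths = [random.paretovariate(1.5) * 500 for _ in range(batch_size)]

mean_len = sum(lengths) / len(lengths)
max_len = max(lengths)

# A synchronous batch cannot advance to the training step until the longest
# response finishes, so its wall-clock time scales with max(), not mean().
print(f"mean output length : {mean_len:8.0f} tokens")
print(f"longest output     : {max_len:8.0f} tokens")
print(f"batch waits roughly {max_len / mean_len:.1f}x longer than the average request needs")
```

With a heavy-tailed length distribution this ratio is routinely several-fold, and that gap is exactly the idle time APRIL targets.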

APRIL introduces an active, system-level schedule for rollout generation specifically to address this bottleneck. Over-provisioning means launching more rollout instances than the required batch size, allowing the batch to “terminate early” once the target number of completed responses is reached. Unfinished responses are not wasted; rather, they are buffered for continuation in subsequent iterations. This approach ensures that compute resources are efficiently utilized, minimizing idle time and avoiding the trade-off between synchronous consistency and asynchronous efficiency.

2. Algorithmic Mechanism

APRIL modifies the standard RL pipeline as follows (a minimal sketch of the resulting rollout loop appears after this list):

  • Over-Provisioning: Instead of initiating N batch rollouts (for batch size N), APRIL triggers N′ > N rollout instances. This means, for every batch, extra rollouts act as backups. As soon as N responses are completed, outstanding rollouts receive an abort signal.
  • Partial Rollout Buffer: Incomplete rollouts (not finished at early termination) are not discarded. These sequences are buffered—saved and resumed in the next rollout phase under the latest policy parameters. Thus, APRIL introduces intentional “partial rollouts,” providing a continually refreshed pool of incomplete trajectories.
  • Mixed On-/Off-Policy Training: Resumed rollouts originated under previous policy parameters and are completed under the current policy. Training data therefore mixes fully on-policy rollouts with resumed, semi-off-policy rollouts. Formally, each update takes a gradient step of the form:

$$\theta_{k+1} \leftarrow \theta_k + \mu \cdot \mathbb{E}_{a\sim\pi(\theta_k)}\left[ R(a)\, \nabla_\theta \log \pi(a; \theta_k) \right]$$

where $R(a)$ may come from a learned reward model, and $\pi(\theta_k)$ denotes the policy under which the trajectory was generated: the current policy for fresh rollouts, and an earlier policy for the initial portion of resumed rollouts.

  • Integration: APRIL requires no changes to inference kernels and is compatible with standard RL frameworks (e.g., slime RL) and hardware (NVIDIA H100/H200, AMD MI300). Specialized memory management logic ensures efficient operation under different GPU backends.
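
Taken together, the mechanism can be summarized by the following minimal Python sketch. It is schematic only: names such as `Rollout`, `generate_step`, and `april_rollout_phase` are assumptions introduced here for illustration, and the real system delegates generation to an inference engine, scores responses with a reward model, and runs distributed across GPUs.

```python
import random
from collections import deque

class Rollout:
    """Illustrative container for a response that may be paused and resumed."""
    def __init__(self, prompt):
        self.prompt = prompt
        self.tokens = []          # partially generated response so far
        self.done = False

def generate_step(rollout, policy_version):
    """Stand-in for one decoding chunk; records which policy produced it."""
    rollout.tokens.append((policy_version, "<tok>"))
    rollout.done = random.random() < 0.05   # toy termination condition
    return rollout

def april_rollout_phase(prompts, n, n_prime, partial_buffer, policy_version):
    """One APRIL rollout phase: launch n' > n rollouts, keep the first n
    completions, abort the rest, and buffer unfinished ones for continuation."""
    # Resume buffered partial rollouts first, then top up with fresh ones.
    active = list(partial_buffer)
    partial_buffer.clear()
    while len(active) < n_prime:
        active.append(Rollout(prompts[len(active) % len(prompts)]))

    finished = []
    while len(finished) < n:
        for r in active:
            if r.done:
                continue
            generate_step(r, policy_version)
            if r.done:
                finished.append(r)
                if len(finished) == n:
                    break        # early termination: abort outstanding rollouts

    # Unfinished rollouts are not discarded: they continue under the next policy.
    partial_buffer.extend(r for r in active if not r.done)
    return finished

# Usage sketch: the returned batch mixes fully on-policy rollouts with resumed,
# semi-off-policy ones; the policy-gradient update from Section 2 is applied here.
buffer = deque()
for step in range(3):
    batch = april_rollout_phase(prompts=["p1", "p2"], n=4, n_prime=6,
                                partial_buffer=buffer, policy_version=step)
    # theta <- theta + mu * E[R(a) * grad log pi(a; theta)]  (not implemented here)
```

The key property is that early termination never discards work: every token generated for a straggler is retained in the partial-rollout buffer and reused once the policy has been updated.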

3. Experimental Results

APRIL was evaluated on widely used RL algorithms (GRPO, DAPO, GSPO) and LLMs (Qwen3-4B, Qwen3-8B), with benchmarks such as DeepMath-103K and AIME-2024 (Zhou et al., 23 Sep 2025). Key metrics and findings are as follows:

Algorithm (setup)                  Throughput Gain (Max)   Accuracy Gain (Max)
GRPO (Qwen3-8B, DeepMath-103K)     44%                     8%
DAPO (AIME-2024)                   8–10%                   2–8%
GSPO                               similar                 similar

APRIL achieves up to 44% improved rollout throughput. It often accelerates convergence and yields up to 8% higher end-task accuracy without introducing instability. Analysis of output lengths confirms that intra-instance variation (multiple responses per input) is much less than batch-level variation—APRIL’s resumption strategy maintains diversity but controls the resampling bottleneck. The approach further helps avoid anomalous “explosions” in rollout lengths late in training, reinforcing robust convergence.

4. Practical System Integration

APRIL operates entirely at the scheduling layer, independent of core inference logic. Its framework- and hardware-agnostic design allows:

  • Plug-and-play augmentation within RL frameworks without kernel-level changes.
  • Seamless deployment on both NVIDIA and AMD GPU environments (demonstrated with 8× H100/H200 and 8× MI300 setups).
  • Utilization of torch_memory_saver and custom ROCm memory handlers for efficient buffer management.

The technique’s generality ensures applicability across upcoming LLM RL training pipelines and hardware platforms.

5. Impacts and Future Directions

APRIL’s system-level optimization bridges a longstanding gap between RL algorithm efficiency and high-performance hardware utilization. Efficient handling of the long-tail in rollout generation enables scalable RL training for ever-larger LLMs. The approach introduces a blend of on-policy and off-policy data that acts as a regularization mechanism, contributing to improved training stability.

Potential avenues for future work include:

  • Dynamically adapting the degree of over-provisioning based on empirical rollout-length distributions (a possible heuristic is sketched after this list).
  • Fine-tuning the interplay between on-policy and resumed data for optimal convergence.
  • Integrating APRIL with speculative decoding or continuous batching to further improve inference efficiency.
  • Extending the partial rollout scheduling to fully asynchronous RL systems and settings with extreme variation in response lengths.
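
As one concrete example of the first direction, a simple heuristic (a hypothetical sketch, not something proposed or evaluated in the paper) could set the over-provisioning factor from recently observed output lengths:

```python
def adaptive_overprovision(observed_lengths, batch_size_n,
                           straggler_factor=4.0, max_factor=2.0):
    """Hypothetical heuristic: treat requests whose output length exceeds
    straggler_factor x the median as likely stragglers, and over-provision the
    next rollout phase by that fraction, capped at max_factor x batch size."""
    if not observed_lengths:
        return batch_size_n            # no history yet: plain synchronous batch
    lengths = sorted(observed_lengths)
    median = lengths[len(lengths) // 2]
    straggler_frac = sum(l > straggler_factor * median for l in lengths) / len(lengths)
    n_prime = int(batch_size_n * (1.0 + straggler_frac))
    return min(max(n_prime, batch_size_n), int(max_factor * batch_size_n))

# Example: 10% of recent responses were >4x the median length, so N=256 -> N'=281.
print(adaptive_overprovision([100] * 90 + [1000] * 10, batch_size_n=256))
```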

APRIL offers a unified view of system and algorithmic RL efficiency. By actively terminating slow rollouts and recycling partial trajectories, it sets a baseline for future RL frameworks to scale LLM training without succumbing to inefficiency induced by long-tail generation dynamics (Zhou et al., 23 Sep 2025).
