- The paper introduces a novel asynchronous RL system that decouples generation and training, significantly improving GPU utilization and training speed.
- The system leverages interruptible rollout workers and a decoupled PPO objective to address data staleness and inconsistent policy versions during training.
- Experimental results show up to 2.57x speedup in training throughput and near-linear scalability on up to 512 GPUs compared to synchronous methods.
Training LLMs with Reinforcement Learning (RL) at scale, particularly for complex reasoning tasks (producing Large Reasoning Models, or LRMs), presents significant system challenges. Traditional synchronous RL systems for LLMs alternate strictly between generating model outputs (rollouts) and training the model on those outputs. This ensures training always uses the latest model version but suffers from severe inefficiencies: the generation phase must wait for the longest sequence in a batch to complete, leaving GPUs underutilized, and synchronous systems scale poorly as per-device batch sizes decrease, often becoming memory-I/O bound. Recent attempts at partially overlapping generation and training still rely on batched generation, so the inefficiency persists.
AReaL (AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning) (2505.24298) introduces a fully asynchronous RL system designed to overcome these limitations by completely decoupling the generation and training phases. The core idea is to allow rollout workers to continuously generate data without waiting for training, while trainer workers update the model whenever a batch of data is ready.
System Architecture
AReaL employs a decoupled architecture with dedicated GPU clusters for generation and training. The system consists of four main components:
- Interruptible Rollout Worker: These workers handle generation requests and continuously produce outputs. Crucially, they can be interrupted mid-generation to load new model parameters from the trainer. When interrupted, KV caches from the old weights are discarded and recomputed with the new weights before generation resumes. This means a single generated trajectory can consist of segments produced by different model versions, which poses an algorithmic challenge.
- Reward Service: This component evaluates the quality of generated responses, providing feedback (e.g., correctness on math problems or code execution results) used as the reward signal for RL training.
- Trainer Workers: These workers sample collected trajectories from a replay buffer, accumulate data to form training batches, and perform PPO updates on the LLM. Data is consumed from the buffer only once to maintain freshness.
- Rollout Controller: This orchestrates the data flow between components. It fetches prompts, sends them to rollout workers, receives generated responses, sends them to the Reward Service, and stores trajectories and rewards in the replay buffer. It also manages model weight synchronization and triggers updates in rollout workers based on the trainer's progress. The controller enforces constraints on data staleness by monitoring the version discrepancy between generated data and the latest model.
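The control flow of the controller can be summarized with a short sketch. This is an illustrative Python skeleton, not AReaL's actual API: the class and field names (`RolloutController`, `Trajectory`, `max_staleness`, and so on) are invented for exposition, and weight transfer and generation itself are omitted.

```python
# Hypothetical sketch of the rollout-controller data flow (names are ours, not AReaL's).
# The controller admits finished trajectories into the replay buffer only if their
# oldest segment is no more than `max_staleness` (eta) versions behind the trainer.

import queue
import threading
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    prompt: str
    tokens: list           # generated token ids
    policy_versions: list  # weight version that produced each segment
    reward: float = 0.0

@dataclass
class RolloutController:
    max_staleness: int = 4                 # eta in the paper
    trainer_version: int = 0               # latest trained policy version
    buffer: "queue.Queue[Trajectory]" = field(default_factory=queue.Queue)
    lock: threading.Lock = field(default_factory=threading.Lock)

    def on_trainer_step(self) -> None:
        # Called after every PPO update; rollout workers are then told to interrupt
        # generation and reload the new weights (weight transfer omitted here).
        with self.lock:
            self.trainer_version += 1

    def submit(self, traj: Trajectory, reward: float) -> None:
        # Admit a finished, reward-scored trajectory only if it is fresh enough.
        with self.lock:
            oldest = min(traj.policy_versions)
            if self.trainer_version - oldest <= self.max_staleness:
                traj.reward = reward
                self.buffer.put(traj)  # consumed exactly once by the trainer

    def next_batch(self, batch_size: int) -> list:
        # Trainer blocks until a full batch of sufficiently fresh data is available.
        return [self.buffer.get() for _ in range(batch_size)]
```

Because the FIFO buffer hands out the oldest admitted trajectories first, data close to the staleness bound is consumed before it expires, matching the controller's freshness policy described above.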
<p align="center">
<img src="https://github.com/inclusionAI/AReaL/raw/main/figures/overview.pdf" alt="AReaL System Architecture" width="600"/>
<br/>
<em>AReaL System Architecture (2505.24298)</em>
</p>
Algorithmic Innovations
The asynchronous nature of AReaL introduces challenges related to data staleness and inconsistent policy versions within trajectories. AReaL addresses these with:
- Staleness-Aware Training: AReaL explicitly controls the maximum allowed staleness (version difference) between the model generating data and the model being trained using a hyperparameter η. The rollout controller ensures that data used for training doesn't exceed this staleness bound, prioritizing older data up to the limit.
- Decoupled PPO Objective: To handle trajectories generated by multiple policy versions due to interruptions, AReaL adopts a decoupled PPO objective. This objective separates the behavior policy (the policy used to sample actions) from a proximal policy (a recent policy used for trust-region regularization). The standard PPO objective assumes the behavior and proximal policies are the same ($\pi_{\text{old}}$). The decoupled objective performs importance sampling with respect to the proximal policy, which can be more recent than the policy that generated the data. This allows training on trajectories composed of segments from different policy versions while maintaining algorithmic correctness. The paper proves that an interrupted generation process can be viewed as sampling from a single combined behavior policy (Proposition 1). In practice, the policy parameters immediately before each training update step are used as the proximal policy.
The decoupled PPO objective is formulated as:
$$
J(\theta) = \mathbb{E}_{q \sim \mathcal{D},\, a_t \sim \pi_{\text{bhv}}}\left[ \sum_{t=1}^{H} \frac{\pi_{\text{prox}}(a_t \mid s_t)}{\pi_{\text{bhv}}(a_t \mid s_t)} \min\left( u_t^{\text{prox}}(\theta)\, \hat{A}_t,\ \operatorname{clip}\left(u_t^{\text{prox}}(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right) \right]
$$

where $u_t^{\text{prox}}(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{prox}}(a_t \mid s_t)}$ and $\pi_{\text{bhv}}$ is the behavior policy (potentially composed of multiple versions $\pi_{\theta+i}$ due to interruptions).
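As a concrete illustration, a per-token PyTorch version of this loss might look as follows. This is a minimal sketch under the assumption that per-token log-probabilities of the sampled actions under the current, proximal, and behavior policies have already been computed; the function and argument names are ours, not the paper's.

```python
import torch

def decoupled_ppo_loss(logp_theta, logp_prox, logp_bhv, advantages, eps=0.2):
    """Decoupled PPO surrogate loss over per-token tensors (illustrative sketch)."""
    # Trust-region ratio is taken w.r.t. the proximal policy, not the behavior policy.
    u_prox = torch.exp(logp_theta - logp_prox)
    clipped = torch.clamp(u_prox, 1.0 - eps, 1.0 + eps)
    surrogate = torch.minimum(u_prox * advantages, clipped * advantages)

    # Off-policy correction for data sampled by (possibly several) behavior policies.
    # Treated as a constant importance weight, so no gradient flows through it.
    with torch.no_grad():
        w_bhv = torch.exp(logp_prox - logp_bhv)

    # We maximize the objective, so return its negative as the loss.
    return -(w_bhv * surrogate).mean()
```

Because the clipping acts on the ratio to the proximal policy while the behavior-policy correction is a fixed weight, the same update rule applies whether a token was generated by the latest weights or by an older version within the staleness bound.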
Implementation Details and System Optimizations
AReaL is implemented using Python and PyTorch, built on top of ReaLHF and leveraging SGLang for generation and Megatron-Core for training. Key system-level optimizations include:
- Decoupling CPU/GPU Operations: Rule-based reward computation and data transfer are offloaded to separate threads and pipelined with generation so that neither blocks the GPU-side generation loop (see the first sketch after this list).
- Interruptible Rollout Workers: Allow new weights to be loaded mid-generation, improving utilization by not waiting for full sequence completion.
- Dynamic Microbatch Allocation: A padding-free sequence packing strategy is used in the trainer to handle variable-length sequences efficiently. An algorithm (Algorithm 1 in the paper) dynamically allocates sequences to micro-batches under memory constraints, maximizing GPU memory utilization and minimizing wasted computation.
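The reward-pipelining idea in the first bullet can be sketched with a plain worker thread and queues. The structure and names below are hypothetical, not AReaL's implementation; `reward_fn` stands in for any rule-based check (e.g., verifying a math answer).

```python
import queue
import threading

def reward_fn(response: str) -> float:
    # Placeholder for a rule-based reward, e.g. checking a final answer.
    return 1.0 if response.strip().endswith("42") else 0.0

pending = queue.Queue()  # (prompt, response) pairs awaiting reward
scored = queue.Queue()   # (prompt, response, reward) ready for the replay buffer

def reward_worker():
    # CPU-bound scoring runs here, overlapped with GPU-side generation.
    while True:
        prompt, response = pending.get()
        scored.put((prompt, response, reward_fn(response)))
        pending.task_done()

threading.Thread(target=reward_worker, daemon=True).start()

# The generation loop only enqueues and immediately moves to the next request:
pending.put(("What is 6 * 7?", "The answer is 42"))
pending.join()
print(scored.get())  # ('What is 6 * 7?', 'The answer is 42', 1.0)
```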
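The second sketch shows the flavor of dynamic microbatch allocation as a greedy first-fit-decreasing packing of unpadded sequences under a token budget. It is a simplification for illustration, not Algorithm 1 from the paper.

```python
def allocate_microbatches(seq_lens, max_tokens_per_mb):
    """Greedy first-fit-decreasing packing; returns lists of sequence indices."""
    # Place longer sequences first so short ones can fill the remaining gaps.
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    microbatches = []  # each entry is a list of sequence indices
    loads = []         # total tokens currently packed into each micro-batch
    for i in order:
        placed = False
        for k, load in enumerate(loads):
            if load + seq_lens[i] <= max_tokens_per_mb:
                microbatches[k].append(i)
                loads[k] += seq_lens[i]
                placed = True
                break
        if not placed:  # open a new micro-batch when nothing fits
            microbatches.append([i])
            loads.append(seq_lens[i])
    return microbatches

# Example: pack sequences of uneven length under a 4096-token budget.
print(allocate_microbatches([3100, 900, 2500, 1200, 700], max_tokens_per_mb=4096))
```

Packing under a token budget rather than a fixed sequence count avoids padding waste and keeps each micro-batch near the memory limit, which is where the reported ~30% PPO training throughput gain comes from.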
Experimental Evaluation
Experiments were conducted on math and code reasoning tasks using Qwen2/Qwen2.5 models ranging from 1.5B to 32B parameters on an H800 GPU cluster.
- End-to-End Performance: AReaL was compared against synchronous systems like DeepScaleR, DeepCoder, and a synchronous AReaL variant. AReaL consistently achieved comparable or better final model performance on benchmarks like AIME24 and LiveCodeBench while significantly reducing training time (up to 2.57x speedup in training throughput).
- Scalability: AReaL demonstrated near-linear scaling efficiency with increasing device count (up to 512 GPUs), outperforming synchronous systems like verl [verl-hybridflow], which often struggled with longer contexts and large models, hitting Out-of-Memory errors. AReaL's asynchronous nature and interruptible generation make it robust to variable and long generation lengths.
- Algorithm Ablations: Ablation studies confirmed the necessity of both controlled staleness and the decoupled PPO objective. Naive PPO with data staleness led to performance degradation. The decoupled PPO objective stabilized training and maintained performance even with moderate staleness (η≤4), which in turn enabled higher throughput.
- System Ablations: Ablations on system features showed that interruptible generation increased throughput by 12-17%, and dynamic microbatch allocation improved PPO training throughput by an average of 30%.
Conclusion
AReaL provides a practical and efficient framework for large-scale RL training of LLMs by adopting a fully asynchronous architecture. By decoupling generation and training, incorporating algorithmic innovations like the decoupled PPO objective to handle data staleness and inconsistent policy versions, and implementing system optimizations like interruptible generation and dynamic batching, AReaL achieves substantial speedups and superior scalability compared to state-of-the-art synchronous systems, without sacrificing model performance. The authors release the code for AReaL, providing a strong foundation for future research and application of large-scale asynchronous RL for LLMs.
Limitations and Future Work
The paper notes limitations, including the fixed inference/training device ratio, which might benefit from dynamic adjustment, especially as context lengths evolve during training. Future work could explore applying AReaL to multi-turn interactions and more complex agentic tasks.