Two-phase RL Pipelines Overview
- This overview describes the two-phase RL pipeline, which separates rollout (data generation) from training (policy updates) to improve modularity and scalability.
- Two-phase structures enable explicit resource specialization by disaggregating inference and training clusters, at the cost of dependency bubbles that scheduling must then reclaim.
- The approach supports RL applications ranging from large language models to control systems, with reported GPU utilization above 90% under optimized scheduling.
A two-phase reinforcement learning (RL) pipeline is any RL system whose training loop is structurally divided into two distinct, sequentially executed phases, most canonically a rollout (data generation) phase and a training (policy update) phase. This pattern has become the de facto standard for RL with large-scale LLMs, tool-augmented vision-language models, interpretable agents, and hybrid RL/control applications. Two-phase architectures enable explicit resource specialization, modularity, and improved scaling, but they also introduce new efficiency challenges. The following sections survey recent research on two-phase RL pipeline structure, scheduling, optimization, and empirical performance.
1. Architectural Definition and Motivation
The canonical two-phase RL pipeline alternates between:
- Rollout phase: The acting policy (e.g., an LLM, agent, or actor network) interacts with the environment, generating trajectories under the current policy (often called "on-policy" rollouts). For LLM-based RL, this corresponds to token-stream generation; for control, it can mean state-action sequence sampling.
- Training phase: The collected rollouts are processed—scored by a reward model, compared to references, or evaluated by value/advantage estimators—and then used to compute policy or value updates. This phase is compute-intensive, often requiring significant memory and optimized parallelism.
The two-phase split is architecturally motivated by the heterogeneous computational demands of inference (rollout) versus training (backpropagation), as well as the need to maintain modularity and scalability across large clusters or multi-agent systems. Strict synchronization between phases, however, creates dependency "bubbles"—periods when some resources idle while awaiting phase completion in other components (Wu et al., 12 Dec 2025, Zhong et al., 22 Apr 2025).
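The size of a dependency bubble under strict synchronization can be illustrated with a toy timing model. The sketch below is not taken from any of the cited systems; `PhaseTimes`, `simulate_sync_pipeline`, and the phase durations are hypothetical names and numbers chosen only to show how strict alternation leaves each disaggregated pool idle while the other phase runs.

```python
from dataclasses import dataclass


@dataclass
class PhaseTimes:
    rollout_s: float  # hypothetical seconds per rollout phase
    train_s: float    # hypothetical seconds per training phase


def simulate_sync_pipeline(times: PhaseTimes, iterations: int) -> dict:
    """Strictly alternate rollout and training; return per-pool utilization."""
    clock = rollout_busy = train_busy = 0.0
    for _ in range(iterations):
        # Rollout phase: the inference pool works, the training pool waits for data.
        clock += times.rollout_s
        rollout_busy += times.rollout_s
        # Training phase: the training pool works, the inference pool waits for weights.
        clock += times.train_s
        train_busy += times.train_s
    return {
        "rollout_pool_util": rollout_busy / clock,
        "train_pool_util": train_busy / clock,
    }


if __name__ == "__main__":
    print(simulate_sync_pipeline(PhaseTimes(rollout_s=30.0, train_s=10.0), iterations=5))
```

With these assumed durations the rollout pool is busy 75% of the time and the training pool only 25%, which is the kind of gap the scheduling techniques in the next section aim to reclaim.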
2. Disaggregation and Scheduling: Overcoming Dependency Bubbles
Disaggregated architectures physically separate the clusters or hardware pools used for rollout (typically inference-optimized GPUs) and for training (compute-optimized GPUs). This maximizes theoretical hardware efficiency but exposes the pipeline to inter-phase dependency bubbles caused by strict on-policy synchronization.
Recent frameworks (e.g., RollMux (Wu et al., 12 Dec 2025), StreamRL (Zhong et al., 22 Apr 2025), SeamlessFlow (Wang et al., 15 Aug 2025)) introduce scheduling, spatiotemporal multiplexing, and asynchrony at the phase and job level to reclaim idle resources:
- Co-execution group abstraction: Jobs are grouped, statically assigned to fixed nodes, and their phases executed in a round-robin schedule. Group residency constraints ensure that massive model states remain in host DRAM for rapid context switching (Wu et al., 12 Dec 2025).
- Two-tier scheduling: Inter-group schedulers perform worst-case stochastic planning for job placement, while intra-group schedulers run a provably optimal cyclic schedule (Wu et al., 12 Dec 2025).
- Streaming and asynchronous generation: Dynamic mini-batch pipelining and overlapped generation/training eliminate pipeline bubbles; a length ranker and skewness-aware dispatching address heavy-tailed trajectory runtimes (Zhong et al., 22 Apr 2025).
- Tag scheduling and capability abstraction: Resources are dynamically tagged according to capability (rollout, training, etc.) and retagged to ensure maximal utilization, with fine-grained pause/resume implemented via a central trajectory/data plane (Wang et al., 15 Aug 2025).
Collectively, these approaches report GPU utilization above 90% (StreamRL, SeamlessFlow) and 100% service-level objective (SLO) attainment (RollMux) on production-scale clusters.
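To make the effect of time-multiplexing concrete, the following is a minimal simulation of a co-execution group that shares one rollout pool and one training pool in round-robin order. It is a simplified model written for illustration, not RollMux's scheduler: the function name, the assumption that each pool serves one job at a time, and the phase durations are all hypothetical.

```python
def simulate_round_robin(jobs, iterations):
    """Simulate jobs sharing one rollout pool and one training pool.

    jobs: list of (rollout_s, train_s) tuples, one per job (hypothetical durations).
    Each job stays on-policy: its next rollout starts only after its own
    training phase has finished.
    """
    rollout_free = train_free = 0.0   # time at which each pool becomes free
    job_ready = [0.0] * len(jobs)     # time at which each job may roll out again
    rollout_busy = train_busy = 0.0
    for _ in range(iterations):
        for j, (rollout_s, train_s) in enumerate(jobs):
            start_rollout = max(rollout_free, job_ready[j])
            rollout_free = start_rollout + rollout_s
            start_train = max(train_free, rollout_free)
            train_free = start_train + train_s
            job_ready[j] = train_free
            rollout_busy += rollout_s
            train_busy += train_s
    makespan = train_free
    return {
        "rollout_pool_util": rollout_busy / makespan,
        "train_pool_util": train_busy / makespan,
    }


if __name__ == "__main__":
    solo = simulate_round_robin([(10.0, 10.0)], iterations=50)
    pair = simulate_round_robin([(10.0, 10.0), (10.0, 10.0)], iterations=50)
    print("one job: ", solo)
    print("two jobs:", pair)
```

In this toy setting a single job keeps each pool busy half the time, while two jobs with complementary phases fill each other's bubbles and push both pools close to full utilization. Real workloads have unequal and variable phase lengths, which is why the cited systems add planning and admission control on top.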
3. RL Algorithmic Structure in Two-Phase Pipelines
Within each phase, the RL algorithm is tightly coupled to pipeline control:
- Rollout phase: Typically, the latest policy is used to generate a predetermined or adaptive batch of trajectories. In LLM settings, memory bandwidth or batch size limits are critical; in robotics or vision tasks, tool-augmented interaction or environment resetting dominates (Chen et al., 3 Dec 2025, Xia et al., 31 May 2025).
- Training phase: Policy or value updates are computed, most often via policy-gradient methods (PPO, GRPO, DAPO), with critic and reward models running on collected rollouts. For on-policy RL, the rollout and training phases must use matched models and often matched numerical precision (Xi et al., 20 Jan 2026).
- Precision flow synchronization: Errors introduced by mismatched rollout/training precisions (e.g., FP8 rollout with BF16 training) break the on-policy assumption and can destabilize optimization; unified precision propagation through both phases is necessary for stable large-scale training (Xi et al., 20 Jan 2026).
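The precision issue can be illustrated by comparing the log-probability that the training graph assigns to a sampled token with the one the rollout graph assigned when sampling it. The sketch below is an assumption-laden toy: `quantize` rounds logits to a coarse grid as a crude stand-in for a low-precision rollout engine (it is not real FP8 arithmetic), and the function names and logits are hypothetical. When the two graphs agree the importance ratio is exactly 1; any mismatch makes the data slightly off-policy.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def quantize(logits, step):
    """Round logits to a grid of width `step` (crude stand-in for low precision)."""
    return [round(x / step) * step for x in logits]


def importance_ratio(logits, token, step):
    """pi_train(token) / pi_rollout(token) for one sampled token."""
    logp_train = log_softmax(logits)[token]
    logp_rollout = log_softmax(quantize(logits, step))[token]
    return math.exp(logp_train - logp_rollout)


if __name__ == "__main__":
    # Logits already on the grid: both graphs agree, so the ratio is 1.
    print(importance_ratio([1.0, 2.0, 0.5], token=1, step=0.5))
    # Logits off the grid: the rollout policy differs from the training policy.
    print(importance_ratio([1.1, 2.3, 0.4], token=1, step=0.5))
```

A sequence-level ratio is the product of per-token ratios, so small per-token deviations can compound over long rollouts; this is one way to see why mismatched precision matters more at extreme sequence lengths.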
For multi-agent or multi-stage reasoning, the two-phase structure may appear in either the interaction protocol (e.g., triage specialist pipeline (Xia et al., 31 May 2025)) or within recurrent alternations (parallel thinking in competitive programming (Zhang et al., 1 Apr 2026)).
4. Specializations and Pipeline Innovations
Recent works extend the classical two-phase architecture with domain-specific adaptations:
- Parallel/Multithreaded Rollouts: In competitive programming, the "parallel thinking" pipeline combines multiple concurrent and sequential solution attempts, verification, and refinement, with aggregate token budgets as high as 7.6M per problem (Zhang et al., 1 Apr 2026).
- Tool-Augmented Phases: For multimodal/vision-language spatial reasoning, a two-phase double-interactive RL pipeline combines a supervised/curriculum "teaching" phase and an interactive RL "exploration" phase, enabling efficient tool coordination (Chen et al., 3 Dec 2025).
- Interpretable Pipelines: Two-phase evolutionarily optimized, glass-box RL pipelines for vision tasks use a feature extraction phase (interpretable kernel convolution) followed by a reasoning phase (decision tree), co-evolved for performance and interpretability (Custode et al., 2022).
- Hybrid RL+MPC Pipelines: In control, an offline robust, goal-conditioned RL value function is learned in Phase 1, then deployed as a terminal cost in online scenario-based MPC in Phase 2, effectively combining RL's exploration power with MPC's constraint satisfaction (Lawrence et al., 10 Feb 2025).
- RL-then-SFT and Cooperative SFT-RL: In multimodal reasoning and LLM training, two-phase paradigms also structure learning itself—either as explicit RL followed by expert-assisted SFT enhancement (Metis-RISE (Qiu et al., 16 Jun 2025)), or as strictly decoupled SFT→RL (with known forgetting/exploration limitations), or as joint bilevel cooperative optimization (BRIDGE (Chen et al., 8 Sep 2025)).
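For the hybrid RL+MPC item above, the role of the learned value function can be written as a generic scenario-based MPC objective. The notation here is an assumption introduced for illustration and is not taken from the cited paper: $H$ is the horizon, $S$ the number of sampled disturbance scenarios, $\ell$ the stage cost, $f$ the dynamics, $g$ the goal, and $V_\theta$ the Phase 1 value function expressed as a cost-to-go (if it is learned as a reward-based value, its negative would be used instead).

```latex
\begin{aligned}
\min_{u_0,\dots,u_{H-1}} \quad
  & \frac{1}{S}\sum_{s=1}^{S}\left[\,\sum_{k=0}^{H-1}\ell\!\left(x_k^{(s)},u_k\right)
    + V_\theta\!\left(x_H^{(s)},\,g\right)\right] \\
\text{s.t.} \quad
  & x_{k+1}^{(s)} = f\!\left(x_k^{(s)},u_k,w_k^{(s)}\right), \qquad x_0^{(s)} = x_0, \\
  & x_k^{(s)} \in \mathcal{X}, \quad u_k \in \mathcal{U},
    \qquad k = 0,\dots,H-1,\;\; s = 1,\dots,S .
\end{aligned}
```

The terminal term lets a short online horizon account for long-term cost learned offline, while the explicit constraints are enforced by the MPC solver rather than by the RL policy.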
5. Empirical Results and Efficiency Trade-Offs
Reported evaluations indicate that the scheduling and architectural innovations described above yield measurable gains over naive two-phase baselines. The headline figures from each system are summarized below:
| System | Idle Elimination/Utilization | Cost Efficiency | Notable Gains and Benchmarks |
|---|---|---|---|
| RollMux (Wu et al., 12 Dec 2025) | >2x reduction in idle/bubble time, 100% SLO attainment | 1.84x over naive disaggregation, 1.38x over co-location | H800 usage down 2.16x, jobs 3B–32B, 4k–32k tokens |
| StreamRL (Zhong et al., 22 Apr 2025) | Utilization up to 90+% | 1.31–1.33x in cross-DC | 1.30–2.66x throughput vs. SoTA, large clusters |
| SeamlessFlow (Wang et al., 15 Aug 2025) | Empirical GPU utilization >95% | N/A | 2x sample/sec vs. VERL, near-linear scaling |
| Jet-RL (Xi et al., 20 Jan 2026) | Stability at ultra-long rollout, no collapse | 16% end-to-end speedup | <3% accuracy loss, up to 1.8x inference/training speedup |
| DIRL (Chen et al., 3 Dec 2025) | N/A | N/A | +12–16% absolute gains, SOTA spatial reasoning |
| Metis-RISE (Qiu et al., 16 Jun 2025) | N/A | N/A | +4.8% avg (RL only) +2.4% (SFT) vs. strong SFT baseline |
| BRIDGE (Chen et al., 8 Sep 2025) | 44% faster vs. classic SFT→RL | N/A | +13% avg. accuracy over cold-start |
Across these systems, best practices include: fine-grained scheduling and residency enforcement, conservative admission control, warm-start context switching, topology-aware model sync, and round-robin time-multiplexed scheduling within small execution groups.
6. Interpretability, Generalization, and Limitations
Two-phase pipelines admit both black-box and glass-box realizations. The interpretable pipeline of Custode et al. (2022) demonstrates that glass-box, two-phase RL architectures (convolutional high-level feature extractors plus decision trees) can match deep network performance in deterministic Atari, though performance degrades under stochasticity. In reasoning LLMs, explicit two-phase RL-then-SFT or bilevel SFT-RL approaches address known exploration and catastrophic forgetting problems, though sample efficiency and generalization across domains or tools remain active research areas.
Known limitations include:
- Sensitivity to straggler/long-tail trajectory completion in rollout;
- Increased scheduling and cluster provisioning complexity;
- The need for large DRAM capacities for state residency;
- Stability issues when rollout and training graphs diverge in numerical precision or architectural detail, particularly at extreme sequence lengths or model sizes.
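The straggler limitation in the first item can be illustrated with a simple dispatch comparison. The sketch below uses the classic longest-processing-time-first greedy heuristic as a stand-in; it is not StreamRL's length ranker or dispatcher, and the function names and predicted lengths are hypothetical. It assumes some predictor has already estimated each request's generation length.

```python
import heapq


def dispatch_round_robin(predicted_lengths, num_workers):
    """Baseline: assign requests to workers by index; return the makespan."""
    loads = [0] * num_workers
    for idx, length in enumerate(predicted_lengths):
        loads[idx % num_workers] += length
    return max(loads)


def dispatch_longest_first(predicted_lengths, num_workers):
    """Greedy LPT: place the longest remaining request on the least-loaded worker."""
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}
    ranked = sorted(enumerate(predicted_lengths), key=lambda pair: -pair[1])
    for idx, length in ranked:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(idx)
        heapq.heappush(heap, (load + length, worker))
    makespan = max(load for load, _ in heap)
    return assignment, makespan


if __name__ == "__main__":
    lengths = [8, 1, 8, 1, 1, 1]  # hypothetical predicted lengths, heavy-tailed
    print("round-robin makespan:", dispatch_round_robin(lengths, 2))
    print("longest-first:", dispatch_longest_first(lengths, 2))
```

On this toy input, index-based dispatch puts both long requests on the same worker, whereas ranking by predicted length spreads them out and shortens the rollout phase. The benefit depends on how accurate the length predictions are.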
7. Outlook and Best Practices
As two-phase RL pipelines are increasingly deployed at scale, best practices are converging:
- Partition clusters into small co-execution groups to ensure state residency.
- Employ conservative job admission and worst-case phase duration estimates.
- Use provably optimal intra-group schedules (e.g., round-robin) unless additional job interaction models warrant adaptation.
- Exploit structural dependency bubbles for multiplexing and apply hierarchical, topology-aware model sync.
- Monitor and relabel straggler and tail jobs for batch migration, maintaining strict SLO adherence.
- For glass-box or interpretable settings, evolve modular phase-specific components (feature extractors, decision logic) with joint fitness evaluation (Custode et al., 2022).
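The first two practices (state residency and conservative admission with worst-case estimates) can be sketched as a simple admission check. Everything here is a hypothetical illustration rather than the planner of any cited system: the `Job` fields, the `headroom` and `slo_factor` parameters, and the rule itself are assumptions. The load test is a necessary condition only; passing it does not prove that a feasible schedule exists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    state_gib: float        # model/optimizer state that must stay resident in host DRAM
    worst_rollout_s: float  # worst-case rollout phase duration
    worst_train_s: float    # worst-case training phase duration


def can_admit(group, candidate, host_dram_gib, headroom=0.1, slo_factor=1.0):
    """Return True if `candidate` may join `group` under two conservative checks."""
    jobs = [*group, candidate]
    # Residency: all job states must fit in host DRAM with some headroom.
    if sum(j.state_gib for j in jobs) > host_dram_gib * (1.0 - headroom):
        return False
    # Load: each job must complete one iteration within slo_factor times its
    # solo cycle, so the per-pool demand rates must not exceed 1.
    rollout_load = train_load = 0.0
    for j in jobs:
        budget = slo_factor * (j.worst_rollout_s + j.worst_train_s)
        rollout_load += j.worst_rollout_s / budget
        train_load += j.worst_train_s / budget
    return rollout_load <= 1.0 and train_load <= 1.0


if __name__ == "__main__":
    group = [Job("a", state_gib=100, worst_rollout_s=10, worst_train_s=10)]
    print(can_admit(group, Job("b", 100, 10, 10), host_dram_gib=512))  # balanced peer
    print(can_admit(group, Job("c", 50, 30, 10), host_dram_gib=512))   # rollout-heavy
    print(can_admit(group, Job("d", 400, 10, 10), host_dram_gib=512))  # too large for DRAM
```

Using worst-case durations makes the check pessimistic by design: it may reject groups that would work on average, in exchange for a lower risk of SLO violations.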
This two-phase paradigm, in varied instantiations, extends beyond LLM post-training to tool-augmented spatial VLMs, curriculum-based multi-agent medical systems, hybrid RL/control systems, and interpretable agents. Its efficiency, modularity, and extensibility underpin much of the recent empirical progress in large-scale, high-reliability RL deployments.