RLSim: High-Fidelity RL Simulator

Updated 10 January 2026
  • RLSim is a high-fidelity simulator that decouples generation and training in RL pipelines to rigorously analyze compute, communication, and network orchestration.
  • It features a modular architecture with a compute engine, network simulator, and control-plane interface that supports dynamic parallelism and adaptive scheduling.
  • Validation demonstrates throughput within 5% of physical testbeds and network metrics within 7%, enabling cost-efficient design of large-scale RL infrastructures.

RLSim is a high-fidelity, modular simulator specifically developed to evaluate the performance and cost-efficiency of large-scale, disaggregated reinforcement learning (RL) systems with a focus on novel data center interconnects, particularly hybrid optical–electrical fabrics such as RFabric. RLSim enables rigorous analysis of complex RL pipelines that disaggregate generation and training stages, capturing the dynamics of asynchronous, parallel workloads and the associated network requirements. Its design directly supports the characterization of compute, communication, and network orchestration strategies at scale, serving as a foundational tool for architects and researchers investigating next-generation RL deployments (Tan et al., 3 Jan 2026).

1. System Architecture and Subsystems

RLSim is organized into three primary subsystems, each responsible for a critical aspect of RL system simulation; a structural sketch follows the list:

  • Compute & RL-Pipeline Engine: Simulates a disaggregated RL workflow with two asynchronous stages—sample generation (token-by-token decoding) and model training (gradient-by-layer computations). The engine supports arbitrary numbers of Points-of-Delivery (PoDs), with each PoD comprising one or more GPU servers modeled with parameterizable values for raw FLOPS, memory bandwidth, KV-cache size, and interconnect speeds (PCIe/NVLink). Parallelism modes, including tensor parallelism (TP), expert parallelism (EP), and arbitrary fine-grained decomposition (AFD), can be dynamically adjusted at runtime.
  • Network Simulator: A packet- and flow-level simulator based on htsim, extended to model large-scale collective operations (all-reduce, all-to-all, reduce-scatter, broadcast) and RPC-style KV-cache transfers. It supports a broad range of network topologies, including static Clos (fat-tree), oversubscribed fat trees, rail-optimized topologies, flat optical patch-panels (TopoOpt), and hybrid electrical packet switch (EPS)–optical circuit switch (OCS) networks as in RFabric. It models per-link latency, bandwidth, and queueing, as well as optical reconfiguration delays at selected network tiers.
  • Control-Plane and Orchestrator Interface: Emulates orchestration logic such as OrchestrRL’s adaptive scheduler and RFabric’s topology proxy, allowing RLSim to process “intents” from the RL pipeline (e.g., requests to switch parallelism or synchronize weights) and drive corresponding network reconfigurations with look-ahead scheduling—overlapping circuit setup times with compute when feasible.
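
As a structural sketch, the decomposition above might be expressed as follows. This is illustrative only: every class, method, and intent name here is an assumption, not RLSim's published C++/Python API.

```python
# Illustrative sketch of RLSim's three-subsystem decomposition.
# All names (ComputeEngine, NetworkSim, ControlPlane, Intent, ...) are
# hypothetical; the published C++/Python API is not reproduced here.
from dataclasses import dataclass
from enum import Enum, auto


class IntentKind(Enum):
    SWITCH_PARALLELISM = auto()  # e.g., change TP/EP degree at runtime
    SYNC_WEIGHTS = auto()        # trainer-to-generator weight transfer
    PHASE_TRANSITION = auto()    # training collectives <-> generation all-to-all


@dataclass
class Intent:
    kind: IntentKind
    payload: dict  # e.g., {"tp": 8} or {"phase": "generation"}


class ComputeEngine:
    """Simulates the two asynchronous RL stages on parameterized PoDs."""

    def step_generation(self, tokens: int) -> float:
        """Return simulated time for token-by-token decoding (stub)."""
        ...

    def step_training(self, layers: int) -> float:
        """Return simulated time for gradient-by-layer computation (stub)."""
        ...


class NetworkSim:
    """Flow/packet-level model of the EPS+OCS fabric (htsim-based in RLSim)."""

    def run_collective(self, op: str, msg_bytes: int, peers: int) -> float:
        ...

    def reconfigure_ocs(self, circuit_plan: dict) -> float:
        ...


class ControlPlane:
    """Resolves pipeline intents into scheduling and reconfiguration actions."""

    def __init__(self, compute: ComputeEngine, net: NetworkSim) -> None:
        self.compute, self.net = compute, net

    def handle(self, intent: Intent) -> None:
        if intent.kind is IntentKind.PHASE_TRANSITION:
            # Look-ahead scheduling: overlap circuit setup with compute
            # whenever the current phase leaves enough slack.
            self.net.reconfigure_ocs(intent.payload["circuit_plan"])
```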

2. Analytical Modeling and Simulation Fidelity

RLSim employs lightweight analytical kernels for compute, communication, and network reconfiguration, calibrated using microbenchmarks from a 48-GPU H800 physical testbed:

  • GPU Compute Model: Per-batch execution time for parallelism mode P is modeled as

T_{\mathrm{compute}}(B, P) = c_1 \frac{B}{P} + c_2

where B is the batch size and c_1, c_2 are fitted parameters.

  • Collective Communication: All-reduce over M bytes among P peers uses

T_{\mathrm{allreduce}}(M, P) = \alpha \log P + \beta \frac{M(P-1)}{P}

with per-round startup \alpha and bandwidth factor \beta = 1/B_{\mathrm{link}}. Other collectives are similarly instantiated.

  • RPC & KV-Cache Transfers: Modeled as

T_{\mathrm{RPC}}(\lambda) = \gamma_0 + \gamma_1 \lambda

for message size \lambda.

  • OCS Reconfiguration: Each device has a fixed switch time T_{\mathrm{ocs}} (e.g., 10 ms for 3D MEMS). Circuit reconfigurations are atomic, with fallback to EPS during transition periods.

All constants are calibrated against testbed measurements. Validation experiments showed simulated throughput within 5% of physical runs and network metrics within 7% across diverse parallelism and sequence-length regimes.
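
A minimal executable sketch of these kernels is shown below. The formulas follow the models above; the numeric constants are placeholders standing in for the testbed-calibrated coefficients (which are not reproduced in this summary), and the base-2 logarithm is an assumption.

```python
import math

# Placeholder constants: in RLSim these are fitted to microbenchmarks from
# the 48-GPU H800 testbed. The values below are illustrative only.
C1, C2 = 1.0e-3, 5.0e-4          # compute-model coefficients (s)
ALPHA = 5.0e-6                   # per-round collective startup (s)
B_LINK = 400e9 / 8               # link bandwidth in bytes/s (400 Gbps link)
BETA = 1.0 / B_LINK              # bandwidth factor (s per byte)
GAMMA0 = 2.0e-5                  # RPC base latency (s)
GAMMA1 = 1.0 / B_LINK            # RPC per-byte cost (s per byte)
T_OCS = 10e-3                    # OCS switch time, e.g. 10 ms (3D MEMS)


def t_compute(batch: float, parallelism: int) -> float:
    """Per-batch execution time: T = c1 * B / P + c2."""
    return C1 * batch / parallelism + C2


def t_allreduce(msg_bytes: float, peers: int) -> float:
    """All-reduce model: T = alpha * log P + beta * M * (P - 1) / P."""
    return ALPHA * math.log2(peers) + BETA * msg_bytes * (peers - 1) / peers


def t_rpc(msg_bytes: float) -> float:
    """RPC / KV-cache transfer: T = gamma0 + gamma1 * lambda."""
    return GAMMA0 + GAMMA1 * msg_bytes


def t_ocs_reconfig() -> float:
    """Atomic circuit switch; traffic falls back to EPS during this window."""
    return T_OCS
```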

3. Integration with OrchestrRL and RFabric

RLSim provides a C++/Python API for tight coupling with the OrchestrRL orchestration logic and RFabric’s hybrid network controller; a sketch of the intent path follows the list:

  • Adaptive Scheduling: At configurable intervals, RLSim invokes the scheduler’s MILP-based planner to select per-instance parallelism, updating GPU allocation and simulating remapping or migration overheads as necessary.
  • Topology Reconfiguration: Phase transitions (e.g., between training collectives and generation all-to-all) trigger “phase intents,” which the control-plane proxy resolves using pre-cached circuit plans. The simulator manages OCS setup by checking for available slack windows, atomically installing new configurations, and mapping flows over the composite EPS+OCS fabric.
  • Workload Injection: RLSim supports user-supplied traces or parametric heavy-tailed models for generation requests and sequence lengths, and includes an embedded ARIMA predictor to mirror online distribution updates for proactive orchestration.
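
The following sketches the phase-intent path described above under look-ahead scheduling. The control flow mirrors the description (pre-cached plans, slack-window check, EPS fallback), but the function and data names are assumptions rather than the actual OrchestrRL/RFabric interfaces.

```python
T_OCS = 10e-3  # OCS reconfiguration delay in seconds (3D MEMS example)

# Pre-cached circuit plans per phase, as resolved by the topology proxy.
# The plan contents here are purely illustrative.
CIRCUIT_PLANS = {
    "training": {"pattern": "ring_allreduce"},
    "generation": {"pattern": "all_to_all"},
}


def install_atomically(plan: dict) -> None:
    """Stand-in for the OCS controller call; flows stay on EPS until done."""
    ...


def on_phase_intent(phase: str, compute_slack_s: float) -> float:
    """Handle a phase intent and return the exposed reconfiguration cost.

    If the remaining compute in the current phase exceeds the OCS switch
    time, circuit setup is fully overlapped (hidden); otherwise only the
    uncovered remainder is charged to the simulated makespan.
    """
    plan = CIRCUIT_PLANS[phase]  # pre-cached, so lookup adds no delay
    exposed = max(0.0, T_OCS - compute_slack_s)
    install_atomically(plan)
    return exposed
```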

4. Configuration, Scaling, and Usability

RLSim is engineered for scalability and configurability:

  • Configuration Parameters: All key architectural and workload features are specified via a single JSON file, including PoD and server populations, per-server GPU counts, link and switch characteristics, OCS type and reconfiguration delay, communication patterns (DP/TP/EP group sizes), RL pipeline ratios, and ARIMA predictor settings (an illustrative configuration follows the table below).
  • Evaluated Scale: RLSim has been exercised on system sizes up to 32,768 GPUs across 512 PoDs and 64 core-layer OCS switches. Scalability is achieved through flow-level aggregation for large collectives, hierarchical abstraction at the PoD level, bucketed request grouping (e.g., 256-token buckets), and simulation engine optimizations such as event batching, parallel discrete-event execution, and lazy network-state updates.
| Parameter Class | Examples | Range/Scale in Studies |
|---|---|---|
| Network Topology | Fat-Tree, TopoOpt, RFabric | up to 64 core OCS switches |
| System Population | PoDs, servers, GPUs | up to 32,768 GPUs |
| Link/OCS Characteristics | Link speeds, OCS switch time | 100/200/400/800 Gbps; 10 ms (3D MEMS) |
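
An illustrative configuration in the spirit of this description appears below. The key names and values are assumptions for exposition; RLSim's actual JSON schema is not documented in this summary.

```json
{
  "topology": "rfabric",
  "pods": 512,
  "servers_per_pod": 8,
  "gpus_per_server": 8,
  "link_gbps": 400,
  "ocs": { "type": "3d_mems", "switch_time_ms": 10, "core_switches": 64 },
  "parallelism": { "dp": 64, "tp": 8, "ep": 4 },
  "rl_pipeline": { "gen_to_train_ratio": 3 },
  "workload": { "trace": null, "seq_len_dist": "heavy_tailed", "token_bucket": 256 },
  "predictor": { "type": "arima", "order": [2, 1, 2] }
}
```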

5. Validation Methodology and Baseline Comparisons

The RLSim validation strategy leverages direct comparison to empirical measurements from a 48-GPU testbed:

  • Metrics Tracked: End-to-end throughput (normalized to non-blocking Fat-Tree = 1.0), per-stage makespans, 95th/99th percentile tail latencies, network cost per GPU (hardware CAPEX), and straggler-effect quantification (max-to-median Gen completion ratio); a short computation sketch follows this list.
  • Baselines Include: Static Clos/Fat-Tree (FT), 3:1 oversubscribed Fat-Tree (FT-OS), Rail-Optimized (RO) interconnects, and TopoOpt flat optical fabrics.
  • Validation Outcomes: Simulated Gen/Train stage times agree within 5% of physical testbed records; network transmission statistics are within 7%. This fidelity is maintained across a range of parallelism regimes and workload injection patterns.
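
As a concrete reading of these metrics, the sketch below computes the normalized throughput, straggler ratio, and relative validation error exactly as defined above; only the function names are invented here.

```python
import statistics


def normalized_throughput(sim_tput: float, fat_tree_tput: float) -> float:
    """Throughput normalized to the non-blocking Fat-Tree baseline (= 1.0)."""
    return sim_tput / fat_tree_tput


def straggler_ratio(gen_completion_times: list[float]) -> float:
    """Max-to-median generation completion ratio used to quantify stragglers."""
    return max(gen_completion_times) / statistics.median(gen_completion_times)


def relative_error(simulated: float, measured: float) -> float:
    """Validation check, e.g. <= 0.05 for throughput, <= 0.07 for network."""
    return abs(simulated - measured) / measured
```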

6. Key Results, Insights, and Limitations

RLSim has enabled the following empirical findings regarding large-scale RL workloads on reconfigurable fabrics:

  • Performance: RFabric achieves 0.98×–1.02× the throughput of an ideal Fat-Tree at a fraction of the capital cost, outperforming FT-OS and TopoOpt (which reach only 0.65–0.75× of FT) in simulation. At the 2048-GPU, 400 Gbps operating point, RFabric comes within 3% of ideal throughput while using 30–40% fewer EPS switches.
  • Cost-Efficiency: Across link speeds (100–800 Gbps), RFabric is consistently Pareto-optimal, delivering ≥2.2× the throughput-per-dollar of FT and RO. The hybrid OCS approach maximizes cost savings at high optical transceiver price points by limiting OCS usage to bulk phases.
  • Straggler Mitigation: Use of OrchestrRL’s balancer in RLSim reduces 99th percentile generation-step tail latency by 30–40% versus static generator assignment.
  • Observed Limitations: Some low-level NIC queuing (such as RoCE flow control) is abstracted, possibly underestimating congestion penalties in heavily oversubscribed EPS modes. The compute model’s linear scaling assumption may underestimate overheads at extreme scale in non-ideal NVLink topologies.
  • Planned Extensions: Future development directions include detailed RDMA-transport modeling, explicit mixture-of-experts (MoE) traffic simulation, integration of power/thermal models, and closer actor-agent co-simulation.

Overall, RLSim provides a validated, high-fidelity platform for exploring the interplay of workload dynamics, compute orchestration, and reconfigurable interconnects in disaggregated RL. It enables rigorous, reproducible exploration of cost-performance trade-offs at scales beyond current physical deployments, facilitating data-driven architectural decision-making for large-scale RL infrastructure (Tan et al., 3 Jan 2026).
