ScaleRL: Scalable RL Fine-Tuning for LLMs
- The ScaleRL recipe is a scalable reinforcement learning framework that employs a sigmoidal compute–performance model to predict and optimize fine-tuning outcomes for large language models.
- It integrates empirically validated design choices—such as CISPO loss, FP32 LM head, prompt filtering, and batch-level normalization—to boost asymptotic reward and compute efficiency.
- The framework enables accurate extrapolation from small-scale GPU runs to large training budgets, ensuring reliable performance across diverse RL fine-tuning regimes.
The ScaleRL Recipe is a rigorously engineered, empirically validated approach for scalable reinforcement learning (RL) applied to the fine-tuning of LLMs. Designed to address the predictability and compute efficiency of RL training, ScaleRL integrates a principled scaling framework—using a sigmoidal compute–performance model—with a specific combination of design choices in RL algorithm construction. This synthesis enables practitioners to forecast performance outcomes from small-scale runs, optimize compute utilization, and maximize asymptotic LLM performance in RL-based fine-tuning regimes (Khatri et al., 15 Oct 2025).
1. Predictive Compute–Performance Framework
At the core of the ScaleRL recipe is a formal compute–performance scaling law, modeled as a sigmoidal function. The model relates the expected validation reward (e.g., pass rate) $R_C$ to the cumulative RL compute budget $C$ as:

$$R_C = R_0 + \frac{A - R_0}{1 + \left(C_{\text{mid}}/C\right)^{B}}$$

where:
- $R_0$ is the initial reward (pre-RL or at minimal compute),
- $A$ (with $A > R_0$) is the asymptotic reward as compute saturates,
- $B$ is a scaling exponent controlling the sharpness and speed of improvement,
- $C_{\text{mid}}$ is the compute level at which half of the attainable improvement is realized.
This sigmoidal law outperforms traditional power-law scaling curves by accurately capturing (i) the initial slow regime, (ii) a predictable region of rapid progress, and (iii) an inevitable asymptotic plateau as further compute fails to yield major gains. In the high-compute regime ($C \gg C_{\text{mid}}$), the curve locally approximates a power law, with the gap to the asymptote decaying as:

$$A - R_C \approx (A - R_0)\left(\frac{C_{\text{mid}}}{C}\right)^{B} \propto C^{-B}$$
This enables precise extrapolation from small-scale experimental results to large compute budgets, a critical aspect for planning multi-stage or large-scale RL fine-tuning.
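To make the functional form concrete, the following minimal Python sketch implements the sigmoidal compute–performance curve and its high-compute power-law approximation. It is an illustration rather than code from the paper, and the parameter values are hypothetical.

```python
import numpy as np

def sigmoidal_reward(C, R0, A, B, C_mid):
    """Sigmoidal compute-performance curve: reward as a function of RL compute C."""
    return R0 + (A - R0) / (1.0 + (C_mid / C) ** B)

def power_law_gap(C, R0, A, B, C_mid):
    """High-compute approximation: the gap to the asymptote decays as C^(-B)."""
    return (A - R0) * (C_mid / C) ** B

# Hypothetical parameters: initial pass rate 0.30, asymptote 0.61,
# exponent 1.0, half-improvement point at 10k GPU-hours.
R0, A, B, C_mid = 0.30, 0.61, 1.0, 10_000.0

for C in [1_000, 5_000, 10_000, 50_000, 100_000]:  # GPU-hours
    exact = sigmoidal_reward(C, R0, A, B, C_mid)
    approx = A - power_law_gap(C, R0, A, B, C_mid)  # valid when C >> C_mid
    print(f"C={C:>7,} GPU-h  reward={exact:.3f}  power-law approx={approx:.3f}")
```

At $C = C_{\text{mid}}$ the curve sits exactly halfway between $R_0$ and $A$, and for budgets well past $C_{\text{mid}}$ the power-law approximation tracks the exact curve closely.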
2. Empirical Analysis of RL Design Choices
Extensive ablation studies, covering over 400,000 GPU-hours, demonstrate that individual RL design decisions modulate both the asymptotic reward $A$ and the compute-efficiency parameters $(B, C_{\text{mid}})$:
- Loss Function: Switching from a standard DAPO-style clipped objective to the truncated importance-sampling (CISPO) loss increases the asymptotic reward $A$ and stabilizes training.
- Precision in Final Layer: Using FP32 precision in the LM (language modeling) head raises $A$ (e.g., from 0.52 to 0.61) by correcting numerical mismatches between generator and trainer; a minimal PyTorch sketch appears below.
- Loss Aggregation and Normalization: Prompt-level (rather than generation-level) loss aggregation, along with batch-level advantage normalization, improves compute efficiency (reflected in $B$ and $C_{\text{mid}}$) and makes training more stable over long runs.
- Prompt Filtering and Stopping: Adaptive prompt sampling (no-positive-resampling), zero-variance filtering, and forced length interruptions (via explicit termination phrases) ensure robust efficiency without instability from degenerate sample recycling or runaway completions.
Key observation: While many modifications primarily improve efficiency (higher $B$, lower $C_{\text{mid}}$), some (e.g., CISPO, FP32 head) are essential for reaching the highest asymptotic reward $A$, and all are cumulatively important for robust scalability.
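As a concrete illustration of the FP32 LM-head point above, the PyTorch sketch below keeps the final vocabulary projection in full precision while the backbone runs in bfloat16, so that trainer-side logits and log-probabilities stay numerically close to the generator's. This is a hedged sketch, not the paper's implementation; the class name, sizes, and usage are assumptions.

```python
import torch
import torch.nn as nn

class FP32LMHead(nn.Module):
    """Final vocabulary projection kept in float32 under mixed-precision training."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        # Weights stay in fp32 even though the backbone may be cast to bf16.
        self.proj = nn.Linear(hidden_size, vocab_size, bias=False).float()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Upcast activations before the projection; logits (and hence
        # log-probs) are computed entirely in fp32.
        return self.proj(hidden_states.float())

# Hypothetical usage with a bf16 backbone:
hidden_size, vocab_size = 4096, 128_256
head = FP32LMHead(hidden_size, vocab_size)
hidden = torch.randn(2, 16, hidden_size, dtype=torch.bfloat16)
logits = head(hidden)
assert logits.dtype == torch.float32
```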
3. The ScaleRL Recipe: Composition and Implementation
Synthesizing the empirical findings, the ScaleRL recipe features:
- Asynchronous Off-Policy PipelineRL: Adopts a streaming, PipelineRL-style generator–trainer setup with bounded off-policyness, maximizing GPU utilization and minimizing system idleness.
- Truncation for Length Control: Enforces termination using fixed phrases rather than penalizing completion length, averting excessive or degenerate outputs.
- CISPO Loss Formulation: The RL objective uses the CISPO loss, which applies truncated importance-sampling ratios with a token-level stop-gradient and aggregates the loss at the prompt level (a simplified sketch appears below).
- Batch-level Advantage Normalization: Normalizes per-batch advantages to ensure stable credit assignment without disrupting inter-prompt gradients.
- FP32 LM Head: Maintains high numerical fidelity in the last model layer, preventing gradient drift as observed in mixed or low-precision settings.
- Sample and Prompt Filtering: Removes zero-variance samples and applies adaptive prompt selection to focus on informative batches, further boosting compute efficiency.
This integration ensures stable scaling dynamics and aligns the practical recipe with the theoretically grounded scaling law.
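The PyTorch sketch below illustrates, under stated assumptions, how several of the pieces above can fit together: truncated importance-sampling ratios with a stop-gradient, batch-level advantage normalization, and prompt-level loss aggregation. It is a simplified illustration of a CISPO-style objective as described in this section, not the authors' implementation; tensor names and the clipping threshold are assumptions.

```python
import torch

def cispo_style_loss(
    logprobs: torch.Tensor,      # [B, T] token log-probs under the trainer policy
    old_logprobs: torch.Tensor,  # [B, T] token log-probs under the generator policy
    rewards: torch.Tensor,       # [B]    scalar reward per generation
    prompt_ids: torch.Tensor,    # [B]    index of the prompt each generation belongs to
    mask: torch.Tensor,          # [B, T] 1 for generated tokens, 0 for padding
    clip_max: float = 4.0,       # assumed truncation threshold for the IS ratio
) -> torch.Tensor:
    mask = mask.float()
    num_prompts = int(prompt_ids.max().item()) + 1
    counts = torch.zeros(num_prompts).index_add(0, prompt_ids, torch.ones_like(rewards))

    # Advantage: subtract each prompt's mean reward, then normalize by the
    # standard deviation over the whole batch (batch-level normalization).
    prompt_mean = torch.zeros(num_prompts).index_add(0, prompt_ids, rewards) / counts
    adv = rewards - prompt_mean[prompt_ids]
    adv = adv / (adv.std() + 1e-6)

    # Truncated importance-sampling ratio with a token-level stop-gradient:
    # the clipped ratio only re-weights the REINFORCE term and is not
    # differentiated through; gradients flow via the log-probs.
    ratio = torch.exp(logprobs - old_logprobs).clamp(max=clip_max).detach()
    per_token = -ratio * adv.unsqueeze(1) * logprobs * mask              # [B, T]

    # Prompt-level aggregation: average token losses over all generations of a
    # prompt, then average across prompts.
    loss_per_prompt = torch.zeros(num_prompts).index_add(0, prompt_ids, per_token.sum(dim=1))
    toks_per_prompt = torch.zeros(num_prompts).index_add(0, prompt_ids, mask.sum(dim=1))
    return (loss_per_prompt / toks_per_prompt.clamp(min=1.0)).mean()
```

In practice, zero-variance prompts (those whose generations all receive the same reward and therefore contribute no gradient) would be filtered out before this loss is computed, in line with the sample- and prompt-filtering components listed above.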
4. Extrapolation and Predictability from Small-Scale Runs
ScaleRL's main operational advantage is the ability to interpolate and extrapolate compute requirements and anticipated performance from small, inexpensive pilot runs. The methodology fits the sigmoidal curve to results in the 1.5k–8k GPU-hour range, then robustly predicts the trajectory toward 100k GPU-hours and beyond.
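As an illustration of this workflow, the sketch below fits the sigmoidal curve to hypothetical small-scale measurements with `scipy.optimize.curve_fit` and then queries the fitted model at a larger budget. The data points, bounds, and initial guesses are assumptions, not values from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoidal_reward(C, R0, A, B, C_mid):
    """Sigmoidal compute-performance curve (same form as in Section 1)."""
    return R0 + (A - R0) / (1.0 + (C_mid / C) ** B)

# Hypothetical pilot-run measurements in the 1.5k-8k GPU-hour range.
compute = np.array([1_500, 2_500, 4_000, 6_000, 8_000], dtype=float)  # GPU-hours
reward = np.array([0.34, 0.36, 0.40, 0.43, 0.46])                     # mean pass rate

# Fit (R0, A, B, C_mid) with loose bounds; p0 is an initial guess.
params, _ = curve_fit(
    sigmoidal_reward, compute, reward,
    p0=[0.3, 0.6, 1.0, 10_000.0],
    bounds=([0.0, 0.0, 0.1, 100.0], [1.0, 1.0, 5.0, 1e6]),
    maxfev=10_000,
)
R0, A, B, C_mid = params
print(f"fitted: R0={R0:.3f}  A={A:.3f}  B={B:.2f}  C_mid={C_mid:,.0f} GPU-h")

# Extrapolate to a much larger budget before committing the compute.
print(f"predicted reward at 100k GPU-hours: {sigmoidal_reward(100_000, *params):.3f}")
```

With so few pilot points the fit is only loosely constrained, which is why the paper emphasizes fitting over a range of small budgets before trusting the extrapolation.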
Multiple case studies illustrate that the fitted scaling curve closely tracks the entire training arc, even when extrapolating well beyond the fitted compute range. This predictive property holds for both dense (e.g., 8B) and mixture-of-experts (MoE, e.g., 17B×16) LLMs, and in both single-task and multitask RL fine-tuning scenarios (e.g., mixed math/code).
5. Comparative Performance and Empirical Endorsement
Evaluation against alternative RL recipes (such as DeepSeek/GRPO, DAPO, Magistral, MiniMax-M1) consistently shows that ScaleRL achieves both higher asymptotic reward and superior compute efficiency. Performance metrics, such as pass rate and sample efficiency, reveal:
- Models using ScaleRL converge faster and achieve a higher maximum reward.
- Generalization across different batch sizes, generation lengths (up to 32k tokens), and realistic multitask RL workloads holds, with stable scaling behavior.
Empirical evidence confirms that “not all recipes yield similar asymptotic performance,” with the ScaleRL combination reliably pushing the practical upper bound.
6. Limitations, Implications, and Future Directions
Analysis shows that while modifications to normalization, curriculum, off-policy depth, and sampling chiefly affect compute efficiency (reflected in $B$ and $C_{\text{mid}}$), a subset of design choices is indispensable for reaching the theoretical asymptote $A$. Even seemingly minor implementation details (e.g., numerical precision, sample filtering) can impact stability in extended runs, particularly as compute budgets extend into the tens of thousands of GPU-hours and beyond.
The ScaleRL framework provides a rigorous paradigm for analytic and practical planning of RL scaling in LLMs, analogous to the pre-training scaling laws that underpin large-scale language modeling. It further enables systematic evaluation of algorithmic innovations (by quantifying their impact on $A$, $B$, and $C_{\text{mid}}$) and operationalizes rapid scientific iteration by collapsing the experiment–forecast loop.
A plausible implication is that this framework will accelerate algorithmic RL development, enable more predictable budgeting of large-scale experiments, and create a benchmark for the adoption of new RL approaches in LLM fine-tuning.
7. Summary Table of Key Components
| Component | Impact on Scaling | Role in ScaleRL Recipe |
|---|---|---|
| Sigmoidal scaling law | Predicts $A$, $B$, $C_{\text{mid}}$ | Enables early extrapolation |
| CISPO loss | Boosts $A$, stabilizes training | Default loss for robust scaling |
| FP32 LM head | Increases $A$ | Mandated for numerical consistency |
| Prompt-level aggregation | Improves compute efficiency | Default for efficiency |
| Batch-level normalization | Improves compute efficiency | Stabilizes long runs |
| Forced length interruptions | Stabilizes scaling | Prevents completion drift |
| Zero-variance/adaptive filtering | Improves compute efficiency | Focuses compute on informative samples |
Each ingredient is verified as essential for achieving both high compute efficiency and reliable large-scale asymptotic performance in RL-based LLM training (Khatri et al., 15 Oct 2025).