
ScaleRL: Scalable RL Fine-Tuning for LLMs

Updated 17 October 2025
  • ScaleRL Recipe is a scalable reinforcement learning framework that employs a sigmoidal compute–performance model to predict and optimize fine-tuning outcomes for large language models.
  • It integrates empirically validated design choices—such as CISPO loss, FP32 LM head, prompt filtering, and batch-level normalization—to boost asymptotic reward and compute efficiency.
  • The framework enables accurate extrapolation from small-scale GPU runs to large training budgets, ensuring reliable performance across diverse RL fine-tuning regimes.

The ScaleRL Recipe is a rigorously engineered, empirically validated approach for scalable reinforcement learning (RL) applied to the fine-tuning of LLMs. Designed to address the predictability and compute efficiency of RL training, ScaleRL integrates a principled scaling framework—using a sigmoidal compute–performance model—with a specific combination of design choices in RL algorithm construction. This synthesis enables practitioners to forecast performance outcomes from small-scale runs, optimize compute utilization, and maximize asymptotic LLM performance in RL-based fine-tuning regimes (Khatri et al., 15 Oct 2025).

1. Predictive Compute–Performance Framework

At the core of the ScaleRL recipe is a formal compute–performance scaling law, modeled as a sigmoidal function. The model relates the validation reward $R_C$ (e.g., pass rate) to the cumulative RL compute budget $C$ as:

$$R_C = R_0 + \frac{A - R_0}{1 + \left(\frac{C_{mid}}{C}\right)^B}$$

where:

  • $R_0$ is the initial reward (pre-RL or at minimal compute),
  • $A$ (with $0 \leq A \leq 1$) is the asymptotic reward as compute saturates,
  • $B > 0$ is a scaling exponent controlling the sharpness and speed of improvement,
  • $C_{mid}$ is the compute level at which half of the attainable improvement $(A - R_0)$ is realized.

This sigmoidal law outperforms traditional power-law scaling curves by accurately capturing (i) the initial slow regime, (ii) a predictable region of rapid progress, and (iii) an inevitable asymptotic plateau as further compute fails to yield major gains. In the high-compute regime, the curve locally approximates a power-law:

$$R_C \approx A - \frac{D}{C^B}, \quad D = (A - R_0) \cdot C_{mid}^B$$

This enables precise extrapolation from small-scale experimental results to large compute budgets, a critical aspect for planning multi-stage or large-scale RL fine-tuning.
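
To make the shape of this law concrete, the following Python sketch evaluates the sigmoidal curve for an illustrative choice of parameters and checks numerically that it approaches the stated power-law form once $C \gg C_{mid}$. All parameter values are invented for illustration and are not taken from the paper.

```python
import numpy as np

def sigmoidal_reward(C, R0, A, B, C_mid):
    """Sigmoidal compute-performance law: R_C = R0 + (A - R0) / (1 + (C_mid / C)^B)."""
    return R0 + (A - R0) / (1.0 + (C_mid / C) ** B)

# Illustrative parameters (not fitted values from the paper); compute in GPU-hours.
R0, A, B, C_mid = 0.30, 0.62, 0.9, 5_000.0

C = np.array([1e3, 5e3, 2e4, 1e5, 1e6])
R = sigmoidal_reward(C, R0, A, B, C_mid)

# In the high-compute regime the curve approaches A - D / C^B with D = (A - R0) * C_mid^B.
D = (A - R0) * C_mid ** B
R_power_law = A - D / C ** B

print(np.round(R, 4))
print(np.round(R_power_law, 4))  # agrees with R once C >> C_mid
```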

2. Empirical Analysis of RL Design Choices

Extensive ablation studies, covering over 400,000 GPU-hours, demonstrate that individual RL design decisions modulate both the asymptotic reward $A$ and the compute efficiency $B$:

  • Loss Function: Switching from the DAPO loss to the truncated importance-sampling (CISPO) loss increases $A$ and stabilizes training.
  • Precision in Final Layer: Using FP32 precision in the LM (language modeling) head amplifies $A$ (e.g., from 0.52 to 0.61) by correcting numerical misalignments between generator and trainer steps.
  • Loss Aggregation and Normalization: Prompt-level (rather than generation-level) loss aggregation, along with batch-level advantage normalization, improves $B$ (more efficient compute usage) and makes training more stable over long runs.
  • Prompt Filtering and Stopping: Adaptive prompt sampling (no-positive-resampling), zero-variance filtering, and forced length interruptions (via explicit termination phrases) ensure robust efficiency without instability from degenerate sample recycling or runaway completions.

Key observation: While many modifications primarily improve efficiency (higher $B$, lower $C_{mid}$), some (e.g., CISPO, FP32 head) are essential for reaching the highest $A$, and all are cumulatively important for robust scalability.
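
As a concrete illustration of how one such implementation detail enters the training loop, the sketch below keeps the LM head in FP32 while the backbone runs in lower precision. It is a minimal sketch assuming a PyTorch setup; the class name and interface are hypothetical, not the authors' code.

```python
import torch
import torch.nn as nn

class FP32LMHead(nn.Module):
    """Final vocabulary projection kept in full precision (illustrative sketch).

    The backbone may run in bf16/fp16, but the logits used for sampling and for
    the RL loss are computed with fp32 weights and fp32 activations, keeping the
    generator's token distribution numerically aligned with the trainer's
    log-probabilities.
    """

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        # Weights are stored and updated in float32.
        self.proj = nn.Linear(hidden_size, vocab_size, bias=False).float()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Upcast (possibly bf16) hidden states and disable autocast so the
        # matmul itself runs in float32.
        with torch.autocast(device_type=hidden_states.device.type, enabled=False):
            return self.proj(hidden_states.float())
```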

3. The ScaleRL Recipe: Composition and Implementation

Synthesizing the empirical findings, the ScaleRL recipe features:

  • Asynchronous Off-Policy PipelineRL: Adopts a streaming RL pipeline (e.g., $k = 8$ off-policyness) for maximizing GPU utilization and minimizing system idleness.
  • Truncation for Length Control: Enforces termination using fixed phrases rather than penalizing completion length, averting excessive or degenerate outputs.
  • CISPO Loss Formulation: RL objective uses the CISPO loss, which aggregates over prompts after applying truncated importance ratios and incorporates token-level stop-gradient operations.
  • Batch-level Advantage Normalization: Normalizes per-batch advantages to ensure stable credit assignment without disrupting inter-prompt gradients.
  • FP32 LM Head: Maintains high numerical fidelity in the last model layer, preventing gradient drift as observed in mixed or low-precision settings.
  • Sample and Prompt Filtering: Removes zero-variance samples and applies adaptive prompt selection to focus on informative batches, further boosting compute efficiency.

This integration ensures stable scaling dynamics and aligns the practical recipe with the theoretically grounded scaling law.
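
The following PyTorch sketch shows one plausible way the loss-related ingredients above could compose into a single objective. Tensor shapes, thresholds, and helper logic are assumptions for illustration; this is not the authors' implementation of CISPO or ScaleRL.

```python
import torch

def cispo_style_loss(logp_new, logp_old, rewards, prompt_ids, mask, clip_max=4.0):
    """Schematic CISPO-style objective (illustrative only).

    logp_new:   (N, T) log-probs of sampled tokens under the current policy
    logp_old:   (N, T) log-probs under the behaviour (generator) policy
    rewards:    (N,)   scalar reward per sampled completion
    prompt_ids: (N,)   index of the prompt each completion belongs to
    mask:       (N, T) 1.0 for completion tokens, 0.0 for padding
    """
    # Batch-level advantage normalization: center and scale rewards over the whole batch.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Zero-variance filtering: drop prompts whose completions all received the same reward.
    keep = torch.ones_like(adv, dtype=torch.bool)
    for p in prompt_ids.unique():
        rows = prompt_ids == p
        if rewards[rows].std() < 1e-8:
            keep[rows] = False

    # Truncated importance ratios with a stop-gradient (detach): gradients flow
    # only through logp_new, REINFORCE-style, even for truncated tokens.
    ratio = torch.exp(logp_new - logp_old).clamp(max=clip_max).detach()
    per_token = -ratio * adv[:, None] * logp_new * mask

    # Prompt-level aggregation: average token losses within each kept prompt,
    # then average across prompts.
    prompt_losses = []
    for p in prompt_ids.unique():
        rows = (prompt_ids == p) & keep
        if rows.any():
            prompt_losses.append(per_token[rows].sum() / mask[rows].sum().clamp(min=1.0))
    if not prompt_losses:
        return logp_new.sum() * 0.0  # every prompt in the batch was filtered out
    return torch.stack(prompt_losses).mean()
```

Here the clamp-and-detach step stands in for the truncated importance ratio, and the two-stage averaging realizes prompt-level aggregation; the forced-length and asynchronous-pipeline ingredients sit outside the loss and are omitted.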

4. Extrapolation and Predictability from Small-Scale Runs

ScaleRL's main operational advantage is the ability to interpolate and extrapolate compute requirements and anticipated performance from small, inexpensive pilot runs. The methodology requires fitting the sigmoidal curve to results in the 1.5k–8k GPU-hour range, then robustly predicting the trajectory toward 100k GPU-hours and beyond.
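
A minimal sketch of what such a pilot-window fit might look like with SciPy, using invented measurements; the data points, initial guesses, and fitted values below are purely illustrative and carry no results from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoidal_reward(C, R0, A, B, C_mid):
    # R_C = R0 + (A - R0) / (1 + (C_mid / C)^B)
    return R0 + (A - R0) / (1.0 + (C_mid / C) ** B)

# Hypothetical pilot-run measurements (GPU-hours, validation pass rate).
C_pilot = np.array([1500.0, 2500.0, 4000.0, 6000.0, 8000.0])
R_pilot = np.array([0.38, 0.41, 0.44, 0.47, 0.49])

# Fit the four parameters on the small-compute window, constraining A to [0, 1]
# and keeping B and C_mid positive.
popt, _ = curve_fit(
    sigmoidal_reward, C_pilot, R_pilot,
    p0=[0.3, 0.6, 1.0, 5000.0],
    bounds=([0.0, 0.0, 1e-3, 1e2], [1.0, 1.0, 10.0, 1e6]),
)

print(dict(zip(["R0", "A", "B", "C_mid"], np.round(popt, 3))))
print("predicted reward at 100k GPU-hours:", round(float(sigmoidal_reward(1e5, *popt)), 3))
```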

Multiple case studies illustrate that the fitted scaling curve closely tracks the entire training arc—even when projecting several orders of magnitude in compute. This predictive property holds for both dense (e.g., 8B) and mixture-of-expert (MoE, e.g., 17B×16) LLMs, and in both single-task and multitask RL fine-tuning scenarios (e.g., mixed math/code).

5. Comparative Performance and Empirical Endorsement

Evaluation against alternative RL recipes (such as DeepSeek/GRPO, DAPO, Magistral, MiniMax-M1) consistently shows that ScaleRL achieves both higher asymptotic reward and superior compute efficiency. Performance metrics, such as pass rate and sample efficiency, reveal:

  • Models using ScaleRL converge faster and achieve a higher maximum reward.
  • Performance generalizes across different batch sizes, generation lengths (up to 32k tokens), and realistic multitask RL workloads, with stable scaling behavior.

Empirical evidence confirms that “not all recipes yield similar asymptotic performance,” with the ScaleRL combination reliably pushing the practical upper bound.

6. Limitations, Implications, and Future Directions

Analysis shows that while modifications to normalization, curriculum, off-policy depth, and sampling chiefly affect compute efficiency (reflected in $B$ and $C_{mid}$), a subset of design choices are indispensable for reaching the theoretical asymptote $A$. Even seemingly minor implementation details (e.g., numerical precision, sample filtering) can impact stability in extended runs, particularly as compute budgets enter the >100k GPU-hour regime.

The ScaleRL framework provides a rigorous paradigm for analytic and practical planning of RL scaling in LLMs, analogous to the pre-training scaling laws that underpin large-scale language modeling. It further enables systematic evaluation of algorithmic innovations, by quantifying their impact on $A$ and $B$, and operationalizes rapid scientific iteration by collapsing the experiment–forecast loop.

A plausible implication is that this framework will accelerate algorithmic RL development, enable more predictable budgeting of large-scale experiments, and create a benchmark for the adoption of new RL approaches in LLM fine-tuning.

7. Summary Table of Key Components

| Component | Impact on Scaling | Role in ScaleRL Recipe |
|---|---|---|
| Sigmoidal scaling law | Predicts $A$, $B$ | Enables early extrapolation |
| CISPO loss | Boosts $A$, stabilizes training | Default loss for robust scaling |
| FP32 LM head | Increases $A$ | Mandated for numerical consistency |
| Prompt-level aggregation | Increases $B$ | Default for efficiency |
| Batch-level normalization | Increases $B$ | Stabilizes long runs |
| Forced length interruptions | Stabilizes scaling | Prevents completion drift |
| Zero-variance/adaptive filtering | Improves $B$ | Focuses compute on informative samples |

Each ingredient is verified as essential for achieving both high compute efficiency and reliable large-scale asymptotic performance in RL-based LLM training (Khatri et al., 15 Oct 2025).

References (1)