
Compute-Optimal Scaling in Deep RL

Updated 5 November 2025
  • Compute-optimal scaling strategies are methods for optimally allocating finite computational resources among model capacity, batch size, and update-to-data ratio in deep reinforcement learning.
  • They use empirical scaling laws and power-law relationships to balance tradeoffs between model size and UTD, enhancing sample efficiency under fixed compute budgets.
  • The framework emphasizes mitigating TD-overfitting by adjusting batch size and model capacity in response to the quality of the dynamically generated temporal-difference targets.

Compute-optimal scaling strategies define how to allocate finite computational resources between different scaling axes—such as model capacity, batch size, and update-to-data (UTD) ratio—in value-based deep reinforcement learning (RL), with the goal of maximizing sample efficiency or minimizing the time to a target return. These strategies are fundamentally distinct from supervised learning analogues due to the unique structure of temporal-difference (TD) learning, where targets are dynamically generated and coupled to the current Q-network. The following sections synthesize the empirical findings, theoretical laws, and practical recommendations for compute-optimal scaling in value-based deep RL, as established in the study of model and UTD scaling (Fu et al., 20 Aug 2025).

1. Scaling Axes and the Compute-Optimal Allocation Problem

Value-based deep RL admits two principal axes for compute allocation:

  • Model Capacity ($N$): The parameter count or width of the Q-network.
  • UTD Ratio ($\sigma$): The number of Q-network updates per new environment or data step ("updates-to-data").

Given a fixed compute (FLOPs) budget, the central question is how to partition resources between $N$ and $\sigma$ to achieve the fastest learning or the best asymptotic return per sample and per unit compute.

Compute cost to reach a return level $J$ is given as:

$$\mathcal{C}_J(\sigma, N) \propto \sigma \cdot N \cdot \mathcal{D}_J(\sigma, N)$$

where $\mathcal{D}_J(\sigma, N)$ is the amount of data required to reach $J$ at the given settings.
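To make the allocation problem concrete, here is a minimal sketch of this identity in Python; the function name and the toy numbers are illustrative assumptions, not measurements from the paper:

```python
# Minimal sketch of the compute identity C_J ∝ σ · N · D_J(σ, N).
# The function and the toy numbers below are illustrative assumptions,
# not values from Fu et al. (2025).

def compute_to_target(sigma: float, n_params: float, data_to_target: float) -> float:
    """Relative training FLOPs needed to reach return level J."""
    return sigma * n_params * data_to_target

# Two hypothetical allocations that reach the same return J:
small_hard = compute_to_target(sigma=8, n_params=1e6, data_to_target=5e5)    # small model, high UTD
large_gentle = compute_to_target(sigma=2, n_params=8e6, data_to_target=2e5)  # large model, low UTD
print(f"small/high-UTD: {small_hard:.2e}  large/low-UTD: {large_gentle:.2e}")
```

The same return level can thus cost very different amounts of compute depending on how $N$ and $\sigma$ are traded off.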

2. Empirical Structure: Model Size, Batch Size, and UTD Interactions

Distinct Scaling Regimes:

  • Small models: Best performance is obtained with small batch sizes; aggressive batch scaling causes overfitting.
  • Large models: With increased capacity, larger batch sizes do not degrade, and often improve, Q-value generalization.
  • Effect of UTD: At constant model size, a higher UTD ratio $\sigma$ (more updates per data step) reduces the maximum safe batch size.

TD-Overfitting:

  • Definition: For small models, increasing batch size reduces training TD error but increases validation TD error, indicating overfitting to low-quality, unstable TD targets.
  • Unlike supervised learning, where targets are stationary, TD learning generates its own targets, whose quality is limited by the model's capacity; large-batch TD updates exacerbate this dynamic for small models.
  • Resolution: Increase model size, which improves the quality of computed TD targets, allowing larger batch sizes without overfitting.

Maximal Safe Batch Size Formula:

$$\tilde{B}(\sigma, N) \approx \frac{a_B}{\sigma^{\alpha_B} + b_B \cdot \sigma^{\alpha_B} N^{-\beta_B}}$$

where $a_B$, $b_B$, $\alpha_B$, $\beta_B > 0$ are empirical constants. The formula encodes that batch size tolerance grows with larger $N$ and shrinks with higher $\sigma$.
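Transcribed into code, with placeholder constants standing in for the domain-specific fits ($a_B$, $b_B$, $\alpha_B$, $\beta_B$ must be estimated empirically), the rule might look like:

```python
# Maximal safe batch size B~(σ, N), a direct transcription of the fit above.
# All four constants are placeholders, not fitted values from the paper.

def max_safe_batch_size(sigma: float, n_params: float,
                        a_B: float = 4096.0, b_B: float = 1e3,
                        alpha_B: float = 0.5, beta_B: float = 0.5) -> float:
    return a_B / (sigma**alpha_B + b_B * sigma**alpha_B * n_params**(-beta_B))

# Tolerance grows with model size and shrinks with the UTD ratio:
for n_params in (1e5, 1e6, 1e7):
    row = [round(max_safe_batch_size(s, n_params)) for s in (1, 4, 16)]
    print(f"N={n_params:.0e}: B~ at sigma=1,4,16 -> {row}")
```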

3. Empirical Scaling Laws for Compute and Sample Efficiency

Data Efficiency Fit:

$$\mathcal{D}_J(\sigma, N) \approx D_J^{\min} + a_J \sigma^{\alpha_J} + b_J N^{-\beta_J}$$

with all coefficients positive.
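A sketch of how such a fit could be obtained with nonlinear least squares; the "measurements" here are synthetic, generated from assumed constants purely to demonstrate the mechanics:

```python
# Fitting D_J(σ, N) ≈ D_min + a_J σ^α_J + b_J N^(-β_J) to sweep results.
# The data below are synthetic, drawn from assumed ground-truth constants;
# real fits would use measured (σ, N, D_J) triples from training runs.
import numpy as np
from scipy.optimize import curve_fit

def data_efficiency(x, d_min, a_J, alpha_J, b_J, beta_J):
    sigma, n_params = x
    return d_min + a_J * sigma**alpha_J + b_J * n_params**(-beta_J)

rng = np.random.default_rng(0)
sigma = rng.uniform(1, 16, size=200)
n_params = rng.uniform(1e5, 1e7, size=200)
true = (1e5, 2e4, 0.7, 5e8, 0.6)  # assumed ground truth for the demo
d_obs = data_efficiency((sigma, n_params), *true) * rng.normal(1, 0.02, 200)

popt, _ = curve_fit(data_efficiency, (sigma, n_params), d_obs,
                    p0=(1e5, 1e4, 1.0, 1e8, 0.5), maxfev=20_000)
print("fitted (D_min, a_J, alpha_J, b_J, beta_J):", popt)
```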

Implications:

  • $\mathcal{D}_J$ increases with UTD and decreases with $N$, but with diminishing returns.
  • At fixed compute or data, the optimal frontier lies at predictable power-law tradeoffs between $N$ and $\sigma$.

Optimal Scaling under a Fixed Data Budget ($D_0$):

$$\sigma^*(D_0) \sim (D_0 - D^{\min})^{1/\alpha_J}, \qquad N^*(D_0) \sim (D_0 - D^{\min})^{1/\beta_J}$$

When considering both data and compute costs (budget $F_0 = C + \delta D$, with $\delta$ the cost of data),

$$\sigma^*_F(F_0) = a_F F_0^{\alpha_F}, \qquad N^*_F(F_0) = b_F F_0^{\beta_F}$$

where all parameters are fit on empirical data.
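Once fitted, the compute-optimal configuration for any budget is a direct evaluation of these power laws; a sketch with hypothetical placeholder constants:

```python
# Compute-optimal UTD and model size as power laws of the total budget F_0.
# The four constants are hypothetical placeholders for domain-specific fits.

def optimal_allocation(f_0: float, a_F: float = 0.05, alpha_F: float = 0.2,
                       b_F: float = 20.0, beta_F: float = 0.4) -> tuple[float, float]:
    return a_F * f_0**alpha_F, b_F * f_0**beta_F  # (sigma*, N*)

for budget in (1e12, 1e14, 1e16):  # illustrative FLOPs-scale budgets
    sigma_star, n_star = optimal_allocation(budget)
    print(f"F0={budget:.0e}: sigma* ~ {sigma_star:.1f}, N* ~ {n_star:.2e}")
```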

Computational Cost Summary Table

| Quantity | Formula | Interpretation |
|---|---|---|
| Compute to reach $J$ | $\mathcal{C}_J(\sigma, N) \propto \sigma N \mathcal{D}_J(\sigma, N)$ | Total necessary training FLOPs |
| Data efficiency | $\mathcal{D}_J(\sigma, N) \approx D_J^{\min} + a_J \sigma^{\alpha_J} + b_J N^{-\beta_J}$ | Data to reach $J$ as a function of $N$ and $\sigma$ |
| Maximal safe batch size | $\tilde{B}(\sigma, N) \approx \frac{a_B}{\sigma^{\alpha_B} + b_B \cdot \sigma^{\alpha_B} N^{-\beta_B}}$ | Largest batch size that avoids TD-overfitting |

4. Diagnosing and Managing TD-Overfitting

Mechanism:

  • Overfitting in TD learning does not result primarily from data memorization, but from fitting poor-quality TD targets, which are themselves produced by a capacity-limited, underconstrained Q-network.
  • Passive Critic Experiment: Training a separate network to match the TD targets generated by the Q-network shows that generalization is limited primarily by target quality, not merely by memorization of the sampled data.

Countermeasures:

  • Monitor validation (not just training) TD error as batch size and UTD increase; a minimal monitoring sketch follows this list.
  • For small models, reduce the batch size as UTD increases; consider increasing model size to unlock larger safe batch sizes.
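The sketch below illustrates the first countermeasure, assuming a toy linear Q-function and random stand-in transitions; in a real agent the Q-values would come from the learned network and the held-out transitions from the replay buffer:

```python
# Sketch of the diagnostic: track TD error on held-out transitions, not just
# on the training batch. The linear Q-function and random transitions are
# stand-ins for a real Q-network and replay buffer.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))  # toy linear Q: 4-dim state -> 2 action values
gamma = 0.99

def td_errors(states, actions, rewards, next_states):
    q_sa = (states @ W)[np.arange(len(actions)), actions]
    target = rewards + gamma * (next_states @ W).max(axis=1)  # bootstrapped TD target
    return q_sa - target

def batch(n):  # random stand-in transitions (state, action, reward, next state)
    return (rng.normal(size=(n, 4)), rng.integers(0, 2, n),
            rng.normal(size=n), rng.normal(size=(n, 4)))

train_err = np.mean(td_errors(*batch(256)) ** 2)
val_err = np.mean(td_errors(*batch(256)) ** 2)
# TD-overfitting shows up as train_err falling while val_err rises as batch
# size or UTD is pushed up on a small model.
print(f"train TD error: {train_err:.3f}, validation TD error: {val_err:.3f}")
```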

5. Practical Guidance for Compute-Optimal Scaling

  1. Optimize model size ($N$) and UTD ($\sigma$) jointly using the empirical scaling fits, partitioning compute to maximize sample efficiency; a sketch chaining these fits follows this list.
  2. Set the batch size as large as the current $N$ and $\sigma$ permit, and reduce it when increasing UTD.
  3. Use validation TD error to detect TD-overfitting; avoid blindly increasing batch size or UTD on small models.
  4. Jointly scaling model size and UTD is superior to increasing either axis alone, though diminishing returns and broad plateaus of near-optimal configurations are present.
  5. Batch size and learning rate are comparatively insensitive within the compute-optimal region; gross misallocation comes primarily from a poor choice of $N$ or $\sigma$, or from ignoring TD-overfitting.
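A sketch tying the recipe together, under the same hypothetical placeholder constants as above: choose $(\sigma^*, N^*)$ from the budget power laws, then cap the batch size with the TD-overfitting limit $\tilde{B}$:

```python
# End-to-end allocation sketch chaining the (hypothetical) fitted laws:
# budget -> (sigma*, N*) via the power laws, then batch size capped by B~.

def allocate(f_0: float, *, a_F=0.05, alpha_F=0.2, b_F=20.0, beta_F=0.4,
             a_B=4096.0, b_B=1e3, alpha_B=0.5, beta_B=0.5) -> dict:
    sigma = a_F * f_0**alpha_F                    # compute-optimal UTD
    n_params = b_F * f_0**beta_F                  # compute-optimal model size
    batch = a_B / (sigma**alpha_B * (1 + b_B * n_params**(-beta_B)))  # B~ cap
    return {"utd": sigma, "model_size": n_params, "batch_size": int(batch)}

print(allocate(1e14))  # {'utd': ..., 'model_size': ..., 'batch_size': ...}
```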

6. Mathematical Summary of Scaling Laws

$$
\begin{align*}
&\textbf{Compute:} \quad \mathcal{C}_J(\sigma, N) \propto \sigma N \, \mathcal{D}_J(\sigma, N) \\
&\textbf{Data Efficiency:} \quad \mathcal{D}_J(\sigma, N) \approx D_J^{\min} + a_J \sigma^{\alpha_J} + b_J N^{-\beta_J} \\
&\textbf{Batch Size Limit:} \quad \tilde{B}(\sigma, N) \approx \frac{a_B}{\sigma^{\alpha_B} + b_B \cdot \sigma^{\alpha_B} N^{-\beta_B}} \\
&\textbf{Optimal UTD and Model Size:} \quad \begin{cases} \sigma^*_F(F_0) = a_F F_0^{\alpha_F} \\ N^*_F(F_0) = b_F F_0^{\beta_F} \end{cases}
\end{align*}
$$

All coefficients ($a_B$, $b_B$, $a_J$, etc.) are determined by fitting to empirical data from a given RL domain and architecture.

7. Implications and Outlook

The established scaling framework enables practitioners to predict, under constraints on compute or data, the optimal allocation of resources for training value-based deep RL agents with maximal sample efficiency. The key practical deviation from supervised scaling is the phenomenon of TD-overfitting, which mandates careful coordination of batch size, model capacity, and UTD. Increasing either model size or UTD alone improves efficiency only up to the batch-size or TD-overfitting constraint of the given configuration; joint scaling is required for further gains.

In conclusion, compute-optimal scaling in deep RL is characterized by predictable empirical power laws, a critical role for validation-based overfitting diagnostics, and a nuanced tradeoff between scaling axes. It is a direct generalization of supervised Chinchilla-style scaling, but fundamentally constrained by the non-stationary, recursively defined targets unique to TD learning. Practitioners should monitor for TD-overfitting, scale model size and UTD jointly per the empirical scaling laws, and prefer large-batch training only where model capacity and observed validation generalization permit it. These results narrow the gap between best practice in LLM scaling and the demands of large-scale value-based RL.
