Compute-Optimal Scaling in Deep RL
- Compute-optimal scaling strategies are methods for optimally allocating finite computational resources among model capacity, batch size, and update-to-data ratio in deep reinforcement learning.
- They use empirical scaling laws and power-law relationships to balance tradeoffs between model size and UTD, enhancing sample efficiency under fixed compute budgets.
- The framework emphasizes mitigating TD-overfitting by adjusting batch sizes and model capacities based on the dynamic quality of temporal-difference targets.
Compute-optimal scaling strategies define how to allocate finite computational resources between different scaling axes—such as model capacity, batch size, and update-to-data (UTD) ratio—in value-based deep reinforcement learning (RL), with the goal of maximizing sample efficiency or minimizing the time to a target return. These strategies are fundamentally distinct from supervised learning analogues due to the unique structure of temporal-difference (TD) learning, where targets are dynamically generated and coupled to the current Q-network. The following sections synthesize the empirical findings, theoretical laws, and practical recommendations for compute-optimal scaling in value-based deep RL, as established in the study of model and UTD scaling (Fu et al., 20 Aug 2025).
1. Scaling Axes and the Compute-Optimal Allocation Problem
Value-based deep RL admits two principal axes for compute allocation:
- Model Capacity ($N$): The parameter count or width of the Q-network.
- UTD Ratio ($\sigma$): The number of Q-network updates per new environment or data step ("updates-to-data").
Given a fixed compute (FLOPs) budget, the central question is how to partition resources between $N$ and $\sigma$ to achieve the fastest learning or best asymptotic return per sample and per unit compute.
Compute cost to reach a return level $J$ is given, up to hardware- and architecture-dependent constants, by
$$\mathcal{C}_J \;\propto\; N \cdot B \cdot \sigma \cdot \mathcal{D}_J(N, \sigma),$$
where $\mathcal{D}_J(N, \sigma)$ is the amount of data (environment steps) required to reach $J$ at the given settings and $B$ is the batch size: $N \cdot B$ sets the FLOPs per gradient update, while $\sigma \cdot \mathcal{D}_J$ counts the number of updates.
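As a concrete illustration of this accounting, the sketch below computes the training FLOPs implied by a choice of $N$, $\sigma$, and $B$ once $\mathcal{D}_J$ is known. The factor of 6 FLOPs per parameter per example and every number in the demo are illustrative assumptions, not values from the paper.

```python
def compute_cost_to_reach_J(
    n_params: float,         # model capacity N (Q-network parameter count)
    utd: float,              # UTD ratio sigma (gradient updates per environment step)
    batch_size: int,         # batch size B per gradient update
    data_to_reach_J: float,  # D_J: environment steps needed to reach return J
    flops_per_param_per_example: float = 6.0,  # assumed forward+backward cost
) -> float:
    """Total training FLOPs: (FLOPs per update) x (number of updates).

    FLOPs per update scale with N * B; the number of updates is sigma * D_J.
    """
    flops_per_update = flops_per_param_per_example * n_params * batch_size
    num_updates = utd * data_to_reach_J
    return flops_per_update * num_updates


if __name__ == "__main__":
    # Illustrative numbers: a 3M-parameter Q-network, UTD 8, batch 256,
    # assumed to need 500k environment steps to reach the target return.
    c = compute_cost_to_reach_J(3e6, utd=8, batch_size=256, data_to_reach_J=5e5)
    print(f"~{c:.2e} training FLOPs")
```

Because $\mathcal{D}_J$ itself depends on $N$ and $\sigma$ (Section 3), this cost cannot be minimized one axis at a time; the optimum couples the two.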
2. Empirical Structure: Model Size, Batch Size, and UTD Interactions
Distinct Scaling Regimes:
- Small models: Best performance is obtained with small batch sizes; aggressive batch scaling causes overfitting.
- Large models: With increased capacity, larger batch sizes do not degrade, and often improve, Q-value generalization.
- Effect of UTD: At constant model size, a higher UTD ratio $\sigma$ (more updates per data step) reduces the maximum safe batch size.
TD-Overfitting:
- Definition: For small models, increasing batch size reduces training TD error but increases validation TD error, indicating overfitting to low-quality, unstable TD targets.
- Unlike supervised learning, where targets are stationary, the quality of TD targets is limited by the model's own capacity; large-batch TD updates exacerbate this dynamic for small models.
- Resolution: Increase model size, which improves the quality of computed TD targets, allowing larger batch sizes without overfitting.
Maximal Safe Batch Size Formula:
$$B_{\max}(N, \sigma) \;\approx\; B_0 + \beta \, \frac{N^{\alpha_B}}{\sigma^{\gamma_B}},$$
where $B_0$, $\beta$, $\alpha_B$, $\gamma_B$ are empirical constants. This formula encodes the increase of batch-size tolerance with larger $N$ and its decrease with higher $\sigma$.
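A minimal sketch of applying such a rule is shown below. The functional form mirrors the fit above, and the constants `B0`, `beta`, `alpha_B`, `gamma_B` are placeholders that would need to be refit for a specific domain and architecture.

```python
def max_safe_batch_size(n_params: float, utd: float,
                        B0: float = 32.0, beta: float = 0.5,
                        alpha_B: float = 0.4, gamma_B: float = 0.5) -> int:
    """Largest batch size expected to avoid TD-overfitting at model size N and
    UTD ratio sigma, using B_max = B0 + beta * N**alpha_B / sigma**gamma_B.
    All constants are illustrative placeholders, to be refit per domain."""
    # In practice, round the result to a hardware-friendly size (e.g. a power of two).
    return int(B0 + beta * n_params ** alpha_B / utd ** gamma_B)


if __name__ == "__main__":
    # Tolerance grows with model size and shrinks with UTD.
    for utd in (1, 2, 4, 8, 16):
        print(f"UTD={utd:2d}  B_max(3e6 params) ~= {max_safe_batch_size(3e6, utd)}")
```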
3. Empirical Scaling Laws for Compute and Sample Efficiency
Data Efficiency Fit:
$$\mathcal{D}_J(N, \sigma) \;\approx\; \mathcal{D}_0 + \frac{a_N}{N^{\alpha_N}} + \frac{a_\sigma}{\sigma^{\alpha_\sigma}},$$
with all coefficients positive.
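The sketch below shows one way such a law could be fit to measured $(N, \sigma, \mathcal{D}_J)$ triples: since the model is linear in $(\mathcal{D}_0, a_N, a_\sigma)$ once the exponents are fixed, a coarse grid search over exponents combined with ordinary least squares suffices. The measurements here are synthetic, generated from assumed coefficients purely to keep the example self-contained.

```python
import itertools
import numpy as np


def data_efficiency(n_params, utd, D0, aN, aS, alphaN, alphaS):
    """D_J(N, sigma) = D0 + aN * N**(-alphaN) + aS * sigma**(-alphaS)."""
    return D0 + aN * n_params ** (-alphaN) + aS * utd ** (-alphaS)


def fit_data_efficiency(n_params, utd, d_obs):
    """Fit the power law above to measured (N, sigma, D_J) triples.

    For fixed exponents the model is linear in (D0, aN, aS), so we grid-search
    the exponents and solve the inner problem with ordinary least squares.
    """
    best = None
    for alphaN, alphaS in itertools.product(np.linspace(0.1, 1.0, 19), repeat=2):
        A = np.stack([np.ones_like(n_params),
                      n_params ** (-alphaN),
                      utd ** (-alphaS)], axis=1)
        coef, _, _, _ = np.linalg.lstsq(A, d_obs, rcond=None)
        err = float(np.sum((A @ coef - d_obs) ** 2))
        if best is None or err < best[0]:
            best = (err, coef, alphaN, alphaS)
    _, (D0, aN, aS), alphaN, alphaS = best
    return dict(D0=D0, aN=aN, aS=aS, alphaN=alphaN, alphaS=alphaS)


if __name__ == "__main__":
    # Synthetic "measurements": environment steps needed to reach return J at a
    # grid of (N, sigma) settings, generated from assumed coefficients + noise.
    rng = np.random.default_rng(0)
    n_params = np.tile([1e5, 3e5, 1e6, 3e6, 1e7], 4).astype(float)
    utd = np.repeat([1.0, 2.0, 4.0, 8.0], 5)
    d_obs = data_efficiency(n_params, utd, 1e5, 5e7, 4e5, 0.45, 0.6)
    d_obs = d_obs * np.exp(rng.normal(scale=0.05, size=d_obs.shape))
    print(fit_data_efficiency(n_params, utd, d_obs))
```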
Implications:
- Data efficiency improves (i.e., $\mathcal{D}_J$ decreases) as either UTD $\sigma$ or model size $N$ increases, but with diminishing returns.
- At fixed compute or data, the optimal frontier lies at predictable power-law tradeoffs between $\sigma$ and $N$.
Optimal Scaling under a Fixed Data Budget:
When considering both data and compute costs (a combined budget $\mathcal{F} = \mathcal{C}_J + \delta \cdot \mathcal{D}_J$, with $\delta$ the cost of data relative to compute), the optimal allocation follows power laws in the budget,
$$N^*(\mathcal{F}) \;\propto\; \mathcal{F}^{\,\rho_N}, \qquad \sigma^*(\mathcal{F}) \;\propto\; \mathcal{F}^{\,\rho_\sigma},$$
where all parameters are fit on empirical data.
Computational Cost Summary Table
| Quantity | Formula | Interpretation |
|---|---|---|
| Compute to reach $J$ | $\mathcal{C}_J \propto N \cdot B \cdot \sigma \cdot \mathcal{D}_J(N, \sigma)$ | Total training FLOPs required |
| Data efficiency | $\mathcal{D}_J(N, \sigma) \approx \mathcal{D}_0 + a_N N^{-\alpha_N} + a_\sigma \sigma^{-\alpha_\sigma}$ | Data needed to reach $J$ as a function of $N$ and $\sigma$ |
| Maximal safe batch size | $B_{\max}(N, \sigma) \approx B_0 + \beta\, N^{\alpha_B} \sigma^{-\gamma_B}$ | Largest batch size that avoids TD-overfitting |
4. Diagnosing and Managing TD-Overfitting
Mechanism:
- Overfitting in TD learning does not result primarily from memorizing the sampled data, but from fitting poor-quality TD targets, which are themselves a function of limited model capacity and underconstrained target values.
- Passive Critic Experiment: Training a separate network purely to match the TD targets generated by the Q-network shows that generalization is limited primarily by target quality, not just by memorization of sampled data (a sketch of such a probe follows).
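As an illustration of the idea (not necessarily the paper's exact protocol), the sketch below performs one update of such a passive critic in a DQN-style discrete-action setup in PyTorch; all network and tensor names are hypothetical.

```python
import torch


def passive_critic_step(target_net, passive_q, passive_opt, batch, gamma=0.99):
    """One update of a passive critic: a network trained only to match the TD
    targets generated by the main run's target network. If its held-out error
    tracks the main critic's, generalization is limited by target quality
    rather than by memorization of the sampled transitions."""
    with torch.no_grad():
        next_q = target_net(batch["next_obs"]).max(dim=1).values
        td_target = batch["reward"] + gamma * (1.0 - batch["done"]) * next_q
    q_pred = passive_q(batch["obs"]).gather(
        1, batch["action"].long().unsqueeze(1)).squeeze(1)
    loss = torch.mean((q_pred - td_target) ** 2)
    passive_opt.zero_grad()
    loss.backward()
    passive_opt.step()
    return loss.item()
```

In use, `passive_q` would be a fresh Q-network and `passive_opt` an optimizer over its parameters; the probe is fed the same replay batches as the main critic, and its held-out TD error is compared against the main critic's.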
Countermeasures:
- Monitor validation TD error (not just training TD error) as batch size and UTD increase; see the monitoring sketch after this list.
- For small models, reduce batch size as UTD increases; consider increasing model size to unlock larger safe batch sizes.
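A minimal sketch of such a validation monitor, again assuming a DQN-style discrete-action setup in PyTorch, is given below; the held-out transitions must never be used for gradient updates, and all names and sizes are illustrative.

```python
import torch


@torch.no_grad()
def validation_td_error(q_net, target_net, batch, gamma=0.99):
    """Mean squared TD error on held-out transitions. A rising value while the
    training TD error keeps falling is the signature of TD-overfitting."""
    q_pred = q_net(batch["obs"]).gather(
        1, batch["action"].long().unsqueeze(1)).squeeze(1)
    next_q = target_net(batch["next_obs"]).max(dim=1).values
    td_target = batch["reward"] + gamma * (1.0 - batch["done"]) * next_q
    return torch.mean((q_pred - td_target) ** 2).item()


if __name__ == "__main__":
    # Smoke test with random networks and transitions (illustrative only).
    obs_dim, n_actions, n = 8, 4, 256
    make_net = lambda: torch.nn.Sequential(
        torch.nn.Linear(obs_dim, 64), torch.nn.ReLU(), torch.nn.Linear(64, n_actions))
    q_net, target_net = make_net(), make_net()
    held_out = {
        "obs": torch.randn(n, obs_dim),
        "action": torch.randint(0, n_actions, (n,)),
        "reward": torch.randn(n),
        "next_obs": torch.randn(n, obs_dim),
        "done": torch.zeros(n),
    }
    print("validation TD error:", validation_td_error(q_net, target_net, held_out))
```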
5. Practical Guidance for Compute-Optimal Scaling
- Optimize model size ($N$) and UTD ratio ($\sigma$) using the empirical scaling fits, partitioning compute via the fitted formulas to maximize sample efficiency (see the allocation sketch after this list).
- Set the batch size as large as the current $N$ and $\sigma$ permit, but reduce it when increasing the UTD ratio.
- Use validation TD error to detect TD-overfitting—avoid blindly increasing batch/UTD on small models.
- Joint scaling of both model size and UTD is superior to increasing either dimension alone, but diminishing returns set in and extensive "flat" plateaus of near-optimality are present.
- Batch size and learning rate are locally less sensitive within the "compute-optimal" region; gross misallocation primarily comes from an improper $N$ or $\sigma$, or from ignoring TD-overfitting.
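The guidance above can be combined into a simple allocation routine. The sketch below grid-searches $(N, \sigma)$ under a FLOP budget using the illustrative placeholder fits from the earlier sketches, always running at the largest TD-overfitting-safe batch size, and returns the feasible setting projected to need the least data.

```python
import itertools

# Placeholder fitted constants (illustrative only; refit per domain and architecture).
D0, aN, aS, alphaN, alphaS = 1e5, 5e7, 4e5, 0.45, 0.6   # data-efficiency law
B0, beta, alpha_B, gamma_B = 32.0, 0.5, 0.4, 0.5        # max-safe-batch rule


def data_to_reach_J(n_params, utd):
    """Fitted D_J(N, sigma): environment steps projected to reach return J."""
    return D0 + aN * n_params ** (-alphaN) + aS * utd ** (-alphaS)


def max_safe_batch_size(n_params, utd):
    """Fitted B_max(N, sigma): largest batch size that avoids TD-overfitting."""
    return int(B0 + beta * n_params ** alpha_B / utd ** gamma_B)


def flops_to_reach_J(n_params, utd, batch):
    """C_J ~ N * B * sigma * D_J, with an assumed 6 FLOPs per parameter per example."""
    return 6.0 * n_params * batch * utd * data_to_reach_J(n_params, utd)


def best_setting(flop_budget,
                 n_grid=(1e5, 3e5, 1e6, 3e6, 1e7),
                 utd_grid=(1, 2, 4, 8, 16)):
    """Keep grid settings whose projected compute (at the largest safe batch
    size) fits the budget; return the one projected to need the least data."""
    feasible = []
    for n_params, utd in itertools.product(n_grid, utd_grid):
        batch = max_safe_batch_size(n_params, utd)
        if flops_to_reach_J(n_params, utd, batch) <= flop_budget:
            feasible.append((data_to_reach_J(n_params, utd), n_params, utd, batch))
    return min(feasible) if feasible else None


if __name__ == "__main__":
    data, n_params, utd, batch = best_setting(flop_budget=3e15)
    print(f"N={n_params:.0e}, UTD={utd}, batch={batch}, projected data={data:.2e} steps")
```

Even on this coarse grid, a tight budget does not simply select the largest model and the highest UTD; the routine trades the two axes against each other, consistent with the joint-scaling guidance above.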
6. Mathematical Summary of Scaling Laws
The fitted scaling relationships summarized above are
$$\mathcal{C}_J \;\propto\; N \cdot B \cdot \sigma \cdot \mathcal{D}_J(N, \sigma), \qquad \mathcal{D}_J(N, \sigma) \;\approx\; \mathcal{D}_0 + \frac{a_N}{N^{\alpha_N}} + \frac{a_\sigma}{\sigma^{\alpha_\sigma}}, \qquad B_{\max}(N, \sigma) \;\approx\; B_0 + \beta\, \frac{N^{\alpha_B}}{\sigma^{\gamma_B}}.$$
All coefficients ($\mathcal{D}_0$, $a_N$, $a_\sigma$, $\alpha_N$, $\alpha_\sigma$, $B_0$, $\beta$, $\alpha_B$, $\gamma_B$) are determined by fitting to empirical data from a given RL domain and architecture.
7. Implications and Outlook
The established scaling framework enables practitioners to predict, given constraints on compute or data, the optimal allocation of resources for training value-based deep RL with maximal sample efficiency. The key practical deviation from supervised scaling is the phenomenon of TD-overfitting, which mandates careful coordination between batch size, model capacity, and UTD. Increasing either model size or UTD independently improves efficiency only up to the batch size or TD-overfitting constraint for the given configuration; joint scaling is required for further gains.
In conclusion: compute-optimal scaling in deep RL is characterized by predictable empirical power-law scaling, a critical role for validation-based overfitting diagnostics, and a nuanced tradeoff between axes—a direct generalization of supervised Chinchilla-style scaling, but fundamentally constrained by the non-stationarity and recursively-defined targets unique to TD learning. Practitioners should monitor for TD-overfitting, scale both model size and UTD per empirical scaling laws, and prefer large-batch training only where permitted by model capacity and observed validation generalization. These results close the gap between best practice in LLM scaling and the demands of large-scale value-based RL.