Regularized Scaling Recipe

Updated 19 September 2025
  • The regularized scaling recipe is a systematic framework that combines aggressive weight decay, cosine learning rate schedules, and joint hyperparameter tuning to minimize pre-training loss.
  • It follows predictable power-law scaling laws, in which validation loss decreases monotonically with increased model capacity and epoch count when properly regularized.
  • The approach enhances data efficiency and downstream task performance through ensemble averaging and knowledge distillation, making training robust in compute-rich, data-poor regimes.

A regularized scaling recipe is a systematic set of interventions—based on hyperparameter tuning, regularization, model scaling, and ensembling—used to maximize generalization and data efficiency in pre-training LLMs under strict data constraints and unlimited compute. The approach is motivated by the increasing disparity between available web text and the rapid growth of computational capability. As compute far outpaces data acquisition, overfitting becomes the limiting factor, especially when scaling epoch counts and model parameters. The central innovation of the regularized scaling recipe is the use of aggressive and carefully tuned regularization, particularly weight decay, combined with joint scaling of parameters, epochs, and ensembling, to monotonically drive down pre-training loss and improve downstream efficiency. This framework provides robust scaling laws and asymptotic performance bounds that inform practical strategies for future compute-rich, data-poor training regimes (Kim et al., 18 Sep 2025).

1. Aggressive Regularization Strategies

The core of the regularized scaling recipe is the use of weight decay at levels far greater than conventional practice. Commonly adopted values for weight decay are on the order of $0.1$; empirical results show that, for fixed datasets, the optimal weight decay can be as much as $30\times$ higher. This increase is vital for suppressing overfitting when the same data is resampled over additional epochs or when model width is scaled up.
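
For concreteness, here is a minimal PyTorch sketch of an optimizer configured at this aggressive end of the weight-decay range. The specific values only illustrate the roughly $30\times$ increase over $0.1$; the optimizer choice and learning rate are assumptions, not settings taken from the paper.

```python
import torch

# Hypothetical model standing in for an LLM; only the optimizer setup matters here.
model = torch.nn.Linear(1024, 1024)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,           # assumed base learning rate, not the paper's value
    weight_decay=3.0,  # roughly 30x the conventional 0.1, per the recipe above
)
```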

The recipe also incorporates additional regularization-related best practices: cosine learning rate schedules with warmup phases, mixed-precision training, and gradient clipping (at norm 1). These help stabilize training dynamics as model and epoch scales increase. Joint tuning of learning rate, epoch count, and weight decay is performed using local coordinate descent: at each step, only one parameter is varied, and the new configuration is accepted if it yields a lower validation loss. Formally, for a given hyperparameter tuple $H$, local optimality is defined as

$$L[A(D, N, H)] \leq L[A(D, N, H')]$$

for all $H'$ in the immediate neighborhood of $H$ (with one coordinate changed).

This tuning deliberately slows early reductions in loss, so that regularization acts most strongly near the end of training, where the risk of overfitting peaks.
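
As an illustration of the local coordinate descent just described, the sketch below varies one hyperparameter at a time and accepts a neighboring value only if validation loss drops. Here `train_and_evaluate` is a stub standing in for a full pre-training run, and the candidate grids are assumptions rather than the paper's actual search space.

```python
import random

def train_and_evaluate(lr, epochs, weight_decay):
    """Stub for a full pre-training run; returns a fake validation loss."""
    random.seed(hash((lr, epochs, weight_decay)) % 2**32)
    return random.uniform(3.3, 3.7)

def coordinate_descent(objective, init, neighbors, max_rounds=5):
    current, best = dict(init), objective(**init)
    for _ in range(max_rounds):
        improved = False
        for name, propose in neighbors.items():
            for value in propose(current[name]):      # vary one coordinate only
                trial = {**current, name: value}
                loss = objective(**trial)
                if loss < best:                       # accept only on improvement
                    current, best, improved = trial, loss, True
        if not improved:                              # locally optimal: no single
            break                                     # coordinate change helps
    return current, best

config, loss = coordinate_descent(
    objective=train_and_evaluate,
    init={"lr": 3e-4, "epochs": 16, "weight_decay": 0.1},
    neighbors={
        "lr": lambda v: (v / 2, v * 2),
        "epochs": lambda v: (v // 2, v * 2),
        "weight_decay": lambda v: (v / 2, v * 2),
    },
)
print(config, round(loss, 3))
```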

2. Model and Epoch Scaling Laws

Under optimal regularization, validation loss exhibits predictable power-law scaling with respect to model capacity. The relationship can be fit as

$$L(D, N) \approx \frac{A_D}{N^{\alpha_D}} + E_D$$

where $N$ is the number of parameters, $A_D$ and $\alpha_D$ are fitted coefficients, and $E_D$ is the loss asymptote for data size $D$.

This scaling remains monotonic if epoch count, learning rate, and weight decay are simultaneously tuned; otherwise, unregularized scaling leads to early overfitting, violating power-law behavior. The "loss asymptote" $E_D$ thus provides a well-defined measure for the limit of achievable performance under infinite compute and fixed data.

Analogous scaling laws apply to ensembles (see below), and composite laws emerge when both model size and number of ensemble members are scaled.
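
The saturating power law above can be recovered directly from (parameter count, validation loss) measurements. Below is a sketch using `scipy.optimize.curve_fit`; the data points are purely illustrative and are not values from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n_params_m, A, alpha, E):
    # L(D, N) ~ A_D / N^alpha_D + E_D for a fixed dataset D (N in millions here)
    return A / n_params_m**alpha + E

# Illustrative (model size in millions of parameters, validation loss) pairs.
n_params_m = np.array([50.0, 150.0, 300.0, 600.0])
val_loss = np.array([3.83, 3.56, 3.50, 3.46])

(A, alpha, E), _ = curve_fit(scaling_law, n_params_m, val_loss, p0=[10.0, 0.5, 3.0])
print(f"fit: A_D={A:.1f}, alpha_D={alpha:.2f}, loss asymptote E_D={E:.2f}")
```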

3. Ensemble Averaging for Lower Asymptote

The recipe incorporates ensembling as a means to further reduce loss below the single-model regularized asymptote. An ensemble of $K$ independent models, trained with different initialization seeds and data order, is formed; at inference, the logits of the $K$ models are averaged ("logit averaging"). Formally,

$$E_A(D, N, K, H) = \text{logitavg}\left(\{A(D, N, Z_i, H)\}_{i=1}^{K}\right)$$

where $Z_i$ indexes each run's randomness.

Empirically, as $K$ increases, the limiting validation loss approaches $3.34$ for 300M-parameter models, significantly below the single-model asymptote ($3.43$). When ensemble-specific hyperparameters are tuned (e.g., doubled epoch count, halved weight decay), further reduction is realized, reaching $3.27$ in certain scenarios. The ensemble scaling law shows loss decreasing roughly as $1/K$, corresponding to aggregation over diverse views not captured in any single run.
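
A minimal sketch of logit averaging at inference time, assuming each ensemble member is a PyTorch module mapping token ids to next-token logits; the member models themselves are not constructed here and their names are placeholders.

```python
import torch

@torch.no_grad()
def ensemble_logits(members, input_ids):
    """Average the logits of K independently trained members ("logit averaging")."""
    stacked = torch.stack([model(input_ids) for model in members])  # (K, ..., vocab)
    return stacked.mean(dim=0)

# Usage sketch (members and batch are assumed to exist):
# next_token_probs = torch.softmax(ensemble_logits(members, batch), dim=-1)
```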

4. Data Efficiency and Knowledge Distillation

Compared to standard repeated-epoch recipes (which quickly overfit), the regularized scaling recipe achieves far greater data efficiency: at 200M tokens, the optimal recipe achieves the same loss using $5.17\times$ less data than the baseline. This improvement persists with larger datasets due to the monotonic nature of regularized scaling laws.

Furthermore, knowledge distillation allows the ensemble benefit to be efficiently "compressed". Distilling an ensemble of $8$ independently trained 300M-parameter models yields a student model that is $8\times$ smaller, retaining approximately $83\%$ of the ensemble's improvement. Self-distillation (using a teacher to generate pseudo-labels for a student) further boosts efficiency.
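
A sketch of the standard soft-target distillation objective one could use to compress the ensemble into a single student. The temperature and the use of a pure KL loss are assumptions for illustration, not settings confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the student's."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # Multiply by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean") * temperature**2

# Usage sketch: the teacher logits could come from the ensemble above, e.g.
# loss = distillation_loss(student(batch), ensemble_logits(members, batch))
```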

5. Performance Metrics and Downstream Generalization

Performance is primarily measured by training and validation loss asymptotes. Regularized scaling and ensembling combine to produce lower asymptotic loss than unregularized or single-model recipes. In downstream evaluations (e.g., PIQA, SciQ, ARC Easy), the best interventions yield a $9\%$ improvement in pre-training evaluation and a $17.5\times$ improvement in data efficiency compared to standard continued pre-training.

This suggests that regularized scaling recipes designed for unsupervised training loss transfer robustly to real-world tasks, not just artificial validation curves.

6. Practical Implications and Future Directions

In data-constrained, compute-rich regimes, the regularized scaling recipe enables practitioners to avoid overfitting—allowing continuous improvement as epochs, model size, or ensemble count increases, bounded only by the asymptotic limits established by scaling laws. Rather than simply increasing model capacity or repeating data, joint hyperparameter tuning and ensemble methods are necessary for optimal data use.

A plausible implication is that knowledge distillation may become increasingly central, providing a pathway to encode ensemble generalization into tractable model sizes for deployment. Additionally, the recipe establishes benchmarks for loss asymptotes under infinite compute and fixed data, supporting more principled planning of large-scale pre-training investments.

Table: Asymptotic Loss Under Different Scaling Strategies

| Strategy | Loss Asymptote | Data Efficiency Gain |
|---|---|---|
| Standard (unregularized) | overfits (>3.6) | 1x |
| Regularized, tuned | 3.43 | 5.17x |
| Ensemble (K → ∞) | 3.34–3.27 | >5x |
| Distillation (student) | ~3.38 | 8x model compression |

The table reflects values reported for single-model, ensemble, and distilled settings at 200M tokens and 300M parameters (Kim et al., 18 Sep 2025).

Summary

The regularized scaling recipe defines a systematic framework for maximizing generalization and data efficiency in large-scale pre-training under fixed data and unlimited compute. By combining aggressive hyperparameter tuning, heavy regularization, and ensemble methods—along with judicious application of distillation—the approach monotonically reduces loss following robust scaling laws, yielding substantial improvements over naive scaling or repeated epoching. This methodology provides a principled basis for pre-training strategy in future compute-rich, data-constrained environments.

References

Kim et al., 18 Sep 2025.
