Pre-training Scaling Problem
- The pre-training scaling problem is the challenge of predicting neural model performance as architectures, data volumes, and sparsity schedules scale up.
- It encompasses unified scaling laws that span dense and sparse paradigms, multi-stage bootstrapping, domain-specific mixtures, and transfer learning between synthetic and real data.
- Methodological innovations like optimal pruning schedules, robust nonlinear fitting, and pilot experiments guide efficient resource allocation and model-data trade-offs.
The pre-training scaling problem refers to the challenge of accurately modeling, predicting, and optimizing the final performance of neural pretraining pipelines—across model architectures, data regimes, and task domains—as networks and datasets increase in size and complexity. This problem encompasses both the empirical and theoretical development of scaling laws, the breakdown of such laws in specialized settings, and the principled calibration of resources for effective training across dense, sparse, multi-stage, multi-domain, and multi-modality paradigms. Central to this issue is the search for universal or regime-specific scaling behaviors that can guide decisions on model size, data quantity, sparsity schedules, domain data mixtures, and training configurations for the next generation of foundation models.
1. Central Mathematical Scaling Laws
At the core of the pre-training scaling problem are scaling laws: empirical relationships that relate model performance metrics (e.g., cross-entropy loss, classification error) to the number of model parameters $N$, the number of pre-training tokens $D$, or other regime-specific quantities. The archetypal dense-regime law for LLMs is the Chinchilla scaling law:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},$$

where $E$, $A$, and $B$ are fitted constants, and $\alpha$ and $\beta$ are exponents for model and data size, respectively. This law has been generalized to account for variable parameter counts and regime transitions. For sparse pre-training, the law is unified by replacing $N$ with the average parameter count $\bar{N}$ over the full pre-training trajectory:

$$L(\bar{N}, D) = E + \frac{A}{\bar{N}^{\alpha}} + \frac{B}{D^{\beta}},$$

where

$$\bar{N} = \frac{1}{T}\int_{0}^{T} N(t)\,dt,$$

with $N(t)$ the parameter count at training step $t$ and $T$ the total number of steps, captures the effective model capacity expended during the training process. Empirical results confirm this formulation achieves an accurate fit across a wide grid of sparsity schedules and data budgets, yielding a unified dense/sparse regime description (Jin et al., 21 Jan 2025).
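The following minimal Python sketch illustrates the average-parameter-count construction: it computes $\bar{N}$ for a hypothetical gradual-pruning schedule (a cubic ramp between 25% and 75% of training, loosely matching the prescription in Section 3, though the exact schedule used by Jin et al. may differ) and evaluates a Chinchilla-form loss with placeholder constants rather than any fitted values.

```python
import numpy as np

def sparsity_schedule(t_frac, s_final=0.75, start=0.25, end=0.75):
    """Illustrative cubic gradual-pruning schedule (hypothetical): sparsity
    ramps from 0 to s_final between `start` and `end` fractions of training,
    then stays constant for the remaining 'recovery' phase."""
    if t_frac <= start:
        return 0.0
    if t_frac >= end:
        return s_final
    progress = (t_frac - start) / (end - start)
    return s_final * (1.0 - (1.0 - progress) ** 3)

def average_param_count(n_dense, n_steps=10_000, **schedule_kwargs):
    """Average parameter count over the trajectory, N_bar = (1/T) * integral of N(t) dt,
    with N(t) = (1 - sparsity(t)) * N_dense, approximated by a discrete mean."""
    t = np.linspace(0.0, 1.0, n_steps)
    n_t = np.array([(1.0 - sparsity_schedule(tf, **schedule_kwargs)) * n_dense for tf in t])
    return n_t.mean()

def unified_loss(n_bar, d_tokens, E=1.8, A=300.0, B=500.0, alpha=0.33, beta=0.29):
    """Chinchilla-form loss with N replaced by the average parameter count.
    All constants here are placeholders, not fitted values from any paper."""
    return E + A / n_bar**alpha + B / d_tokens**beta

n_bar = average_param_count(n_dense=1e9)  # 1B dense parameters pruned to 75% sparsity
print(f"average parameter count: {n_bar:.3e}")
print(f"predicted loss: {unified_loss(n_bar, d_tokens=2e10):.3f}")
```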
For certain transfer and multi-stage settings, scaling law forms become inherently multi-variate. In bootstrapped pre-training (e.g., continual pre-training or model growth), loss follows a joint law $L(D_1, D_2)$, where $D_1$ and $D_2$ are the tokens seen during first- and second-stage pre-training, respectively, and the "effective exponent" governing returns on additional stage-2 tokens decays logarithmically in $D_1$ (Liew et al., 8 Oct 2025).
For domain-specific continual pre-training, the D-CPT Law models loss as an explicit function $L(N, D, r)$ of model size, token count, and the mixture ratio $r$ of domain to general-corpus data, allowing for resource-optimal mixing with minimal pilot data (Que et al., 3 Jun 2024).
Transfer learning between synthetic and real data is captured by a joint power law linking the synthetic pre-training set size $n$ and the real fine-tuning set size $m$ to transfer performance (Mikami et al., 2021).
2. Regimes of Scaling Law Validity, Breakdown, and Empirical Support
While many settings demonstrate precise power-law or log-linear scaling, several notable breakdowns and regime shifts have been observed:
- Sparse vs. Dense LLM Pre-training: The average-parameter-count law above unifies both; dense and sparse schedules with matched $\bar{N}$ land on the same loss curves, allowing flexible allocation between training compute, inference cost, and sparsity (Jin et al., 21 Jan 2025).
- Multi-stage Pre-training Saturation: Bootstrapped scaling efficiency (the exponent governing returns to more data) decays logarithmically with the length of pre-training. There is thus a regime where continued pre-training gives rapidly diminishing returns, overtaken (at large enough budgets) by training from scratch on the combined data (Liew et al., 8 Oct 2025).
- Domain Continual Pre-training: The D-CPT Law accommodates arbitrary mixture ratios, model sizes, and token budgets, predicting both general and domain-specific loss surfaces. With cross-domain extension, a single domain difficulty coefficient suffices to extrapolate the scaling curve to unseen domains (Que et al., 3 Jun 2024).
- Breakdown of Power-Law Scaling: In all-atom geometric GNNs, pre-training loss initially follows a power law in model size $N$ but rapidly plateaus beyond a threshold parameter count; further increases yield minimal gain, in stark contrast to Transformers in language/vision (Pengmei et al., 29 Oct 2024).
- Minimal Synthetic Datasets: For Vision Transformers, synthetic pre-training with a single self-similar fractal and subtle perturbations achieves performance on par with million-image real datasets; increasing the synthetic set size beyond a single highly informative image can actually decrease transfer performance—a phenomenon termed "scaling backwards" (Nakamura et al., 1 Aug 2024).
3. Methodological Innovations for Scaling Law Calibration
- Optimal Pruning Schedules: For sparse pre-training, the near-optimal trajectory is to begin pruning at 25% and finish at 75% of total compute, reserving the final 25% for sparse "recovery" (Jin et al., 21 Jan 2025).
- Parameterization and Fitting: Nonlinear least-squares fitting with a robust loss (e.g., Huber) accurately recovers scaling law parameters; log-log regression is used for exponent estimation wherever power-law structure holds (Mikami et al., 2021, Jin et al., 21 Jan 2025). A minimal fitting sketch follows this list.
- Pilot Runs and Cross-Domain Law: In domain-specific CPT, a handful of pilot experiments suffice to fit the entire loss surface $L(N, D, r)$ for arbitrary mixture ratios $r$, and one or two additional short runs suffice to extend the fit to unseen domains via a learnability coefficient (Que et al., 3 Jun 2024).
- Scaling Law Utility Estimation for Data Sources: Resource allocation for domain-specific data sources is driven by marginal utility curves—power-law or log-linear fits predict how much utility (e.g., Brier score lift) will be gained by drawing more tokens from each source at different compute budgets, capturing cross-over points and guiding mixture proportions (Ostapenko et al., 29 Jul 2025).
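A minimal sketch of the robust fitting procedure, assuming SciPy is available: a Chinchilla-form law is fit to synthetic pilot-run data with a Huber loss via nonlinear least squares. The data, initial guesses, and constants are illustrative placeholders, not values from the cited papers.

```python
import numpy as np
from scipy.optimize import least_squares

def chinchilla_loss(params, n, d):
    """Chinchilla-form prediction L(N, D) = E + A/N^alpha + B/D^beta.
    A, B, E are parameterized by their logs to keep them positive."""
    log_a, log_b, log_e, alpha, beta = params
    return np.exp(log_e) + np.exp(log_a) / n**alpha + np.exp(log_b) / d**beta

def residuals(params, n, d, observed):
    # Residuals in log space, a common choice for stabilizing the fit.
    return np.log(chinchilla_loss(params, n, d)) - np.log(observed)

# Toy "observations": a small grid of (N, D, loss) triples. Real usage would
# substitute measured losses from pilot training runs.
rng = np.random.default_rng(0)
n_grid, d_grid = np.meshgrid(np.logspace(7, 9, 6), np.logspace(9, 11, 6))
n_obs, d_obs = n_grid.ravel(), d_grid.ravel()
true = chinchilla_loss([np.log(300.0), np.log(500.0), np.log(1.8), 0.33, 0.29], n_obs, d_obs)
loss_obs = true * np.exp(rng.normal(scale=0.01, size=true.shape))  # multiplicative noise

fit = least_squares(
    residuals,
    x0=[np.log(100.0), np.log(100.0), np.log(1.0), 0.3, 0.3],
    args=(n_obs, d_obs, loss_obs),
    loss="huber",      # robust loss downweights outlier runs
    f_scale=0.05,
)
log_a, log_b, log_e, alpha, beta = fit.x
print(f"alpha={alpha:.3f}, beta={beta:.3f}, E={np.exp(log_e):.3f}")
```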
4. Practical Prescriptions and Optimization
Unified scaling laws enable optimization under diverse practical constraints:
- Fixed Compute: Solve $\min_{N, D} L(N, D)$ subject to a fixed compute budget $C(N, D) = C_0$ for the optimal model-data allocation (Jin et al., 21 Jan 2025); a grid-search sketch follows this list.
- Domain Mixture Optimization: Minimize domain validation loss under a constraint of no more than a chosen percentage degradation on generic data, using the D-CPT Law to solve for the optimal mixture ratio $r$ (Que et al., 3 Jun 2024).
- Ensembling and Regularization in Data-Limited Regimes: Combining aggressive regularization (e.g., roughly 30× the standard weight decay), high epoch counts, parameter scaling, and model ensembling achieves lower loss asymptotes than naïve parameter expansion, and distillation compresses ensemble gains back into compact student models (Kim et al., 18 Sep 2025).
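As a concrete illustration of the fixed-compute prescription, the sketch below grid-searches model size under a frozen FLOP budget using the common dense-Transformer approximation of roughly $6ND$ training FLOPs; the fitted constants are placeholders and the compute model is an assumption, not the exact accounting used in the cited work.

```python
import numpy as np

def fitted_loss(n, d, E=1.8, A=300.0, B=500.0, alpha=0.33, beta=0.29):
    """A fitted Chinchilla-form law; the constants are placeholders, to be
    replaced by values recovered from a robust fit such as the one above."""
    return E + A / n**alpha + B / d**beta

def optimal_allocation(compute_flops, n_points=2000):
    """Grid search for the (N, D) pair minimizing the fitted loss under a fixed
    compute budget, assuming training FLOPs are approximately 6 * N * D."""
    n_candidates = np.logspace(7, 12, n_points)          # candidate model sizes
    d_candidates = compute_flops / (6.0 * n_candidates)  # tokens implied by the budget
    losses = fitted_loss(n_candidates, d_candidates)
    best = np.argmin(losses)
    return n_candidates[best], d_candidates[best], losses[best]

n_star, d_star, l_star = optimal_allocation(compute_flops=1e21)
print(f"N* = {n_star:.2e} params, D* = {d_star:.2e} tokens, predicted loss = {l_star:.3f}")
```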
5. Transfer, Modality, and Architecture-Specific Scaling Phenomena
- Scientific ML: Foundation models for scientific tasks follow steeper power-law exponents upon fine-tuning than scratch-trained models, yielding substantial data-efficiency advantages in high-fidelity PDE surrogates (Subramanian et al., 2023).
- Multi-modal & Multitask Scaling: In multi-modal pre-training, scaling is blocked by label/granularity noise from cross-modality misalignment. Gradient harmonization and curriculum learning based on conflict metrics restore stable positive scaling (Wu et al., 2022). In multi-task LLMs, task prefix methods allow scaling to dozens of tasks without catastrophic negative transfer, enabling both efficient scaling and data-driven task pruning (Zhang et al., 2022).
- Nonlinear or Hybrid Scaling Laws: Certain settings (e.g., bootstrapped pre-training, synthetic to real transfer) require scaling laws with interaction terms or multiple exponents depending on the cumulative amount of pre-training and architectural factors (Liew et al., 8 Oct 2025, Mikami et al., 2021).
6. Limitations, Open Problems, and Future Directions
- Breakdown of Acceleration: In geometric GNNs, architectural limitations and data diversity bottlenecks cause early saturation and invalidate scaling extrapolations. Advances in architecture (e.g., higher-body equivariant models, dynamic receptive fields) and objectives (e.g., topological contrastive, chemical space expansion) are necessary to recover scaling (Pengmei et al., 29 Oct 2024).
- Data and Compute Scaling: In settings such as synthetic chaotic time series for financial forecasting, predictive capability over longer horizons demands quadratic scaling of the required sample size in the forecast horizon (roughly $n \propto h^{2}$ to achieve a fixed correlation at horizon $h$). The universality of such relationships across domains is still under investigation (Takemoto, 5 Sep 2025).
- Modality-Specific Bottlenecks: Scaling speech-LLMs to text-LLM size is fundamentally constrained by lack of public speech data; synthesis of interleaved speech-text data and discrete tokenization strategies overcome this, but generalization to other modalities (e.g., video, 3D) has specific scaling bottlenecks (Zeng et al., 26 Nov 2024, Chen et al., 19 Aug 2024).
- Hyperparameter Sensitivity: Scaling law performance can be contingent on careful optimization of secondary parameters (e.g., sparsity trajectory, mixture ratio, regularization strength), yet practical recipes have been identified for replicable gains (Jin et al., 21 Jan 2025, Que et al., 3 Jun 2024, Kim et al., 18 Sep 2025).
- Extrapolation Risks: Point estimation at a single compute budget is unreliable; because of rank-inversion phenomena, only fits anchored at multiple points along the full scaling curve can robustly inform resource allocation or data-mixing strategies (Ostapenko et al., 29 Jul 2025).
In sum, the pre-training scaling problem encapsulates both the promise and complexity of systematizing training for high-capacity models in ever larger and more heterogeneous data regimes. It has driven the creation of unified and flexible scaling laws (in dense, sparse, transfer, and multi-task contexts), clarified regimes of rapid acceleration and early saturation, and seeded robust methodologies for principled scheduling of model, data, and compute resources. Ongoing research continues to refine these laws across new architectures, tasks, and modalities, emphasizing the interplay between theory-driven fitting, empirical calibration, and practical optimization.