Scaling Laws for Data Mixture Optimization
- The paper introduces mathematical formulations, grounded in convex optimization, that rigorously predict domain-specific losses under varied data mixtures.
- It employs mixture-aware scaling laws to fit key parameters from low-cost experiments, achieving precise extrapolation with mean relative errors below 1%.
- The framework enables automated data mixture selection to optimize performance trade-offs across heterogeneous domains and dynamic training scenarios.
Scaling laws for data mixture optimization govern the quantitative relationship between domain mixture ratios and model generalization performance during foundation model pretraining and fine-tuning. Recent advances rigorously formalize this relationship in the context of neural scaling laws, enabling the principled and automated selection of data mixtures that maximize or trade off performance across heterogeneous domains, languages, or modalities, under fixed compute or data budgets. Mixture-aware scaling laws are now a central tool in state-of-the-art LLM and foundation model pipelines, allowing systematic efficiency gains over heuristic, grid-search, or purely intuition-driven approaches.
1. Mathematical Formulations of Mixture-aware Scaling Laws
Across leading works, the canonical mixture optimization problem is formulated as follows:
Given disjoint data domains $D_1, \dots, D_m$, mixture weights $w = (w_1, \dots, w_m)$ on the simplex $\Delta^{m-1} = \{w : w_i \ge 0,\ \sum_i w_i = 1\}$, and a fixed token budget $D$, one seeks

$$w^{\ast} = \arg\min_{w \in \Delta^{m-1}} \sum_{i=1}^{m} L_i\big(D_i^{\mathrm{eff}}(w)\big),$$

where $L_i$ is the expected held-out loss on a domain-specific validation set as a function of the effective data $D_i^{\mathrm{eff}}$ received by domain $i$. All recent frameworks (Li et al., 16 Aug 2025, Shukor et al., 12 Jul 2025, Ye et al., 2024, Li et al., 9 Mar 2026) incorporate two key effects:
- In-domain and transferred data: Effective data for domain $i$ generally comprises the direct allocation, $w_i D$, and sublinearly transferred data from all other domains,

$$D_i^{\mathrm{eff}}(w) = w_i D + \sum_{j \neq i} c_{ij}\,(w_j D)^{\gamma_{ij}},$$

where $\gamma_{ij} < 1$ reflects diminishing transfer (Li et al., 16 Aug 2025).
- Power-law (or exponential) decay of loss in effective data size:

$$L_i = E_i + A_i \big(D_i^{\mathrm{eff}}\big)^{-\alpha_i}$$

or

$$L_i = E_i + B_i \exp\big(-\lambda_i D_i^{\mathrm{eff}}\big),$$

as in exponential "data mixing laws" (Ye et al., 2024).
More complex mixture-aware scaling laws capture interactions between mixture coefficients, model size, and token budget, e.g., Chinchilla-style laws with mixture-dependent coefficients,

$$L_i(N, D, w) = E_i + A_i(w)\, N^{-\alpha_i} + B_i(w)\, D^{-\beta_i},$$

with model size $N$ and token budget $D$, and their "joint" generalization in which the exponents also depend on $w$ (Shukor et al., 12 Jul 2025), as well as capacity-aware laws that optimize an explicit resource allocation under mixture (Li et al., 9 Mar 2026).
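To make the formulation concrete, here is a minimal numerical sketch of the effective-data and power-law components defined above. The functional forms follow the equations in this section; all parameter values (`E`, `A`, `alpha`, the transfer matrix `C`, and the exponent `gamma`) are illustrative placeholders, not fitted coefficients from any cited paper.

```python
import numpy as np

def effective_data(w, D, C, gamma):
    """Effective tokens per domain: direct allocation w_i * D plus
    sublinearly transferred tokens c_ij * (w_j * D)**gamma from other domains."""
    direct = w * D
    return direct + C @ (direct ** gamma)  # C has zero diagonal

def domain_losses(w, D, E, A, alpha, C, gamma):
    """Power-law decay in effective data: L_i = E_i + A_i * (D_i_eff)**(-alpha_i)."""
    d_eff = effective_data(w, D, C, gamma)
    return E + A * d_eff ** (-alpha)

# Illustrative parameters for m = 3 domains (placeholders, not fitted values).
m, D = 3, 1e9
E = np.array([1.8, 2.1, 1.5])             # irreducible per-domain losses
A = np.array([2e3, 3e3, 1e3])             # power-law amplitudes
alpha = np.array([0.35, 0.30, 0.40])      # decay exponents
gamma = 0.5                               # sublinear transfer exponent
C = 0.05 * (np.ones((m, m)) - np.eye(m))  # weak off-diagonal transfer

w = np.full(m, 1.0 / m)                   # uniform mixture
print(domain_losses(w, D, E, A, alpha, C, gamma))
```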
2. Parameterization, Fitting, and Extrapolation
Law parameters $(E_i, A_i, \alpha_i)$ or $(E_i, B_i, \lambda_i)$, together with the transfer coefficients $c_{ij}, \gamma_{ij}$, must be fit using small-scale proxy runs:
- For each domain, run low-cost fine-tuning or pretraining experiments varying the mixture weights $w$.
- Record the corresponding validation losses on domain $i$.
- Fit domain-wise power/exponential-law parameters using a robust loss (e.g., Huber) (Li et al., 16 Aug 2025, Ye et al., 2024).
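A hedged sketch of this fitting step, for a single domain: given (effective tokens, observed loss) pairs from proxy runs, the power-law coefficients $(E_i, A_i, \alpha_i)$ are recovered by minimizing a Huber penalty on residuals. The log-parameterization of $A$ and the Nelder-Mead optimizer are implementation conveniences, not prescribed by the cited papers.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import huber  # huber(delta, r): Huber penalty of residual r

def fit_power_law(d_eff, losses, delta=1e-2):
    """Fit L = E + A * d**(-alpha) to proxy-run (tokens, loss) pairs,
    using a Huber penalty for robustness to noisy small-scale runs."""
    def objective(theta):
        E, logA, alpha = theta
        pred = E + np.exp(logA) * d_eff ** (-alpha)
        return huber(delta, pred - losses).sum()
    theta0 = np.array([losses.min(), np.log(np.ptp(losses) + 1e-6), 0.3])
    return minimize(objective, theta0, method="Nelder-Mead").x

# Synthetic proxy data (illustrative only, not from any cited paper).
rng = np.random.default_rng(0)
d = np.logspace(6, 8, 12)                        # effective tokens per run
obs = 1.7 + 500.0 * d ** (-0.32) + rng.normal(scale=0.01, size=d.size)
E_hat, logA_hat, alpha_hat = fit_power_law(d, obs)
print(E_hat, np.exp(logA_hat), alpha_hat)
```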
Scaling-law universality underpins extrapolation: parameters estimated from small- or moderate-scale runs reliably predict loss at larger compute, model size, or unseen mixtures, e.g., mean relative errors below 1% for LLMs at the 1–7B scale, extrapolating up to 55B (Shukor et al., 12 Jul 2025, Li et al., 9 Mar 2026). Bayesian and multi-fidelity approaches further model epistemic uncertainty when extrapolating to high-cost regimes (Yen et al., 26 Mar 2025).
Mixture optimization then becomes a constrained convex problem with closed-form gradients and simplex projectors, supporting domains ranging from code and natural language to multimodal and multilingual data (Li et al., 16 Aug 2025, Shukor et al., 12 Jul 2025, Cao et al., 18 Mar 2026).
3. Convex Optimization and Existence of Unique Minima
The mathematical structure of mixture-aware laws supports convexity, ensuring global minimizers exist and efficient optimization is tractable:
- Each domain's loss as a function of its mixture weight is convex, being the composition of a convex, non-increasing power/exponential function with a concave or linear argument (Li et al., 16 Aug 2025, Ye et al., 2024).
- Summing over domains preserves convexity under the simplex constraint $w_i \ge 0$, $\sum_i w_i = 1$.
- For mixture laws coupled to model size, joint convexity in $(w, N)$ also holds under mild "homogeneity" assumptions (Li et al., 9 Mar 2026).
Optimization is commonly solved via projected gradient methods, sequential least-squares programming (SLSQP), or mirror descent on the simplex (Shukor et al., 12 Jul 2025, Li et al., 16 Aug 2025, Cao et al., 18 Mar 2026). KKT conditions guarantee that no domain with positive marginal benefit is starved.
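As a sketch of this optimization step, the fitted laws can be handed to scipy's SLSQP solver with the simplex expressed as an equality constraint plus box bounds; this reuses `domain_losses` and the illustrative parameters from the Section 1 sketch, so the resulting mixture is illustrative, not a recommendation.

```python
import numpy as np
from scipy.optimize import minimize

def optimize_mixture(E, A, alpha, C, gamma, D, m):
    """Minimize total predicted loss over the probability simplex via SLSQP."""
    def total_loss(w):
        return domain_losses(w, D, E, A, alpha, C, gamma).sum()
    return minimize(total_loss, np.full(m, 1.0 / m), method="SLSQP",
                    bounds=[(0.0, 1.0)] * m,
                    constraints=[{"type": "eq",
                                  "fun": lambda w: w.sum() - 1.0}]).x

w_star = optimize_mixture(E, A, alpha, C, gamma, D, m)
print("predicted-optimal mixture:", np.round(w_star, 3))
```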
4. Empirical Results and Validation
Empirical studies confirm the predictive and prescriptive power of scaling-law-based mixture optimization:
- Supervised fine-tuning: Reweighting large public SFT datasets (Tulu3, Orca) decreases average PPL from 2.355 to 2.150 with improved downstream scores on MMLU, AGIEval, etc.; mixture-optimized models are on average only marginally worse than the best grid-search mixture (Li et al., 16 Aug 2025).
- Foundation model pretraining: Mixture scaling laws predict optimal domain weights with low mean relative error across LLM, NMM, and LVM benchmarks (Shukor et al., 12 Jul 2025); large models (55B) trained once with optimized mixtures achieve weighted-average accuracy gains of up to 3% at half the compute compared to baselines (Li et al., 9 Mar 2026).
- Continual pretraining: Power-law fits of loss as a function of the domain/general mixing ratio precisely predict the critical ratio for balanced retention/transfer, e.g., 29.8–47.8% domain data at different model sizes (Gu et al., 2024); see the sketch after this list.
- Multilingual/game-theoretic frameworks: Shapley value–augmented scaling laws outperform prior mixture heuristics in loss prediction and downstream few-shot accuracy; convex optimization over the simplex remains efficient even with cross-lingual transfer (Cao et al., 18 Mar 2026).
- Synthetic/natural mixtures: The empirically optimal ratio of rephrased synthetic content is 30% across model sizes $N$ and data budgets $D$; tokens to reach the loss plateau are reduced by up to 10× (Kang et al., 2 Oct 2025).
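To illustrate the continual-pretraining bullet above, here is a hypothetical sketch of how a critical mixing ratio falls out of two fitted power laws: domain loss decreasing in the domain fraction $r$, general loss increasing as the general share $1 - r$ shrinks. The functional forms and coefficients are assumptions for illustration, not the fits reported by Gu et al. (2024).

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical power-law fits in the domain mixing ratio r (illustrative
# coefficients): domain loss falls with r, general loss rises as 1 - r shrinks.
L_domain = lambda r: 1.9 + 0.8 * (r + 1e-3) ** (-0.25)
L_general = lambda r: 2.1 + 0.5 * (1.0 - r + 1e-3) ** (-0.25)

# Critical ratio: minimizer of the combined retention/transfer objective,
# where marginal domain gain balances marginal general-loss degradation.
res = minimize_scalar(lambda r: L_domain(r) + L_general(r),
                      bounds=(0.0, 1.0), method="bounded")
print(f"critical domain ratio: {res.x:.1%}")
```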
5. Extensions: Domain- and Scenario-specific Optimization
Scaling law frameworks generalize to a broad range of mixture selection tasks:
- Domain-focused objectives: Loss summands can be reweighted according to business or application importance, enabling the design of custom domain-specialized or even legal/medical-centric LLMs with improved held-out or benchmark performance (Li et al., 16 Aug 2025, Li et al., 9 Mar 2026); a minimal reweighting sketch follows this list.
- Synthetic/real tradeoffs: Mixture laws allow efficient balancing of synthetic vs. natural data allocation to mitigate data supply constraints and prevent model collapse, with optimal synthetic fractions typically near 30% at large scales (Kang et al., 2 Oct 2025).
- Continual/dynamic scheduling: Laws generalize to continual pretraining, predicting critical mixture ratios that maximally transfer new domain knowledge while preventing catastrophic forgetting of general skills (Gu et al., 2024, Bethune et al., 9 Feb 2025). The same principles enable curriculum learning and adaptive schedules (Ye et al., 2024).
- Uncertainty-aware search: Probabilistic multi-fidelity frameworks (GP-based) adaptively allocate expensive mixture trials to maximize information gain, achieving up to 3.3× speedup in mixture identification for large-model targets (Yen et al., 26 Mar 2025).
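The reweighting sketch promised above: specializing the objective only requires scaling each loss summand by a non-negative importance weight, which preserves convexity. This reuses `domain_losses` and the illustrative parameters from the Section 1 sketch; the importance vector `imp` is a hypothetical example of, say, a legal-centric objective.

```python
import numpy as np
from scipy.optimize import minimize

imp = np.array([3.0, 1.0, 1.0])  # hypothetical importance: up-weight domain 0

def weighted_loss(w):
    # Importance-weighted objective sum_i imp_i * L_i(w); convexity is
    # preserved because the weights are non-negative.
    return imp @ domain_losses(w, D, E, A, alpha, C, gamma)

res = minimize(weighted_loss, np.full(m, 1.0 / m), method="SLSQP",
               bounds=[(0.0, 1.0)] * m,
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
print("specialized mixture:", np.round(res.x, 3))
```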
6. Limitations, Assumptions, and Open Challenges
Despite their efficacy, current scaling-law-based mixture optimization methods face substantive limitations:
- Dependence on held-out set alignment: Gains in validation PPL may not translate to robust downstream or out-of-domain results unless the evaluation distribution matches the deployment distribution (Li et al., 16 Aug 2025, Kang et al., 2 Oct 2025).
- Power-law regime and generalization: Fits operate in the scaling regime (large $N$, $D$); outside this regime, and for highly non-stationary or multimodal domain data, deviations may occur (Shukor et al., 12 Jul 2025, Li et al., 9 Mar 2026).
- Parameter re-fitting: All fitted parameters are architecture- and domain-set-specific, necessitating new small-scale pilots for novel scenarios (Shukor et al., 12 Jul 2025, Li et al., 9 Mar 2026).
- Computation and overfitting: Up-sampling of small domains, or excessive synthetic content, may risk overfitting; mixture optimization does not itself resolve this, but can interact with data augmentation (Li et al., 16 Aug 2025, Kang et al., 2 Oct 2025).
- Theoretical completeness: Most mixture-aware scaling laws are empirical interpolants; understanding higher-order interactions, nonsmooth transfer, or dynamic/online settings is ongoing work (Cao et al., 18 Mar 2026, Jiang et al., 2024).
7. Practical Recommendations and Pipeline Integration
Practitioners aiming to deploy mixture-aware scaling law pipelines are advised to:
- Fit per-domain scaling laws from small-scale mixture perturbations, holding all but one domain fixed, to robustly estimate power-law/exponential coefficients.
- Validate predicted mixtures against a small number of large-scale ablation experiments, confirming loss predictions and final performance.
- Use convex optimization tools that respect the simplex and positive-mass constraints when selecting the mixture vector $w$ (or jointly $(w, N, D)$).
- For domain-specialized models, aggregate or amplify loss terms corresponding to desired target capabilities. For general-purpose models, use uniform or Pile-like validation sets and forgo ad hoc domain tuning.
- Online or dynamic mixture adjustment can be implemented by continuously updating per-domain scaling law fits and re-optimizing mixture allocation, as in Adaptive Data Optimization (Jiang et al., 2024).
- Extend mixture laws to downstream metric prediction via, e.g., a logistic "loss-to-accuracy" mapping, for direct end-to-end optimization of task objectives (Li et al., 9 Mar 2026).
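For the last recommendation, a hedged sketch of one plausible logistic loss-to-accuracy mapping (the exact parameterization in Li et al., 9 Mar 2026 may differ): a sigmoid that saturates at a ceiling accuracy for low loss and floors at chance level, fit to (loss, accuracy) pairs from completed runs and then composed with the mixture law for end-to-end optimization. All numbers below are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_to_accuracy(loss, a, b, acc_max, acc_rand):
    """Logistic map from validation loss to task accuracy: saturates at
    acc_max for low loss, floors at the chance-level accuracy acc_rand."""
    return acc_rand + (acc_max - acc_rand) / (1.0 + np.exp(a * (loss - b)))

# Fit to (loss, accuracy) pairs from completed runs (values illustrative).
losses = np.array([2.6, 2.4, 2.2, 2.0, 1.9])
accs = np.array([0.31, 0.38, 0.49, 0.58, 0.62])
params, _ = curve_fit(loss_to_accuracy, losses, accs,
                      p0=[5.0, 2.2, 0.7, 0.25], maxfev=10000)
print("predicted accuracy at loss 1.8:", loss_to_accuracy(1.8, *params))
```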
Scaling law–driven data mixture optimization provides a rigorous foundation for principled, efficient, and automated data allocation in large-scale model pretraining and fine-tuning, encompassing diverse application domains, loss metrics, and model architectures. The framework has matured into a central pillar of contemporary foundation model development (Li et al., 16 Aug 2025, Shukor et al., 12 Jul 2025, Li et al., 9 Mar 2026, Gu et al., 2024).