
Data Mixture and Scaling in Machine Learning

Updated 2 June 2026
  • Data Mixture and Scaling is the process of selecting data-source proportions and adjusting model scale to maximize performance and compute efficiency in large-scale machine learning.
  • Scaling laws quantitatively relate model loss to mixture ratios, model size, and training tokens, enabling efficient optimization without exhaustive trial-and-error.
  • Advanced methods such as multi-fidelity Bayesian optimization, regression proxies, and convex relaxation provide practical strategies for cost-effective and transferable mixture selection.

Data mixture and scaling concern the principled selection and adjustment of data-source proportions ("mixtures") in large-scale machine learning, with particular emphasis on efficient strategies for model pretraining, fine-tuning, and cross-domain transfer. At stake are both predictive power and computational efficiency; naive trial-and-error is prohibitively expensive for modern multi-domain training regimes. Recent advances have formalized the data-mixture optimization problem and developed algorithmic, statistical, and scaling-law-based frameworks that render mixture selection both tractable and transferable across model scales.

1. Formal Problem Definition and Bayesian Optimization Frameworks

The core problem is to select mixture weights $\alpha = (\alpha_1, \ldots, \alpha_n) \in \Delta^n$ on the probability simplex, where each entry denotes the sampling fraction of one data source, together with model scale $s$ and training steps $t$, so as to optimize downstream model performance $f(\alpha, s, t)$ (e.g., minimize validation loss or maximize accuracy) under a compute cost constraint $c(s, t) \leq B$ (Yen et al., 26 Mar 2025). The performance function $f$ is generally unknown and expensive to evaluate at large scale.

Data mixture optimization is thus naturally cast as a sequential decision-making problem, often approached with multi-fidelity, multi-scale Bayesian optimization (MFMS-BO). In this method, evaluations at varying $(\alpha, s, t)$, ranging from cheap, low-fidelity proxies to expensive full-scale runs, are used to update a joint Gaussian process prior: $f \sim \mathcal{GP}\big(m((\alpha,s,t)),\, k((\alpha,s,t),(\alpha',s',t'))\big)$, where $m(\cdot)$ is a linear or constant mean function and $k(\cdot,\cdot)$ is a product kernel over mixture weights, model scale, and training steps. Acquisition functions such as Expected Improvement per Unit Cost drive efficient exploration of the mixture/scale/step landscape, enabling a rapid, compute-efficient "zoom-in" on promising mixtures at large scale (Yen et al., 26 Mar 2025).
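
As a rough illustration of the loop described above, the sketch below fits a constant-mean Gaussian process with a product kernel and selects the next run by expected improvement per unit cost. It is a minimal sketch under stated assumptions, not the MFMS-BO implementation of Yen et al. The RBF kernel factors and their length scales, the cost model c(s, t) = s·t, the use of the lowest observed loss at any fidelity as the incumbent, and the `toy_loss` objective are all hypothetical choices made for illustration.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def rbf(sq_dist, length_scale):
    return math.exp(-0.5 * sq_dist / length_scale**2)

def kernel(x1, x2, n):
    """Product kernel over mixture weights, log model scale, and log training steps."""
    d_alpha = sum((a - b) ** 2 for a, b in zip(x1[:n], x2[:n]))
    return (rbf(d_alpha, 0.3)
            * rbf((x1[n] - x2[n]) ** 2, 1.0)
            * rbf((x1[n + 1] - x2[n + 1]) ** 2, 1.0))

def posterior(x, x_obs, y_obs, n, noise=1e-4):
    """Posterior mean and standard deviation of a constant-mean GP at point x."""
    mean = sum(y_obs) / len(y_obs)
    gram = [[kernel(a, b, n) + (noise if i == j else 0.0)
             for j, b in enumerate(x_obs)] for i, a in enumerate(x_obs)]
    k_vec = [kernel(x, a, n) for a in x_obs]
    w = solve(gram, k_vec)                       # (K + noise I)^{-1} k_*
    mu = mean + sum(wi * (yi - mean) for wi, yi in zip(w, y_obs))
    var = 1.0 - sum(ki * wi for ki, wi in zip(k_vec, w))  # prior variance k(x, x) = 1
    return mu, math.sqrt(max(var, 1e-12))

def ei_per_cost(mu, sigma, best, cost):
    """Expected improvement for minimization, divided by the cost of the run."""
    imp = best - mu
    z = imp / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return max(imp * cdf + sigma * pdf, 0.0) / cost

def toy_loss(x, n):
    """Hypothetical stand-in for an actual training run."""
    target = [0.5, 0.3, 0.2]
    mix_term = sum((a - t) ** 2 for a, t in zip(x[:n], target))
    return mix_term + 1.0 / (1.0 + x[n]) + 1.0 / (1.0 + x[n + 1])

def run_cost(x, n):
    return math.exp(x[n] + x[n + 1])             # assumed cost model: s * t

rng = random.Random(0)
n_sources, budget = 3, math.exp(3.0)
candidates = []
for _ in range(60):
    g = [rng.gammavariate(1.0, 1.0) for _ in range(n_sources)]
    alpha = [v / sum(g) for v in g]              # uniform draw from the simplex
    candidates.append(alpha + [float(rng.randint(0, 2)), float(rng.randint(0, 2))])

observed = [0, 1, 2, 3]                          # initial design
for _ in range(6):
    x_obs = [candidates[i] for i in observed]
    y_obs = [toy_loss(x, n_sources) for x in x_obs]
    best = min(y_obs)
    scores = {}
    for i, x in enumerate(candidates):
        if i in observed or run_cost(x, n_sources) > budget:
            continue                             # skip repeats and over-budget runs
        mu, sigma = posterior(x, x_obs, y_obs, n_sources)
        scores[i] = ei_per_cost(mu, sigma, best, run_cost(x, n_sources))
    observed.append(max(scores, key=scores.get))
```

Dividing the acquisition value by the run cost is what biases the search toward cheap, low-fidelity evaluations until the surrogate is confident enough to justify an expensive one. A production setup would typically rely on an established Gaussian process library rather than the hand-written solver used here.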

2. Scaling Laws for Data Mixture Selection

A central development is the systematic derivation of scaling laws that quantitatively predict model loss as a function of mixture proportions, model size, and training tokens. These laws provide functional forms nn1 (with nn2 parameters, nn3 data tokens, nn4 mixture vector) whose parameters can be learned from small-scale pilot runs and then used to predict or optimize performance at large scales (Shukor et al., 12 Jul 2025, Ye et al., 2024, Sedova et al., 12 May 2026):

  • Additive/joint mixture scaling:

nn5

Domain-specific coefficients nn6 capture the heterogeneous returns on different sources. This law extrapolates reliably across scales and mixture regimes; optimal nn7 is found by differentiable simplex-constrained minimization (Shukor et al., 12 Jul 2025).

  • Repetition-aware mixture laws:

In low-resource or data-constrained settings, optimal mixtures depend crucially on the repetition factor nn8 with diminishing returns modeled via sublinear effective token budgets (Sedova et al., 12 May 2026).

  • Information scaling laws (InfoLaw): Validation loss is explained by the cumulative "information" contributed by each quality bucket, with strong scale- and repetition-dependent diminishing returns:

nn9

where Info is an explicit function of the mixture weights ss0, data quality, size, and model scale (Liu et al., 4 May 2026).

These laws can be calibrated on small runs and used to determine optimal mixture recipes under fixed compute, even for multi-billion parameter models (Shukor et al., 12 Jul 2025, Liu et al., 4 May 2026).
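
The calibrate-then-extrapolate workflow can be sketched with a deliberately simple stand-in. The code below assumes a pure power law in model size, L = A · N^(-a), fitted separately for each candidate mixture; this is not the functional form of any law cited above, and the mixture names, coefficients, and model sizes are hypothetical values used only to synthesize pilot data.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log L = log A - a * log N; returns (A, a)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return math.exp(y_bar - slope * x_bar), -slope

def predict(coeffs, n_params):
    amplitude, exponent = coeffs
    return amplitude * n_params ** (-exponent)

pilot_sizes = [1e7, 3e7, 1e8]                    # small-scale pilot runs
# Hypothetical "ground truth", used only to generate synthetic pilot losses.
true_laws = {"web_heavy": (20.0, 0.12), "code_heavy": (7.0, 0.07)}
pilot_losses = {name: [predict(law, n) for n in pilot_sizes]
                for name, law in true_laws.items()}

# Calibrate on the pilots, then compare mixtures at pilot and target scale.
fitted = {name: fit_power_law(pilot_sizes, losses)
          for name, losses in pilot_losses.items()}
best_small = min(fitted, key=lambda name: predict(fitted[name], 1e7))
best_large = min(fitted, key=lambda name: predict(fitted[name], 1e10))
print(best_small, best_large)
```

In this toy setup the mixture that wins at pilot scale is not the one predicted to win at the target scale, which is the situation a fitted law is meant to detect before the large run is launched.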

3. Algorithmic and Proxy-based Data Mixture Optimization

Several methodologies have been developed to enable compute-efficient search over combinatorial mixture spaces:

  • Regression proxies: Methods like RegMix train small models on sampled mixtures and fit a regression predictor ss1 to model mixture effects, enabling scalable search via proxy evaluations and robust mixture selection (Liu et al., 2024).
  • Convex relaxation (MixMin): In the large-model (Bayes) limit, the optimal mixture problem becomes convex in the data mixture vector; gradient-based algorithms on the simplex can efficiently find globally optimal mixtures, with scale invariance observed empirically (Thudi et al., 14 Feb 2025).
  • Model merging proxies: For both language and multimodal models, linearly merged experts trained on individual domains provide high-rank-correlation surrogates for the downstream performance of true mixture-trained models. Both DeMix and analogous multimodal DMO pipelines can evaluate millions of mixtures at proxy cost, decoupling search from expensive retraining (Li et al., 31 Jan 2026, Berasi et al., 4 Feb 2026).
  • Bayesian multi-fidelity optimization: The MFMS-BO framework (Gaussian process surrogate, multi-fidelity, multi-scale sampling, cost-sensitive acquisition) can match or outperform random/grid search and alternative Bayesian optimization baselines by ss2–ss3 in search efficiency (Yen et al., 26 Mar 2025).
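
The regression-proxy idea in the first bullet can be sketched as follows: sample mixtures, record a proxy loss for each, fit a regressor from mixture weights to loss, and rank a large pool of simulated mixtures by predicted loss. This is a sketch in the spirit of such methods rather than the RegMix implementation. The quadratic feature map, the least-squares fit, the averaging of the ten best candidates, and the `proxy_loss` function (which stands in for training a small model) are assumptions.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def features(alpha):
    """Linear and squared weights; no intercept is needed because weights sum to 1."""
    return list(alpha) + [w * w for w in alpha]

def sample_mixture(n_sources, rng):
    g = [rng.gammavariate(1.0, 1.0) for _ in range(n_sources)]
    total = sum(g)
    return [v / total for v in g]

def proxy_loss(alpha):
    """Hypothetical stand-in for the loss of a small proxy model."""
    target = (0.5, 0.3, 0.2)
    return 2.0 + sum((w - t) ** 2 for w, t in zip(alpha, target))

rng = random.Random(0)
train_mixes = [sample_mixture(3, rng) for _ in range(30)]
x = [features(mix) for mix in train_mixes]
y = [proxy_loss(mix) for mix in train_mixes]

# Ordinary least squares via the normal equations (X^T X) w = X^T y.
dim = len(x[0])
gram = [[sum(row[i] * row[j] for row in x) for j in range(dim)] for i in range(dim)]
rhs = [sum(row[i] * t for row, t in zip(x, y)) for i in range(dim)]
coef = solve(gram, rhs)

def predicted_loss(alpha):
    return sum(c * f for c, f in zip(coef, features(alpha)))

# Rank many simulated mixtures with the cheap predictor, then average the best few.
candidates = [sample_mixture(3, rng) for _ in range(5000)]
top = sorted(candidates, key=predicted_loss)[:10]
best_mix = [sum(mix[i] for mix in top) / len(top) for i in range(3)]
print(best_mix)
```

Only the 30 training mixtures require actual (proxy) training; the 5000 candidates are scored by the regressor alone, which is where the compute saving comes from.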

4. Mixture Effects: Empirical Behavior and Universal Trade-offs

Empirical investigations across LLM, vision, and multimodal settings have identified characteristic tradeoffs and mixture effects:

  • Scale dependence: Optimal mixtures are not static: weights that maximize performance at small scales differ from those at large scales. As model scale increases, the weights of domains such as knowledge-rich or general web text typically increase, while those of specialist or structured data may diminish (Yen et al., 26 Mar 2025, Li et al., 9 Mar 2026).
  • Diminishing returns and repetition: Under low-resource conditions, data repetition can be leveraged much more aggressively in mixtures (up to 15–20x) than in single-domain training due to regularization from generic data streams (Sedova et al., 12 May 2026). Mixture optimization must balance the value of repeated, scarce data against overfitting and diminishing marginal utility.
  • Negative transfer: Indiscriminately mixing heterogeneous or poorly aligned sources can induce negative transfer, especially in cross-embodiment or cross-modal settings (as in VLA robotics); careful matching and balancing by action space or sensor setup is crucial (Wang et al., 10 Feb 2026).
  • Synthetic/real mixtures: Mixtures with synthetic data display phase transitions: head knowledge is acquired rapidly (Phase 1), but tail generalization requires a threshold of real data to escape plateau regimes (Phases 2–3). For long-tailed domains, real data must form at least ss4–ss5% of the mixture to adequately learn rare knowledge (Wang et al., 17 Nov 2025, Kang et al., 2 Oct 2025).
  • Continual pretraining: Mixture scaling laws and the critical mixture ratio (CMR) framework reveal that the optimal domain-to-generic mix during continual pretraining scales as a simple power of total token budget, rising smoothly with data and model scale (Gu et al., 2024).
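
The repetition counts mentioned above follow from simple arithmetic once a mixture weight and a token budget are fixed. The sketch below assumes the repetition factor is the average number of passes over a source, i.e. the tokens drawn from it divided by its unique tokens; the token counts are hypothetical.

```python
def repetition_factor(weight, total_tokens, unique_tokens):
    """Average number of passes over a source sampled with the given weight."""
    return weight * total_tokens / unique_tokens

def max_weight_for_cap(cap, total_tokens, unique_tokens):
    """Largest sampling weight that keeps repetitions at or below `cap`."""
    return min(1.0, cap * unique_tokens / total_tokens)

total_tokens = 100e9       # hypothetical training budget
scarce_tokens = 0.5e9      # hypothetical unique tokens in the scarce domain

print(repetition_factor(0.10, total_tokens, scarce_tokens))
print(max_weight_for_cap(15, total_tokens, scarce_tokens))
```

With these numbers, a 10% weight on the scarce domain corresponds to roughly 20 passes over it, and capping repetitions at 15 limits its weight to about 7.5% of the mixture.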

5. Practical Prescriptions and Scaling Recipes

Predictive mixture-scaling laws enable a substantial reduction in required mixture-sweep compute. Key prescriptions include:

  • Use of scaling laws and proxy models: Fit a parametric law (e.g., additive or InfoLaw-style) using diverse, small-scale or proxy runs, then solve for optimal mixture weights ss6 under target-scale constraints via gradient or mirror-descent optimization on the simplex (Shukor et al., 12 Jul 2025, Liu et al., 4 May 2026, Ye et al., 2024).
  • Data repetition: For scarce-domain adaptation, calculate the target repetition factor ss7 via the fitted law, and select the highest ss8, ss9 compatible with the budget; mixture training can safely tolerate many more repetitions than single-source settings (Sedova et al., 12 May 2026).
  • Synthetic data: For LLM pretraining, mixtures with tt030% high-quality rephrased synthetic data plus 70% web data afford speedups of tt1–tt2 without model collapse; textbook-style synthetic should remain tt310–15% (Kang et al., 2 Oct 2025).
  • Fine-tuning with anchor loss: For transfer or continual learning, inject a modest fraction (tt41%) of pretraining or generic data into fine-tuning to virtually eliminate catastrophic forgetting, consistent across a wide range of model sizes and domains (Bethune et al., 9 Feb 2025).
  • Architecture and scaling-aware tuning: Capacity-aware mixture laws (e.g., CAMEL) allow extrapolation of mixtures from expert-trained MoE models to large dense or MoE targets, enabling efficient grid-free discovery of optimal mixtures for domain-specialized or balanced objectives (Li et al., 9 Mar 2026).
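
The mirror-descent step in the first prescription can be sketched with exponentiated-gradient updates, which keep the weights on the simplex by construction. The objective below, a sum of terms c_i / α_i with made-up coefficients, is a hypothetical stand-in for the predicted loss of a fitted law; it was chosen because its constrained minimizer, α_i ∝ sqrt(c_i), is known in closed form and can be used to check the result.

```python
import math

def mirror_descent_simplex(grad_fn, n_sources, step=0.01, iters=500):
    """Entropic mirror descent (exponentiated gradient) starting from uniform weights."""
    alpha = [1.0 / n_sources] * n_sources
    for _ in range(iters):
        grad = grad_fn(alpha)
        logits = [math.log(a) - step * g for a, g in zip(alpha, grad)]
        shift = max(logits)                      # subtract the max for numerical stability
        unnorm = [math.exp(v - shift) for v in logits]
        total = sum(unnorm)
        alpha = [u / total for u in unnorm]      # renormalize onto the simplex
    return alpha

coeffs = [4.0, 1.0, 1.0]                         # hypothetical fitted coefficients

def predicted_loss(alpha):
    return sum(c / a for c, a in zip(coeffs, alpha))

def predicted_loss_grad(alpha):
    return [-c / (a * a) for c, a in zip(coeffs, alpha)]

alpha_star = mirror_descent_simplex(predicted_loss_grad, len(coeffs))
print(alpha_star, predicted_loss(alpha_star))
```

For a real fitted law the gradient would usually come from automatic differentiation, and the step size and iteration count would need tuning; the values here are only adequate for this toy objective.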

6. Transferability and Extensions to Multimodal or Structured Data

Mixture-scaling frameworks are adaptable to structured prediction, multimodal pretraining, and settings with latent heterogeneity:

  • Multimodal and MoE systems: Progressive connector–expert–MoE pipelines (e.g., Uni-MoE) combine data mixture tuning with sparse activation for efficient scaling and generalization across tasks (Li et al., 2024). In multimodal SFT, model-merging proxies achieve near-optimal mixture selection with 10×–40× compute savings and robust ranking, generalizing to tt7 domains (Berasi et al., 4 Feb 2026).
  • High-dimensional mixture modeling: In domains requiring interpretable regression or classification on high-dimensional, heterogeneous data, scalable penalized joint mixture models (e.g., S-RJM) integrate feature reduction and sparsity with EM convergence guarantees (Lartigue et al., 2022).
  • Label-switching and Bayesian mixture models: For scalable Bayesian inference, minimum-variance relabelling or allocation-space algorithms solve the label-switching problem efficiently in high-tt8 or high-tt9 settings (Zhu et al., 2014).

7. Limitations, Assumptions, and Open Challenges

Current mixture optimization and scaling approaches offer substantial gains but also rely on several assumptions:

  • All empirical scaling-law frameworks assume performance monotonicity and local smoothness of the loss surface in mixture space; rare, highly heterogeneous domains may violate these prerequisites (Li et al., 9 Mar 2026).
  • Data repetition laws require large enough generic streams for regularization. With highly related or minuscule target datasets, approximations may break down and require correction (Sedova et al., 12 May 2026).
  • Synthetic/real mixture regimes depend on class coverage; deep long-tail distributions need carefully managed real-data ratios to avoid stagnation or collapse (Wang et al., 17 Nov 2025).
  • Most frameworks do not currently handle dynamic (time-varying) mixtures, curriculum learning, or adaptive scheduling. Integrating domain relatedness, task-specific or personalized mixture tuning, and hyperparameter–mixture interactions remains an active area for extension (Shukor et al., 12 Jul 2025, Ye et al., 2024).
