Scaling Laws for Optimal Data Mixtures (2507.09404v1)

Published 12 Jul 2025 in cs.LG

Abstract: Large foundation models are typically trained on data from multiple domains, with the data mixture--the proportion of each domain used--playing a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach accurately predicts the loss of a model of size $N$ trained with $D$ tokens and a specific domain weight vector $h$. We validate the universality of these scaling laws by demonstrating their predictive power in three distinct and large-scale settings: LLM, native multimodal model (NMM), and large vision models (LVM) pretraining. We further show that these scaling laws can extrapolate to new data mixtures and across scales: their parameters can be accurately estimated using a few small-scale training runs, and used to estimate the performance at larger scales and unseen domain weights. The scaling laws allow to derive the optimal domain weights for any target domain under a given training budget ($N$,$D$), providing a principled alternative to costly trial-and-error methods.

Summary

  • The paper introduces new scaling laws that systematically predict the optimal data mixture using model size, tokens, and domain weights.
  • It extends traditional formulations with both additive and joint models, accurately forecasting losses in large language, vision, and multimodal models.
  • The approach validates its effectiveness through low error rates and practical optimization, reducing reliance on trial-and-error in mixture selection.

Scaling Laws for Data Mixture Optimization

This paper introduces a novel approach to determining the optimal data mixture for training large foundation models across different modalities. Recognizing that the standard trial-and-error method for selecting data mixtures becomes impractical at scale, the authors propose a systematic method based on scaling laws. This method accurately predicts the loss of a model as a function of its size ($N$), the number of training tokens ($D$), and the domain weight vector ($h$). The key innovation is extending traditional scaling laws to explicitly model the impact of domain weights on model performance.

The authors validate the universality of their scaling laws in three distinct large-scale settings: LLMs, native multimodal models (NMMs), and large vision models (LVMs). They demonstrate that these scaling laws can extrapolate to new data mixtures and across scales, estimating parameters using a few small-scale training runs and predicting performance at larger scales and unseen domain weights. This approach offers a principled alternative to costly trial-and-error methods, enabling the derivation of optimal domain weights for any target domain under a given training budget ($N$, $D$).

The paper formulates the problem of training models with data from $k$ domains, aiming to predict the loss on a target domain $\mathcal{D}_T$ after training a model of size $N$ with $D$ tokens using domain weights $h$. Two scaling law formulations are proposed:

  1. Additive Scaling Law: This law models only the bias term ($E^h$) as a function of $h$, while the other parameters ($A^h, \alpha^h, B^h, \beta^h$) are constants. The formula is:

    $$\mathcal{L} = E + \frac{1}{\sum_{i=1}^{k} C_i h_i^{\gamma_i}} + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

    This law has $5 + 2k$ parameters, and the optimal domain weights are independent of the model size $N$ and the number of tokens $D$.

  2. Joint Scaling Law: This law additionally models the terms $A^h$ and $B^h$ as functions of $h$, capturing the interaction between scale and mixture. The formula is:

    $$\mathcal{L} = E + \frac{1}{\sum_{i=1}^{k} C_i h_i^{\gamma_i}} + \frac{A^h}{N^{\alpha}} + \frac{B^h}{D^{\beta}}, \quad \text{with } A^h = \Big(\sum_{i=1}^{k} C^A_i h_i\Big)^{\gamma^A} \text{ and } B^h = \Big(\sum_{i=1}^{k} C^B_i h_i\Big)^{\gamma^B}$$

    This scaling law has $5 + 4k$ parameters and predicts that the contribution of $N$ and $D$ to the loss depends on the domain weights, making the optimal domain weights compute-dependent. Both formulations are sketched in code after this list.
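To make the two formulations concrete, below is a minimal sketch of both laws as plain Python functions. The function names and argument layout are illustrative assumptions, not the authors' code; the parameters must come from fitting.

```python
import numpy as np

def additive_loss(N, D, h, E, A, alpha, B, beta, C, gamma):
    """Additive scaling law: only the bias term depends on the domain weights h.

    h, C, gamma are length-k arrays; the remaining arguments are scalars,
    giving 5 + 2k parameters in total.
    """
    h, C, gamma = map(np.asarray, (h, C, gamma))
    bias = 1.0 / np.sum(C * h ** gamma)          # h-dependent bias term
    return E + bias + A / N ** alpha + B / D ** beta

def joint_loss(N, D, h, E, alpha, beta, C, gamma, C_A, gamma_A, C_B, gamma_B):
    """Joint scaling law: the N- and D-scaling coefficients also depend on h,
    giving 5 + 4k parameters and compute-dependent optimal weights."""
    h, C, gamma, C_A, C_B = map(np.asarray, (h, C, gamma, C_A, C_B))
    bias = 1.0 / np.sum(C * h ** gamma)
    A_h = np.sum(C_A * h) ** gamma_A             # A^h = (sum_i C^A_i h_i)^{gamma^A}
    B_h = np.sum(C_B * h) ** gamma_B             # B^h = (sum_i C^B_i h_i)^{gamma^B}
    return E + bias + A_h / N ** alpha + B_h / D ** beta
```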

The authors use the Huber loss to fit the scaling laws, employing a random search and the Basin-hopping algorithm for optimization. The Mean Relative Error (MRE) is used to evaluate the scaling laws by comparing predicted losses against actual losses on a new set of runs with different $(N, D, h)$ values.
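The fitting loop can be sketched as follows for the additive law; the toy runs, Huber delta, random-search ranges, and optimizer settings are placeholder assumptions rather than values from the paper.

```python
import numpy as np
from scipy.optimize import basinhopping

# Toy observations: each run is (N, D, h, observed_loss); k = 3 domains here.
runs = [
    (1e8, 2e9, np.array([0.5, 0.3, 0.2]), 3.10),
    (3e8, 6e9, np.array([0.2, 0.5, 0.3]), 2.85),
    # ... more small-scale runs ...
]

def huber(r, delta=1e-3):
    """Huber loss on a residual r (the delta value is an assumption)."""
    r = np.abs(r)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

def unpack(theta, k):
    """theta -> (E, A, alpha, B, beta, C, gamma) for the additive law."""
    E, A, alpha, B, beta = theta[:5]
    return E, A, alpha, B, beta, theta[5:5 + k], theta[5 + k:5 + 2 * k]

def objective(theta, k=3):
    E, A, alpha, B, beta, C, gamma = unpack(theta, k)
    total = 0.0
    for N, D, h, obs in runs:
        pred = E + 1.0 / np.sum(C * h**gamma) + A / N**alpha + B / D**beta
        total += huber(pred - obs)
    return total

# Random search for a starting point, then Basin-hopping to refine it.
rng = np.random.default_rng(0)
candidates = [rng.uniform(0.1, 2.0, size=5 + 2 * 3) for _ in range(200)]
x0 = min(candidates, key=objective)
result = basinhopping(objective, x0, niter=50,
                      minimizer_kwargs={"method": "L-BFGS-B"})
print(result.x)  # fitted scaling-law parameters

# Evaluation on held-out runs would then use the Mean Relative Error:
# MRE = mean(|predicted - observed| / observed)
```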

The experimental setup involves pretraining LLMs, NMMs, and LVMs on diverse datasets. For LLMs, the authors use the $k=7$ domains from SlimPajama. For NMMs, they train on a mixture of text-only data, interleaved multimodal documents, and paired image-caption datasets ($k=3$). For LVMs, they use a mixture of paired image-caption datasets drawn from four domains ($k=4$).

The paper presents strong numerical results demonstrating the effectiveness of the proposed scaling laws. The key findings include:

  • Accurate Extrapolation: The scaling laws accurately capture the training data and generalize effectively to larger scales with significantly increased values of $N$ and $D$. Predicted losses closely align with observed losses for both the joint and additive laws, with good extrapolation to larger model sizes. The MRE (%) is consistently low for both laws, with the joint law improving on the additive law.
  • Optimal Domain Weights Estimation: The fitted scaling laws enable accurate estimation of optimal domain weights by solving an optimization problem on the simplex (sketched after this list). Models trained with these optimized mixtures consistently outperform alternatives, including uniform mixtures and those used in prior works.
  • Practical Mixture Estimation: The authors demonstrate that the scaling laws can be accurately fitted with small-scale runs, and then used to solve for optimal domain weights, providing a principled approach to mixture estimation compared to ad-hoc methods.
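As one illustration of that simplex optimization, the sketch below minimizes a fitted joint law over the domain weights for a fixed budget; all parameter values are placeholders, not fitted values from the paper.

```python
import numpy as np
from scipy.optimize import minimize

# Fitted joint-law parameters for k = 3 domains (placeholder values).
E, alpha, beta = 1.8, 0.34, 0.28
C, gamma = np.array([1.2, 0.8, 0.5]), np.array([0.6, 0.4, 0.3])
C_A, gamma_A = np.array([350.0, 420.0, 510.0]), 1.1
C_B, gamma_B = np.array([800.0, 950.0, 700.0]), 0.9

def predicted_loss(h, N, D):
    """Joint scaling-law prediction for domain weights h at budget (N, D)."""
    A_h = np.sum(C_A * h) ** gamma_A
    B_h = np.sum(C_B * h) ** gamma_B
    return E + 1.0 / np.sum(C * h**gamma) + A_h / N**alpha + B_h / D**beta

def optimal_weights(N, D, k=3):
    """Minimize the predicted loss over the probability simplex."""
    x0 = np.full(k, 1.0 / k)  # start from the uniform mixture
    res = minimize(predicted_loss, x0, args=(N, D), method="SLSQP",
                   bounds=[(1e-6, 1.0)] * k,
                   constraints=[{"type": "eq", "fun": lambda h: np.sum(h) - 1.0}])
    return res.x

# Because A^h and B^h depend on h, the optimum shifts with the training budget.
print(optimal_weights(N=1e8, D=2e9))
print(optimal_weights(N=1e10, D=2e11))
```

Because $A^h$ and $B^h$ depend on $h$ under the joint law, the returned mixture shifts with $(N, D)$; running the same routine with the additive law would give budget-independent weights.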

The paper also includes an analysis of the scaling laws, exploring the number of runs needed for accurate fitting, the behavior of optimal domain weights when scaling FLOPs, and the validity of the laws with cosine learning rate schedules. The authors find that only 10-20 runs are needed to fit the scaling laws accurately and that the optimal mixture evolves as a function of the compute budget.

The presented work has significant implications for the field of AI, particularly in the training of large foundation models. By providing a systematic and scalable method for determining optimal data mixtures, this research addresses a critical bottleneck in model development. This approach has practical benefits, such as reduced computational costs and improved model performance, and theoretical implications, such as a deeper understanding of the relationship between data composition and model behavior.

Future developments in this area could explore:

  • Extending the scaling laws to continual pretraining and finetuning scenarios.
  • Predicting downstream task performance directly, rather than relying on generic target loss.
  • Accounting for data repetition and dynamic evolution of domain weights during training.
  • Incorporating additional factors, such as data quality and diversity, into the scaling laws.