Capacity-Aware Mixture Law Enables Efficient LLM Data Optimization
Abstract: A data mixture refers to how different data sources are combined to train LLMs, and selecting an effective mixture is crucial for optimal downstream performance. Existing methods either conduct costly searches directly on the target model or rely on mixture scaling laws that fail to extrapolate well to large model sizes. We address these limitations by introducing a compute-efficient pipeline for data mixture scaling. First, we propose CAMEL, a capacity-aware mixture law that models validation loss with the nonlinear interplay between model size and mixture. We also introduce a loss-to-benchmark prediction law that estimates benchmark accuracy from validation loss, enabling end-to-end performance prediction for the target model. Next, we study how to allocate a fixed compute budget across model scales to fit the law and reduce prediction error. Finally, we apply our method to Mixture-of-Experts models with up to 7B-A150M parameters to fit the law, and verify the optimal mixture derived from the law by extrapolating to a 55B-A1.2B target model. Compared to prior methods, our method reduces mixture optimization costs by 50% and improves downstream benchmark performance by up to 3%.
Explain it Like I'm 14
Overview
This paper is about teaching LLMs more effectively by choosing the right “mix” of training data. Think of training data like ingredients in a recipe: English text, Chinese text, code, math problems, and knowledge-heavy materials (like textbooks). The authors introduce a new method, called CAMEL (Capacity-Aware Mixture Law), that helps choose the best mix of these ingredients for a model of a given size—without spending tons of computing power trying every possibility.
What questions did the researchers ask?
They focused on three simple questions:
- How can we predict the best data mix for a model, especially when the model gets bigger?
- Can we link “practice test performance” (validation loss) to “real test performance” (benchmark scores) in a reliable way?
- With limited computing power, where should we spend our efforts—on small models, big models, or somewhere in between—to learn the best data mix most efficiently?
How did they try to answer them?
They built a practical, compute-efficient pipeline with three key ideas. Here they are in everyday terms:
- Capacity-aware rule for data mixing (CAMEL)
- Analogy: Imagine your model is a team with limited time (capacity). It needs to split that time across different subjects (English, Chinese, code, math, knowledge). As the team grows (bigger model), how it splits time should change—but not in a simple, proportional way. Some subjects benefit more than others as the team scales.
- The authors create a mathematical “rule of thumb” (a scaling law) that predicts how the model’s “practice mistakes” (validation loss) change depending on:
- the mix of training data (how much of each subject you include), and
- the model’s size (how big the team is).
- Unlike earlier methods that treated “data mix” and “model size” separately, CAMEL combines them, because they affect each other in tricky ways.
- From practice mistakes to real scores
- Validation loss = how far off the model's predictions are on held-out text it was not trained on (like a practice test).
- Benchmark accuracy = how well it does on real-world tests (like MMLU, ARC, HumanEval).
- They learn a simple, smooth mapping (a logistic curve) that turns several validation losses (across different validation sets) into predicted benchmark scores. This lets them predict actual scores for any given data mix and model size.
- Smarter use of compute: the “hourglass” sampling strategy
- Training every mix on large models is too expensive. So they ask: how do we split our experiments across small, medium, and large models to learn the most?
- They test different strategies and find “hourglass” works best: focus on the smallest and biggest models, and do fewer runs in the middle. This gives the most accurate predictions for the least compute.
To put it all together, their approach is:
- Train several smaller models on different data mixes.
- Fit the CAMEL law (to predict validation loss from mix + model size).
- Fit the loss-to-benchmark mapping (to turn losses into scores).
- Use both to predict the best data mix for a much larger target model—without fully training that big model on many mixes.
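In symbols, the steps above can be sketched as follows. The mixture r, model size M, probability simplex, benchmark-proxy loss and task weights w follow the notation used elsewhere on this page; the per-loss predictor f_j and the specific logistic parameterization (a_b, c_b, and the accuracy floor and ceiling lo_b, hi_b) are assumptions made here for illustration and may differ from the paper's exact form.

```latex
% Step 2: a mixture law predicts each validation loss from mixture r and model size M
\hat{L}_j = f_j(r, M), \qquad r \in \Delta^{n-1}

% Step 3 (assumed parameterization): combine losses into a proxy, then map it to accuracy
L_b = k_b^{\top} \hat{L}, \qquad
\widehat{\mathrm{acc}}_b(r, M) = \mathrm{lo}_b + \frac{\mathrm{hi}_b - \mathrm{lo}_b}{1 + \exp\!\big(a_b\,(L_b - c_b)\big)}, \quad a_b > 0

% Step 4: pick the mixture that maximizes the weighted predicted score at the target size
r^{*}(M; w) = \arg\max_{r \in \Delta^{n-1}} \sum_{b} w_b \, \widehat{\mathrm{acc}}_b(r, M)
```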
What did they find, and why is it important?
Key findings:
- More accurate predictions with CAMEL
- CAMEL predicts validation loss better than previous methods (DML and SODM), especially when you don’t have many experiments. Better prediction means you can trust it to choose a good data mix without brute-force searching.
- Accurate benchmark predictions from loss
- Their loss-to-benchmark mapping predicts real test scores well across many benchmarks. This means “practice mistakes” really can be used to forecast “exam scores,” if you combine the information properly.
- The hourglass strategy saves compute
- When computing power is limited (which it usually is), prioritizing very small and very large models reduces prediction error more than evenly spreading efforts across all model sizes.
- Real-world wins on big models
- They fit CAMEL on Mixture-of-Experts models with up to 7B-A150M parameters, then checked the mixture it predicted on a much larger 55B-A1.2B target model.
- Result: they found high-quality data mixes using less than the compute of one full training pass on the target model.
- Compared to prior methods, they cut mixture-search costs by about 50% and improved benchmark performance by up to 3%.
- How the best mix changes with model size
- As models get bigger, the best mix shifts: they suggest giving more weight to knowledge-rich data and relatively less to math and code. This offers a simple rule of thumb for future training.
Why this matters:
- Training huge models is extremely expensive. If you can pick a better data mix with half the compute and get better scores, that’s a big deal for both research labs and companies.
- A clearer link from validation loss to benchmark results helps teams make decisions earlier and cheaper in the training process.
What’s the potential impact?
- Faster, cheaper model improvement
- Teams can explore and choose good data recipes for large models without exhaustive, costly trials, saving money and time.
- Better generalization
- The mixtures found by CAMEL improved not just on the target benchmarks but also on unseen tests, suggesting the method avoids overfitting to narrow goals.
- Practical guidance for scaling
- The observed trend—larger models benefit more from knowledge-heavy data—can guide how to allocate training data as models grow.
- Foundation for future tools
- The framework can be extended with new formulas or even non-parametric models, and it sets the stage for adaptive strategies that decide, on the fly, which model sizes and data mixes to try next.
In short, CAMEL offers a smarter way to mix training data that adapts to the size of the model. It predicts real test performance from cheaper signals, uses compute wisely, and delivers better results with less effort.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise list of what remains missing, uncertain, or unexplored in the paper; each item is framed to be directly actionable for future research.
- Generalization across architectures: validate CAMEL on dense LLMs and alternative MoE designs (different expert counts, routing/topologies) to test whether the law and derived mixtures transfer beyond DeepSeek-V3-style MoE.
- Extrapolation limits: assess CAMEL’s accuracy when extrapolating to substantially larger models than 55B-A1.2B and to smaller target models; characterize the regime where extrapolation breaks down.
- Compute metric alignment: replace/augment “activated parameters” with FLOPs- or tokens-based compute to ensure fair comparisons across dense and sparse models; study how the law changes under different compute definitions.
- Missing training-steps dimension: extend the law to jointly model mixture, model size, and training steps/tokens (tri-variate law) and compare against BiMix-style formulations that explicitly include training progression.
- Assumption validity (near-homogeneous exponents): empirically test Assumption 2.1 (similar learning-difficulty exponents across intrinsic domains) across more heterogeneous domains (dialogue, safety, multimodal text) and quantify violations’ impact on prediction error.
- Assumption validity (stable intrinsic-domain weights): stress-test the “stable intrinsic-domain weights” assumption under mixtures with highly skewed or long-tailed domain ratios and in settings with many more domains.
- Intrinsic-domain identification: develop and evaluate methods to estimate the number of intrinsic domains k and the domain-profile matrix t_ij from data (e.g., unsupervised clustering or topic modeling), rather than assuming them.
- Identifiability and parameter degeneracy: analyze whether CAMEL’s parameters (a_i, B_i, K_i) are uniquely identifiable from available data and quantify equivalence classes or degeneracies that could hinder interpretability or optimization stability.
- Cross-domain interaction effects: move beyond mixture-weighted additive losses to model synergies/interference between domains (e.g., Math improving Code) and evaluate whether such interactions materially improve predictions.
- Static vs time-varying mixtures: investigate schedule-based mixture policies (time-varying r over training) and compare to CAMEL’s static mixture optimization for mid-training.
- Robustness to training recipes: test sensitivity to optimizer (e.g., AdamW vs Muon), LR schedules, batch sizes, precision, dropout, and data ordering to ensure the law is not recipe-specific.
- Dependence on “reference recipe” neighborhood: quantify how far CAMEL can generalize beyond the ±20% per-domain perturbations around the human-designed baseline; evaluate on mixtures far outside that neighborhood.
- Sampling-strategy optimality: provide a formal analysis of when and why the hourglass allocation is optimal (noise models, bias-variance trade-offs, cost functions), and compare with adaptive/active designs (e.g., sequential Bayesian experimental design).
- Fairness of baseline comparison: run baselines (DML, SODM) under the hourglass sampling strategy to isolate gains due to the law vs due to sampling; include ablations controlling for identical sampling budgets and point placements.
- Uncertainty quantification: report confidence intervals, seed variance, and sensitivity analyses for both loss prediction and benchmark accuracy predictions; incorporate uncertainty into mixture optimization (e.g., robust or risk-averse objectives).
- Loss-to-benchmark mapping robustness: evaluate the logistic mapping under out-of-distribution checkpoints, across families of benchmarks, and with stricter train/validation separation across training runs to mitigate correlated checkpoint leakage.
- Alternative mappings and targets: compare the logistic link to other forms (e.g., probit, piecewise-linear, monotone splines) and extend to non-accuracy metrics (F1, calibration, pass@k for code, chain-of-thought consistency).
- Dimensionality reduction to L_b: quantify information loss when compressing multi-loss vectors L into the single proxy L_b = k^T L; compare against multi-output or multi-task mixture laws that retain the full loss vector.
- Safety, bias, and toxicity impacts: measure how mixture optimization affects safety benchmarks, toxicity rates, and demographic biases; add constraints or regularizers to prevent harmful trade-offs.
- Multi-lingual coverage: extend experiments beyond English/Chinese to more languages (low-resource, morphologically rich) and study whether CAMEL’s assumptions and gains hold in highly multilingual settings.
- Data quality within domains: incorporate quality signals (deduplication, contamination, heuristic “quality” scores) into the law to understand whether CAMEL should weight high-quality sub-corpora differently within a domain.
- Domain overlap and leakage: quantify overlap and cross-contamination between domains (e.g., “Knowledge” vs English/Chinese) and study how overlap biases t_ij estimation and downstream predictions.
- Constraints in mixture optimization: analyze whether the optimizer tends to push r to simplex vertices (extreme mixtures), and explore regularization or constraints (e.g., minimum coverage per domain) to avoid brittle solutions.
- Sample complexity: characterize how many mixtures per scale are needed to fit CAMEL to a target error level; derive bounds or empirical curves for error vs number of (M, r) evaluations.
- Generalization to pretraining-from-scratch and continual pretraining: test CAMEL outside mid-training (early pretraining phases and long-horizon continual pretraining) and compare to domain-specific continual scaling laws (e.g., D-CPT/CMR).
- Production relevance and real-world tasks: evaluate whether CAMEL-optimized mixtures improve performance on product-facing tasks (retrieval-augmented QA, long-context tasks, tool use), not just academic benchmarks.
- Tokenizer/vocabulary effects: study sensitivity to tokenization differences (e.g., unigram vs BPE vs sentencepiece), especially for multilingual and code domains.
- Routing/capacity allocation in MoE: directly measure how expert routing and capacity allocation change with mixture r and model scale M, and validate CAMEL’s capacity-allocation hypothesis with routing statistics.
- Weight selection for multi-benchmark objective: analyze sensitivity to user-defined weights w (Pareto fronts, multi-objective optimization) and provide principled methods to select or learn w.
- Reproducibility and release: provide code, fitted parameters, and mixture recipes to enable independent replication and downstream use; include standardized cost accounting for “<1× full training pass” claims.
Practical Applications
Immediate Applications
- Capacity-aware data mixture optimizer for mid-training LLMs
- What it does: Fits CAMEL on a ladder of small-to-medium models, then solves for the target model’s optimal domain weights to maximize a weighted benchmark objective. The paper reports a 50% reduction in mixture-optimization cost and up to 3% higher downstream benchmark performance compared to prior methods.
- Sectors: Software/AI labs, cloud AI providers, foundation-model startups.
- Tools/workflows:
- Implement CAMEL’s mixture-to-loss fit and the loss-to-benchmark logistic mapping.
- Use the “hourglass” sampling plan to choose which (model size, mixture) runs to train under a fixed budget.
- Automate a solve step r*(M; w) for target size M and task weights w, then run a single mid-training “cooldown” pass with the chosen mixture.
- Assumptions/dependencies:
- Access to multi-domain training corpora and a set of validation datasets that correlate with target benchmarks.
- The law’s assumptions (near-homogeneous learning exponents, mild variation of induced domain weights) hold well enough for your domains.
- Best validated so far on MoE-style architectures and five-domain mid-training; dense or very different domains may require refitting/validation.
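A minimal sketch of the solve step r*(M; w), assuming a fitted predictor is already available as a Python function of the mixture. The softmax parameterization, the finite-difference gradient ascent, and `toy_score` (a stand-in for the weighted predicted benchmark score at the target size) are illustrative assumptions, not the paper's optimizer or fitted law.

```python
import math


def softmax(z):
    """Map unconstrained logits to a point on the probability simplex."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def optimize_mixture(score, n_domains, steps=2000, lr=2.0, eps=1e-4):
    """Maximize score(r) over mixtures r (nonnegative, summing to one).

    The mixture is parameterized as r = softmax(z), so every iterate is a
    valid mixture. Gradients with respect to z use central differences.
    """
    z = [0.0] * n_domains  # start from the uniform mixture
    for _ in range(steps):
        grad = []
        for i in range(n_domains):
            z_plus = z[:]
            z_plus[i] += eps
            z_minus = z[:]
            z_minus[i] -= eps
            diff = score(softmax(z_plus)) - score(softmax(z_minus))
            grad.append(diff / (2 * eps))
        z = [zi + lr * gi for zi, gi in zip(z, grad)]
    return softmax(z)


# Hypothetical surrogate: peaks at a made-up "best" five-domain mixture.
# In practice this would be the weighted sum of predicted benchmark scores.
TOY_BEST = [0.30, 0.20, 0.15, 0.10, 0.25]


def toy_score(r):
    return -sum((ri - ti) ** 2 for ri, ti in zip(r, TOY_BEST))


r_star = optimize_mixture(toy_score, n_domains=5)
print([round(x, 3) for x in r_star])
```

Parameterizing through softmax keeps the search unconstrained while every candidate stays a valid mixture; a constrained solver would be a reasonable alternative when per-domain minimums are required.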
- Compute-aware experiment planner for mixture search
- What it does: Replaces rectangular sampling with the paper’s “hourglass” allocation across scales (emphasize smallest and largest models). Reduces extrapolation error for a fixed compute budget.
- Sectors: MLOps, software/AI research.
- Tools/workflows: Scheduler that allocates GPU-hours across a model ladder and enumerates small, diverse mixtures at extremes before filling mid-scales.
- Assumptions/dependencies: Requires ability to spin up models at multiple scales and to track per-run costs; savings depend on disciplined adherence to the allocation plan.
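A rough sketch of such a scheduler, under stated assumptions. The ordering rule (both extremes first, then toward the center) follows the paper's description of hourglass sampling; the budget rule in `plan_runs` (later passes visit fewer central scales) and the example costs are hypothetical choices for illustration, not the paper's allocation.

```python
def hourglass_order(n_scales):
    """Return scale indices ordered extremes-first, then toward the center."""
    lo, hi = 0, n_scales - 1
    order = []
    while lo <= hi:
        order.append(lo)
        if hi != lo:
            order.append(hi)
        lo += 1
        hi -= 1
    return order


def plan_runs(costs, budget):
    """Assign training runs to model scales under a fixed compute budget.

    costs[i] is the (positive) cost of one run at scale i, in the same unit
    as budget; scales are sorted from smallest to largest model.
    Returns (runs_per_scale, unspent_budget).
    """
    order = hourglass_order(len(costs))
    runs = [0] * len(costs)
    remaining = budget
    n_pass = 0
    while True:
        # Later passes visit a shorter prefix of the hourglass order,
        # so scales near the center receive fewer runs than the extremes.
        prefix = order[: max(2, len(order) - n_pass)]
        placed = False
        for i in prefix:
            if costs[i] <= remaining:
                runs[i] += 1
                remaining -= costs[i]
                placed = True
        if not placed:
            break
        n_pass += 1
    return runs, remaining


# Hypothetical ladder: five scales whose per-run cost doubles each step.
runs, unspent = plan_runs(costs=[1, 2, 4, 8, 16], budget=100)
print(hourglass_order(5))  # [0, 4, 1, 3, 2]
print(runs, unspent)
```

Each run slot returned here would still need a concrete mixture assigned to it; that choice is left out of the sketch.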
- Loss-to-benchmark proxy for early selection and stopping
- What it does: Uses the multi-loss logistic mapping to predict benchmark accuracy from validation losses during training, enabling early stopping, checkpoint selection, and mixture triage without full eval sweeps.
- Sectors: Software/AI labs, cloud AI providers.
- Tools/workflows:
- Add lightweight loss probes (e.g., losses on code, math, knowledge validation sets).
- Fit the benchmark mapping once, then score checkpoints continuously.
- Assumptions/dependencies: Mapping quality depends on coverage of validation losses and their stability; mis-specified or too few validation sets can degrade predictions.
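As an illustration of checkpoint scoring, the sketch below applies a logistic map to a weighted sum of validation losses and picks the checkpoint with the highest predicted accuracy. The functional form, the parameter names (`k`, `a`, `c`, `lo`, `hi`) and all numbers are assumptions standing in for values that would be fitted once on real runs.

```python
import math


def predict_accuracy(losses, k, a, c, lo=0.25, hi=1.0):
    """Logistic map from a vector of validation losses to predicted accuracy.

    proxy = sum_j k[j] * losses[j]; a lower proxy loss gives higher accuracy.
    lo and hi are the assumed floor (e.g. chance level) and ceiling.
    """
    proxy = sum(kj * lj for kj, lj in zip(k, losses))
    return lo + (hi - lo) / (1.0 + math.exp(a * (proxy - c)))


def best_checkpoint(checkpoints, **params):
    """Return the checkpoint name with the highest predicted accuracy."""
    return max(
        checkpoints,
        key=lambda name: predict_accuracy(checkpoints[name], **params),
    )


# Hypothetical losses on (code, math, knowledge) validation sets.
checkpoints = {
    "step_1000": [2.10, 1.80, 2.40],
    "step_2000": [1.90, 1.60, 2.20],
    "step_3000": [1.95, 1.50, 2.30],
}
params = dict(k=[0.5, 0.3, 0.2], a=4.0, c=2.0)  # placeholder "fitted" values

for name, losses in checkpoints.items():
    print(name, round(predict_accuracy(losses, **params), 3))
print("selected:", best_checkpoint(checkpoints, **params))
```

Because the map is monotone in the proxy loss, selection reduces to picking the lowest weighted loss for a single benchmark; the logistic step matters once several benchmarks with different weights and thresholds are combined.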
- Domain-specialized mixture rebalancing for vertical LLMs
- What it does: For healthcare, legal, finance, etc., learn mixtures that optimally trade off general knowledge vs. domain corpora for a given target size.
- Sectors: Healthcare, legal, finance, education.
- Tools/workflows:
- Healthcare: Mix biomedical papers (e.g., PubMed), clinical-style text (synthetic or de-identified), general knowledge; validate with losses tied to MedQA/clinical QA and map to benchmarks.
- Legal: Mix statutes, case law, treatises; validate with losses tied to legal QA/reasoning tasks.
- Finance: Mix filings, research, news; validate toward financial QA, math reasoning.
- Assumptions/dependencies: High-quality, licensed domain data; benchmark proxies must represent target tasks; access controls and compliance for sensitive text.
- Multilingual mixture tuning aligned to target scale and markets
- What it does: Adjust ratios across languages and knowledge vs. reasoning as model size grows (paper observes knowledge weighting growing with size).
- Sectors: Global product localization, public-sector NLP.
- Tools/workflows: Fit CAMEL across language buckets (e.g., EN, ZH, low-resource groups) and adjust per target deployment region.
- Assumptions/dependencies: Representative validation losses per language; careful handling of low-resource languages to avoid overfitting limited data.
- Benchmark-aware mid-training curricula
- What it does: Translate r*(M; w) into a short “cooldown” schedule (e.g., last 10–20B tokens) to sharpen skills needed for deployment tasks.
- Sectors: Software/AI, edtech, code assistants.
- Tools/workflows: Create a terminal curriculum that upweights math/code/knowledge in the final phase based on the optimized mixture.
- Assumptions/dependencies: Requires that final-phase reweighting materially affects downstream metrics for the target scale, as observed in mid-training regimes.
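Turning an optimized mixture into a cooldown plan is mostly arithmetic: multiply each domain weight by the phase's token budget. The sketch below does this with largest-remainder rounding so the integer counts add up exactly. The domain names mirror the five domains described on this page, but the weights are made up, and the 20B budget is used only as an example size.

```python
def tokens_per_domain(mixture, total_tokens):
    """Split a token budget across domains in proportion to mixture weights.

    Uses largest-remainder rounding so the integer counts sum exactly to
    total_tokens.
    """
    if abs(sum(mixture.values()) - 1.0) > 1e-6:
        raise ValueError("mixture weights must sum to 1")
    exact = {d: w * total_tokens for d, w in mixture.items()}
    counts = {d: int(x) for d, x in exact.items()}  # round down first
    leftover = total_tokens - sum(counts.values())
    # Hand the remaining tokens to the domains with the largest fractions.
    by_fraction = sorted(exact, key=lambda d: exact[d] - counts[d], reverse=True)
    for d in by_fraction[:leftover]:
        counts[d] += 1
    return counts


# Hypothetical optimized mixture for a cooldown phase of 20B tokens.
mixture = {
    "english": 0.35,
    "chinese": 0.20,
    "code": 0.15,
    "math": 0.10,
    "knowledge": 0.20,
}
plan = tokens_per_domain(mixture, total_tokens=20_000_000_000)
for domain, n_tokens in plan.items():
    print(f"{domain:>10}: {n_tokens:,}")
```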
- Cost/emissions planning and reporting
- What it does: Use hourglass sampling + CAMEL to cut mixture-search runs and report reduced GPU-hours and emissions in model cards.
- Sectors: Industry labs, policy/compliance teams.
- Tools/workflows: Integrate the compute plan and realized savings into governance dashboards and reproducibility statements.
- Assumptions/dependencies: Traceable run accounting; organizational buy-in for compute budgeting.
- Academic methodology for data-mixture causal studies
- What it does: Provides a principled, capacity-aware framework to test how domain weights and model size jointly affect generalization.
- Sectors: Academia.
- Tools/workflows: Use CAMEL fits and ablations on intrinsic domains k to design controlled studies; share mixtures and fitted parameters for reproducibility.
- Assumptions/dependencies: Access to a model ladder and standardized evaluation suites.
- Lightweight open-source utilities
- What it does: Package a small library that:
- Fits the capacity-aware mixture law on (M, r, L) triplets.
- Fits the multi-loss logistic benchmark mapping.
- Solves the mixture optimization given task weights.
- Sectors: Open-source LLM training stacks (PyTorch, JAX), MLOps.
- Tools/workflows: CLI and Python API; export optimized mixtures as JSON for dataloaders.
- Assumptions/dependencies: Community datasets and validation splits; tests on both dense and MoE variants recommended.
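The export step could look like the sketch below, which validates a mixture and serializes it with Python's standard `json` module. The schema (`target_size`, `mixture`) and the function name are hypothetical; an actual dataloader would define its own config format.

```python
import json


def mixture_to_json(mixture, target_size, tol=1e-6):
    """Validate a mixture and serialize it for a dataloader config.

    mixture maps domain name -> weight; weights must be nonnegative and
    sum to one (within tol). target_size is a free-form label.
    """
    if any(w < 0 for w in mixture.values()):
        raise ValueError("mixture weights must be nonnegative")
    total = sum(mixture.values())
    if abs(total - 1.0) > tol:
        raise ValueError(f"mixture weights sum to {total}, expected 1")
    payload = {"target_size": target_size, "mixture": mixture}
    return json.dumps(payload, indent=2, sort_keys=True)


# Hypothetical weights; the size label is the target model from the paper.
text = mixture_to_json(
    {"english": 0.35, "chinese": 0.20, "code": 0.15, "math": 0.10, "knowledge": 0.20},
    target_size="55B-A1.2B",
)
print(text)
```

Validating before writing keeps a malformed mixture from silently reaching a training job.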
- “Mixture optimization as a service” for smaller teams
- What it does: A hosted service that runs the hourglass plan on customer-provided data buckets and returns an optimized mid-training recipe for a stated target size and task portfolio.
- Sectors: Startups, SMEs, applied AI teams.
- Tools/workflows: Managed runs, secure data handling, deliverables include r*(M; w), predicted task gains, and cost savings.
- Assumptions/dependencies: Data governance and privacy; customers must specify target benchmarks and acceptable proxies.
Long-Term Applications
- Closed-loop, online mixture control during training
- What it could do: Dynamically reallocate domain sampling based on real-time loss-to-benchmark predictions and inferred capacity allocation, rather than fixed late-phase “cooldown” schedules.
- Sectors: Software/AI at scale, cloud training platforms.
- Tools/workflows: Controllers that adjust dataloader weights every N steps; guardrails against instability.
- Assumptions/dependencies: Stable online estimators; additional research on control policies and convergence.
- Non-parametric or hybrid laws for richer domains
- What it could do: Replace or augment CAMEL’s parametric form with simple non-parametric fits or mixtures of experts across domains to better capture non-linear interactions and phase transitions.
- Sectors: Research labs; high-stakes verticals with complex data (e.g., clinical + multimodal).
- Tools/workflows: Kernel regressors or spline fits over (M, r); model-selection tooling.
- Assumptions/dependencies: More data to avoid overfitting; careful uncertainty estimation.
- Cross-modal mixture scaling for multimodal foundation models
- What it could do: Extend capacity-aware mixture laws to text–image–audio–video mixtures and synthetic vs. real data blends.
- Sectors: Robotics, autonomous systems, media AI, healthcare imaging.
- Tools/workflows: Multimodal validation-loss probes; per-modality ladders; mixed-modality curricula.
- Assumptions/dependencies: New loss proxies and benchmarks per modality; modality-specific capacity dynamics.
- Task- and user-personalized pretraining curricula
- What it could do: Enterprise- or team-specific mixtures (e.g., internal docs + general knowledge) targeted to a deployment’s task mix and model size, potentially refreshed as task distributions drift.
- Sectors: Enterprise AI, productivity suites, customer support.
- Tools/workflows: Periodic re-fitting with updated task weights; scheduled continual pretraining windows.
- Assumptions/dependencies: Safe and compliant use of proprietary text; data drift monitoring.
- Active mixture exploration (mixture Bayesian optimization)
- What it could do: Use uncertainty from the fitted law to actively propose the next (M, r) experiments that maximally reduce prediction error per GPU-hour.
- Sectors: MLOps, AutoML for data.
- Tools/workflows: Bayesian optimization over the simplex with multi-fidelity (scale-aware) acquisition.
- Assumptions/dependencies: Calibrated uncertainty; orchestration overhead.
- Data marketplace pricing by marginal mixture contribution
- What it could do: Use the fitted law to estimate the marginal downstream value of an additional tranche from domain i at target M, informing procurement and licensing.
- Sectors: Data vendors, AI procurement, finance.
- Tools/workflows: “Data ROI” calculators; contract clauses tied to measurable gains.
- Assumptions/dependencies: Law fidelity on unseen sources; legal and ethical sourcing.
- Regulatory and governance standards around data mixtures
- What it could do: Encourage/require disclosure of mixture design, sampling strategy (e.g., hourglass vs. uniform), and compute used, to support comparability and environmental reporting.
- Sectors: Policy, standards bodies, public-sector AI.
- Tools/workflows: Model cards with mixture and compute statements; audit checklists.
- Assumptions/dependencies: Community consensus; willingness to share mixture details without leaking sensitive IP.
- Privacy- and safety-aware mixture shaping
- What it could do: Optimize mixtures to reduce reliance on sensitive or risky domains while preserving performance by upweighting safer substitutes identified as high-yield in the law.
- Sectors: Healthcare, legal, public-sector AI.
- Tools/workflows: Safety-weighted objective functions; red-team validation integrated into the optimization loop.
- Assumptions/dependencies: Reliable safety/PII labels; trade-off analysis on performance vs. risk.
- Forecasting tools for data acquisition and compute planning
- What it could do: Given target tasks and planned model size, forecast expected gains from additional data in each domain and the compute needed to fit/validate; support portfolio and budget planning.
- Sectors: AI program management, CFO/COO offices.
- Tools/workflows: Scenario planners using CAMEL parameters and uncertainty bands.
- Assumptions/dependencies: Stable task mix; historical data on costs and throughput.
- Application to embodied and simulation-heavy training
- What it could do: Optimize mixtures of synthetic simulation logs vs. real-world traces (and code/planning text) as robot/control model capacity scales.
- Sectors: Robotics, autonomous driving, industrial automation.
- Tools/workflows: Domain buckets per environment; validation losses linked to sim2real benchmarks.
- Assumptions/dependencies: New proxies bridging loss to embodied metrics; careful treatment of distribution shift.
Notes on feasibility across applications
- Transferability: CAMEL is validated up to a 55B-A1.2B MoE target; dense models and larger scales are untested and warrant verification before relying on the law.
- Proxies matter: The loss-to-benchmark mapping hinges on picking validation losses that actually correlate with target benchmarks; inadequate coverage degrades accuracy.
- Data quality: The law assumes relatively stable intrinsic domain weights and similar learning exponents; highly heterogeneous or noisy domains may violate assumptions, requiring more flexible models.
- Compute ladder: Benefits appear when you can train a small ladder of model sizes; teams without this capability may need shared/public ladders or pooled experiments.
Glossary
- Activated Model Parameters: the number of parameters actually used during inference/training in sparse or MoE models, often smaller than total parameters. "Activated Model Parameters"
- Benchmark-proxy loss: a scalar constructed from validation losses to stand in for benchmark performance in scaling laws. "Benchmark-proxy loss: L_b(r, M) = k^T L"
- bfloat16 precision: a 16-bit floating-point format with a wider exponent than FP16, used to speed up training while keeping numerical range. "Training was conducted in bfloat16 precision"
- Capacity allocation: the implicit distribution of a model’s parameter budget across tasks or domains during training. "Optimizing this problem induces a capacity allocation depends on the data mixture and the model size"
- Capacity-Aware Mixture Law (CAMEL): a scaling law that jointly models the effects of data mixture and model size on loss/accuracy. "We refer to this formulation as the Capacity-Aware Mixture Law (CAMEL)."
- Chinchilla-style scaling law: a family of scaling laws that relate model/data/compute to loss, inspired by the Chinchilla paper’s compute-optimal findings. "Ye et al. (2025) extrapolate to larger model sizes using a Chinchilla-style scaling law"
- Compute-optimal number of tokens: the training token count predicted to yield the best performance for a given model size under compute constraints. "We first pretrain the model up to its compute-optimal number of tokens"
- Continual pretraining: continued training of a pretrained model on new data to adapt capabilities without training from scratch. "For continual pretraining, Gu et al. (2024) and Que et al. (2024) study how to balance general and domain-specific data."
- Cooldown stage: a later training phase with additional tokens and adjusted data mixture to refine capabilities. "This is followed by a cooldown stage with an additional 20B tokens"
- Data Mixing Law (DML): a parametric law that models validation loss as a function of data mixture ratios. "Ye et al. (2025) propose a representative Data Mixing Law (DML)"
- Distributionally robust optimization (Group DRO): an approach that optimizes worst-group performance to improve robustness across data domains. "group distributionally robust optimization (Group DRO)"
- Domain-profile vector: a vector describing how an intrinsic domain is represented across datasets in the mixture. "We define t_i = (t_i1, ..., t_in) ∈ R^n as the domain-profile vector of intrinsic domain i across datasets."
- Exponential decay law: a parametric relationship where error decreases exponentially with a predictor like loss. "Gadre et al. (2024) propose an exponential decay law directly linking loss to benchmark error"
- Hourglass sampling strategy: a compute-aware design that concentrates samples at the smallest and largest model scales, fewer in the middle. "Hourglass. Allocate points at both extremes first and then fill toward the center."
- Intrinsic domains: latent domains that underlie datasets and validation sets, used to factor mixture effects. "we assume that both the training data and the validation data can be expressed through k intrinsic domains."
- Irreducible component of training loss: the portion of loss that cannot be reduced by training under the modeled assumptions. "represents the irreducible component of training loss."
- Logistic form: a sigmoidal function used to map losses to probabilities/accuracies. "we adopt a logistic form and extend it to a multi-dimensional setting."
- Loss-to-benchmark prediction law: a mapping from validation losses to downstream benchmark accuracies. "we introduce a loss-to-benchmark prediction law"
- Mixture distribution: the probability distribution over data formed by sampling from multiple domains according to mixture weights. "leading to the mixture distribution p(x | r)"
- Mixture-induced domain weights: effective weights of intrinsic domains implied by the chosen dataset mixture. "Intrinsic domains and mixture-induced domain weights."
- Mixture-of-Experts models: architectures that route tokens to a subset of expert networks for computational efficiency and capacity. "Mixture-of-Experts models with up to 7B-A150M parameters"
- Model-size-aware scaling laws: scaling laws that explicitly incorporate model size when predicting performance versus data mixture. "two model-size-aware scaling laws"
- Muon optimizer: an AdamW-like optimizer variant used for training large models. "Optimization used the Muon optimizer (Liu et al., 2025), a variant of AdamW"
- Power law: a relationship where one quantity varies as a power of another, used here to model loss versus allocated capacity. "follows a power law (Hoffmann et al., 2022)."
- Probability simplex: the set of nonnegative vectors that sum to one, representing valid mixture ratios. "probability simplex Δ^{n-1} = {r ∈ R^n_{≥0} : Σ_i r_i = 1}."
- Scaling Law for Optimal Data Mixture (SODM): a model-size-aware formulation to predict optimal data mixtures. "which we refer to as SODM (Scaling Law for Optimal Data Mixture)"
- Sigmoidal mapping: a smooth S-shaped transformation (e.g., logistic) from loss to accuracy. "then apply a sigmoidal mapping to predict benchmark accuracy"
- Stable intrinsic-domain weights: an assumption that induced domain weights vary only slightly across mixtures. "Stable intrinsic-domain weights."