DataDecide: How to Predict Best Pretraining Data with Small Experiments
(2504.11393v1)
Published 15 Apr 2025 in cs.LG and cs.CL
Abstract: Because LLMs are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide -- the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (~80% of comparisons correct). No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval >80% predictable at the target 1B scale with just 0.01% of the compute.
This paper introduces DataDecide, an extensive open suite of models, data, and evaluations designed to help researchers and practitioners determine the best pretraining data for language models (LMs) by using smaller-scale, less expensive experiments (Magnusson et al., 15 Apr 2025). The core problem addressed is the high cost of pretraining LMs on various datasets, which makes it crucial to find reliable methods for predicting which data will yield the best-performing large models based on small-scale trials.
DataDecide Suite:
To enable empirical study of this question, the authors pretrained over 1,050 models (resulting in over 30,000 checkpoints) across:
25 data recipes: These include popular corpora like Dolma, C4, RefinedWeb, FineWeb, and DCLM, as well as variations involving source mixing, deduplication, filtering, and ablations of specific domains (e.g., code, math, Reddit).
14 model sizes: Ranging from 4 million to 1 billion parameters.
Training tokens: Up to 100 billion tokens, maintaining a token-to-parameter ratio of 100 (5x Chinchilla-optimal); a rough compute accounting is sketched after this list.
3 random seeds: For the largest (1B parameter) models, full reruns were conducted. For smaller models, second and third seed runs were terminated early (at 25% of the target compute budget) to save compute while still allowing for variance assessment.
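To ground the compute percentages quoted later, here is a minimal sketch of training-FLOPs accounting at the suite's fixed 100 tokens-per-parameter ratio, assuming the common 6ND approximation for dense transformers (the paper's exact FLOPs accounting may differ); under this assumption, a roughly 10M-parameter run costs about 0.01% of the 1B-parameter target budget.

```python
# Minimal sketch: approximate training FLOPs at a fixed token/parameter ratio.
# Assumes the common FLOPs ~= 6 * N * D estimate; the paper may count FLOPs differently.

TOKENS_PER_PARAM = 100  # ratio used throughout DataDecide (5x Chinchilla-optimal)

def train_flops(n_params: float, ratio: float = TOKENS_PER_PARAM) -> float:
    tokens = ratio * n_params
    return 6 * n_params * tokens

target = train_flops(1e9)  # 1B-parameter target scale, ~6e20 FLOPs

for n in [4e6, 10e6, 150e6, 1e9]:  # illustrative sizes; the suite spans 4M to 1B
    pct = 100 * train_flops(n) / target
    print(f"{n/1e6:6.0f}M params -> {pct:.4f}% of target compute")
```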
All models, pretraining corpora, and evaluation results are released on HuggingFace.
Methods:
The paper evaluates two main approaches for predicting large-scale performance from small-scale experiments:
Ranking Single Scale Experiments (Single Scale): This common practice involves running experiments with different data recipes at a single, small model size. The data recipe performing best at this small scale is presumed to be the best for the larger target scale.
Extrapolating Scaling Laws (Multi Scale): This involves fitting scaling laws to the performance of models trained on a data recipe across multiple small scales. The data recipe whose scaling law predicts the highest performance at the target scale is chosen. The paper uses a two-step approach:
$L(C) = \frac{A}{C^{\alpha}} + E$: Predicts language modeling loss ($L$) from compute ($C$).
$\mathrm{Acc}(L) = \frac{a}{1 + e^{-k(L - L_0)}} + b$: Predicts downstream accuracy (Acc) from the predicted loss ($L$).
Eight baseline scaling law methods were tested; a minimal sketch of the two-step fit is given below.
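A minimal sketch of this two-step extrapolation, assuming SciPy's curve_fit and illustrative initial guesses (the paper's eight baselines differ in their fitting details):

```python
# Minimal sketch of the two-step fit: compute -> loss, then loss -> accuracy.
import numpy as np
from scipy.optimize import curve_fit

def loss_from_compute(C, A, alpha, E):
    # L(C) = A / C^alpha + E
    return A / C**alpha + E

def acc_from_loss(L, a, k, L0, b):
    # Acc(L) = a / (1 + exp(-k * (L - L0))) + b
    return a / (1.0 + np.exp(-k * (L - L0))) + b

def predict_target_accuracy(compute, loss, acc, target_compute):
    """compute, loss, acc: arrays observed at small scales for one data recipe."""
    (A, alpha, E), _ = curve_fit(loss_from_compute, compute, loss,
                                 p0=[1e3, 0.3, 2.0], maxfev=20000)
    (a, k, L0, b), _ = curve_fit(acc_from_loss, loss, acc,
                                 p0=[-0.5, 5.0, float(np.median(loss)), 0.75],
                                 maxfev=20000)
    L_target = loss_from_compute(target_compute, A, alpha, E)
    return acc_from_loss(L_target, a, k, L0, b)

# The recipe with the highest predicted accuracy at the target compute is chosen.
```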
Prediction Metrics:
Prediction Error: Measures the relative or absolute difference between predicted and actual downstream performance.
Decision Accuracy: The primary metric, defined as the percentage of pairwise recipe comparisons in which the "winner" identified at small scale matches the true winner. The "true" winner is determined by the mean performance of 1B parameter models over 3 random seeds. This is similar to Kendall's τ (see the sketch after this list).
Percent of Target Compute Budget ($): Measures the FLOPs used for small-scale experiments as a percentage of the FLOPs for the target (1B parameter) model training.
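A minimal sketch of decision accuracy over all recipe pairs, assuming per-recipe scores at the small scale and seed-averaged scores at the target scale (how ties are handled is an assumption):

```python
# Minimal sketch: fraction of recipe pairs where the small-scale winner
# matches the target-scale winner.
from itertools import combinations

def decision_accuracy(small_scores: dict, target_scores: dict) -> float:
    """small_scores / target_scores map recipe name -> metric value
    (target_scores taken as the mean over 3 seeds at the 1B scale)."""
    correct, total = 0, 0
    for r1, r2 in combinations(sorted(small_scores), 2):
        predicted = r1 if small_scores[r1] >= small_scores[r2] else r2
        actual = r1 if target_scores[r1] >= target_scores[r2] else r2
        correct += int(predicted == actual)
        total += 1
    return correct / total
```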
Performance Evaluation:
OLMES Suite: Performance is evaluated on 10 multiple-choice question-answering benchmarks (MMLU, HellaSwag, ARC Challenge/Easy, PIQA, CommonsenseQA, SocialIQA, OpenBookQA, BoolQ, WinoGrande) using the "cloze" formulation (CF) accuracy.
Proxy Metrics: At smaller scales, continuous metrics are explored as proxies for the target discrete accuracy. These include:
Correct Prob: Average probability of the correct answer.
Margin: Average difference between correct answer probability and the most likely incorrect answer probability.
Norm Correct Prob: Correct Prob normalized by the sum of probabilities of all possible answers.
Total Prob: Average sum of probabilities of all (correct and incorrect) answer options.
These can be normalized by token or character length (character length normalization was generally found to be optimal); a sketch of these proxy metrics follows.
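A minimal per-item sketch of these proxy metrics, assuming each answer option's summed log-likelihood and character length are available (the exact length-normalization scheme here is an assumption):

```python
# Minimal sketch: continuous proxy metrics for one multiple-choice item.
import math

def proxy_metrics(option_logprobs, option_char_lens, correct_idx):
    # Length-normalize per character, then convert to per-character probabilities.
    probs = [math.exp(lp / n) for lp, n in zip(option_logprobs, option_char_lens)]
    correct = probs[correct_idx]
    incorrect = [p for i, p in enumerate(probs) if i != correct_idx]
    return {
        "correct_prob": correct,                    # Correct Prob
        "margin": correct - max(incorrect),         # Margin
        "norm_correct_prob": correct / sum(probs),  # Norm Correct Prob
        "total_prob": sum(probs),                   # Total Prob
        "accuracy": float(correct >= max(probs)),   # discrete CF accuracy
    }

# Benchmark-level scores average these per-item values over the evaluation set.
```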
Key Findings and Recommendations:
Compute vs. Decision Accuracy:
There's a positive, roughly log-linear relationship between experimental compute and decision accuracy. More compute generally leads to better decisions.
Intermediate checkpoints can provide decision accuracy comparable to fully trained small models using an equivalent amount of compute.
The compute needed for good predictions varies significantly by task. MMLU and ARC Easy are predictable with much less compute (0.01% of target compute for >80% decision accuracy) than HellaSwag. Some tasks like SocialIQA are hard to predict reliably.
Scaling Laws vs. Single Scale Ranking:
Ranking models at a single, small scale (e.g., 150M parameters) is a strong baseline, correctly predicting pairwise comparisons ~80% of the time for the 1B target scale.
None of the 8 baseline scaling law methods tested significantly outperformed the compute-decision accuracy frontier set by single-scale predictions.
The authors suggest DataDecide can be a benchmark for future, improved scaling law methods.
Effectiveness of Proxy Metrics:
At small scales, continuous proxy metrics, particularly Correct Prob and Total Prob (average likelihood of correct or all answer options), often serve as better or equivalent predictors of decisions than using the discrete Accuracy metric itself (which is the target metric at the large scale).
For 5 out of 10 OLMES tasks, Correct Prob and Total Prob improved decision accuracy at smaller scales.
These likelihood-based metrics sometimes even decrease in decision accuracy as compute approaches the target, where Accuracy and other metrics like Norm Correct Prob and Margin (which penalize incorrect answers) tend to take over.
Characterizing Benchmark Predictability:
Better decision accuracy on a task is associated with:
Low run-to-run variance (noise).
A wide spread of performance values across different data recipes.
Using Correct Prob as a proxy often improves one or both of these characteristics (sketched below).
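A minimal sketch of these two diagnostics, combined here as a simple spread-to-noise ratio (the paper's exact formulation of noise and spread is an assumption):

```python
# Minimal sketch: spread across recipes vs. run-to-run noise for one benchmark metric.
import statistics

def spread_and_noise(scores: dict):
    """scores maps recipe name -> list of results across random seeds."""
    means = {r: statistics.mean(v) for r, v in scores.items()}
    spread = max(means.values()) - min(means.values())  # range across data recipes
    noise = statistics.mean(                            # average run-to-run std dev
        statistics.stdev(v) for v in scores.values() if len(v) > 1
    )
    ratio = spread / noise if noise else float("inf")
    return spread, noise, ratio

# A higher spread-to-noise ratio (for a task or a proxy metric) tends to
# yield more reliable data decisions.
```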
For code generation tasks (HumanEval, MBPP), which are difficult for small models to achieve non-trivial accuracy on, switching the proxy metric from Accuracy to Correct Prob at small scales dramatically improved decision accuracy (from trivial to ~80%) for predicting target-scale Accuracy. This benefit was not observed for math benchmarks like GSM8K.
Practical Implications and Contributions:
Cost Reduction: Provides guidance on how to make informed decisions about pretraining data with significantly reduced computational cost by using small-scale experiments.
Efficient Experimentation: Highlights that for many tasks, simple single-scale ranking with appropriate proxy metrics (like Correct Prob) can be highly effective.
Benchmark for Future Research: The DataDecide suite (models, data, evaluations) is a valuable public resource for researchers to develop and test new data selection strategies, scaling laws, and evaluation metrics without needing to repeat extensive pretraining.
Task-Specific Strategies: Shows that the optimal approach for data decisions can depend on the target downstream task.
Proxy Metric Utility: Demonstrates that using continuous likelihood-based proxy metrics at small scales can make tasks that are otherwise too difficult (e.g., code generation) predictable, allowing for more reliable data decisions.
Limitations:
The paper uses a single token-to-parameter ratio (100:1).
A specific set of 14 model configurations (4M-1B parameters) and 25 data recipes were used, which may not cover all future scenarios.
Evaluations focused on multiple-choice QA tasks; other task types might yield different results.
In essence, DataDecide offers a framework and a rich dataset to empirically investigate how to best choose pretraining data for LMs. It suggests that ranking based on single small-scale experiments using continuous likelihood metrics as proxies can be a surprisingly effective and compute-efficient strategy for predicting which data will lead to better large models.