
Predictive Data Selection: The Data That Predicts Is the Data That Teaches (2503.00808v3)

Published 2 Mar 2025 in cs.CL

Abstract: LLM pretraining involves training on extensive corpora, where data quality plays a pivotal role. In this work, we aim to directly estimate the contribution of data during pretraining and select pretraining data in an efficient manner. Specifically, we draw inspiration from recent findings showing that compression efficiency (i.e., the normalized loss) of diverse models on certain text correlates strongly with their downstream performance, when the text domain aligns with the downstream benchmarks (Huang et al., 2024). Building on this observation, we hypothesize that data on which model losses are predictive of downstream abilities also contribute effectively to learning. To leverage this insight, we introduce predictive data selection (PreSelect), a lightweight and efficient data selection method that requires training and deploying only a fastText-based scorer. Through comprehensive experiments with 1B and 3B parameter models, we demonstrate that models trained on 30B tokens selected with PreSelect surpass the performance of the vanilla baseline trained on 300B tokens, achieving a 10x reduction in compute requirements. Furthermore, PreSelect significantly outperforms other competitive data selection baselines, such as DCLM and FineWeb-Edu, on a scale of 3B models trained on 100B tokens. We open-source our trained data selection scorer along with the curated datasets at https://github.com/hkust-nlp/PreSelect.
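The core hypothesis can be made concrete with a toy rank-agreement calculation. The sketch below is an illustration only, not the paper's exact matching score: it measures how well a per-document ranking of models by compression (normalized loss) agrees with the models' downstream-benchmark ranking, via pairwise concordance. Function names and the example ranks are hypothetical.

```python
from itertools import combinations

def pairwise_concordance(rank_a, rank_b):
    """Fraction of model pairs ordered the same way in both rankings.

    rank_a, rank_b: dicts mapping model name -> rank position
    (lower = better). A value near 1.0 means the rankings agree,
    i.e. losses on this document "predict" downstream ability.
    """
    models = list(rank_a)
    pairs = list(combinations(models, 2))
    agree = sum(
        (rank_a[m] - rank_a[n]) * (rank_b[m] - rank_b[n]) > 0
        for m, n in pairs
    )
    return agree / len(pairs)

# Hypothetical example: four models ranked by normalized loss on one
# document (loss_rank) vs. by downstream benchmark score (bench_rank).
loss_rank = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
bench_rank = {"model_a": 1, "model_b": 3, "model_c": 2, "model_d": 4}
score = pairwise_concordance(loss_rank, bench_rank)  # 5 of 6 pairs agree
```

Documents whose concordance is high would be the ones this criterion favors; a proper rank-correlation statistic (e.g., Kendall's tau) would serve the same role.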

Summary


This work introduces a method for efficiently selecting high-quality pretraining data based on the predictive strength measured via compression efficiency.

  • It quantifies a document’s predictive strength by computing normalized LLM loss rankings and comparing them with downstream performance ranks using a matching score.
  • A fastText-based classifier is trained on a curated seed set to enable scalable document-level selection over massive corpora, yielding up to 10× reduction in computation while surpassing baselines.
  • Extensive experiments with models ranging from 400M to 3B parameters show significant absolute improvements (e.g., 5.3% on key tasks, 10% gains on ARC-Easy, and marked BPC reductions for math/code) compared to random selection and existing methods like DCLM.
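The selection step above can be sketched in a few lines. This is a minimal stand-in, assuming a trained scorer exposes a callable mapping text to a predictive-strength score; in the actual pipeline that scorer is the trained fastText classifier, and the function and variable names here are hypothetical.

```python
def select_top_fraction(documents, score, fraction=0.1):
    """Keep the top `fraction` of documents by predictive-strength score.

    documents: list of strings; score: callable text -> float (a
    stand-in for the trained fastText scorer). Returns the selected
    documents in descending score order.
    """
    ranked = sorted(documents, key=score, reverse=True)
    k = max(1, int(len(documents) * fraction))
    return ranked[:k]

# Toy usage with a dummy scorer (document length as a placeholder):
docs = ["short", "a medium document", "a much longer, detailed document"]
selected = select_top_fraction(docs, score=len, fraction=0.34)
```

At corpus scale the same logic is typically applied with a fixed score threshold rather than a global sort, so documents can be filtered in a single streaming pass.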