MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models (2406.06046v2)

Published 10 Jun 2024 in cs.CL and cs.LG

Abstract: Pretraining data selection has the potential to improve LLM pretraining efficiency by utilizing higher-quality data from massive web data corpora. Current data selection methods, which rely on either hand-crafted rules or larger reference models, are conducted statically and do not capture the evolving data preferences during pretraining. In this paper, we introduce model-aware data selection with data influence models (MATES), where a data influence model continuously adapts to the evolving data preferences of the pretraining model and then selects the data most effective for the current pretraining progress. Specifically, we collect oracle data influence by locally probing the pretraining model and fine-tune a small data influence model to approximate it accurately. The data influence model then predicts data influence over the whole pretraining corpus and selects the most influential data for the next pretraining stage. Experiments of pretraining 410M and 1B models on the C4 dataset demonstrate that MATES significantly outperforms random data selection on extensive downstream tasks. It doubles the gains achieved by the state-of-the-art data selection approach that leverages larger reference models and reduces the total FLOPs required to reach certain performances by half. Further analyses validate the effectiveness of the locally probed oracle data influence and the approximation with data influence models. Our code is open-sourced at https://github.com/cxcscmu/MATES.

Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

This essay presents an in-depth examination of "MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models," a research paper authored by Zichun Yu, Spandan Das, and Chenyan Xiong from Carnegie Mellon University. The central premise of the work is improving LLM pretraining efficiency through dynamic, model-aware data selection that adapts to the evolving needs of the pretraining model as training progresses.

Summary and Objectives

The paper addresses a fundamental constraint in scaling up LLMs: compute resources. While model parameter size and data volume are traditionally scaled up in lockstep with available compute, current data selection methods are applied statically and overlook the shifts in the model's data preferences that occur during pretraining. This static nature leads to suboptimal performance when scaling LLMs. To remedy this, the authors introduce MATES (model-aware data selection with data influence models), a framework designed to optimize pretraining efficiency by continuously adapting to the pretraining model's evolving data preferences.

Methodology

The core innovation in MATES is the use of a small, continuously updated data influence model that performs on-the-fly data selection. This model is fine-tuned to approximate "oracle" data influence signals that are periodically probed from the pretraining model itself, so MATES can select the data most effective for the current state of pretraining. This setup diverges significantly from static heuristics or one-off influence estimates, which do not account for ongoing training dynamics.
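
To make the oracle signal concrete, below is a minimal sketch of the local probing idea (step 1 in the list that follows), assuming a PyTorch/Hugging Face setup with a Pythia-410M checkpoint. The checkpoint, learning rate, and function names are illustrative assumptions, not the authors' released implementation:

```python
# Illustrative sketch only: approximates oracle data influence as the drop in
# reference loss after one gradient step on a candidate example. Checkpoint,
# learning rate, and function names are assumptions, not the MATES codebase.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-410m").to(device)

def reference_loss(m, reference_texts):
    """Mean next-token loss of model m on a held-out reference set."""
    enc = tok(reference_texts, return_tensors="pt", padding=True,
              truncation=True, max_length=512).to(device)
    labels = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    with torch.no_grad():
        return m(**enc, labels=labels).loss.item()

def oracle_influence(candidate_text, reference_texts, lr=1e-5):
    """Locally probe influence: reduction in reference loss after training a
    copy of the pretraining model for one step on the candidate.
    Larger value = more helpful data point."""
    probe = copy.deepcopy(model)  # probe a copy so the main model is untouched
    probe.train()
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    enc = tok(candidate_text, return_tensors="pt", truncation=True,
              max_length=512).to(device)
    loss = probe(**enc, labels=enc["input_ids"]).loss
    loss.backward()
    opt.step()
    probe.eval()
    return reference_loss(model, reference_texts) - reference_loss(probe, reference_texts)
```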

The data influence model in MATES follows these steps:

  1. Oracle Data Influence Probing: Periodically, oracle data influence scores are collected for a small sample of candidate data by measuring how the pretraining model's loss on a reference task changes after training on individual data points.
  2. Training the Influence Model: The locally collected oracle influence scores are used to fine-tune a small data influence model based on a BERT architecture.
  3. Data Selection: The trained influence model then predicts data influence over the entire pretraining corpus, and the top-k most influential data points are selected for the next pretraining stage (a sketch of steps 2 and 3 follows this list).
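
For steps 2 and 3, a correspondingly minimal sketch fits a small BERT-based regressor to the probed scores and uses it to rank the corpus. The model choice (bert-base-uncased), hyperparameters, and function names are assumptions rather than the paper's exact configuration:

```python
# Illustrative sketch only: fit a small BERT-based regressor to the probed
# oracle scores, then score the whole corpus and keep the top-k examples.
# Model name, hyperparameters, and function names are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
influence_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression"
).to(device)

def fit_influence_model(texts, oracle_scores, epochs=2, lr=2e-5, bs=16):
    """Regress locally probed oracle influence scores from raw text."""
    opt = torch.optim.AdamW(influence_model.parameters(), lr=lr)
    influence_model.train()
    for _ in range(epochs):
        for i in range(0, len(texts), bs):
            enc = bert_tok(texts[i:i + bs], return_tensors="pt", padding=True,
                           truncation=True, max_length=512).to(device)
            labels = torch.tensor(oracle_scores[i:i + bs],
                                  dtype=torch.float, device=device)
            loss = influence_model(**enc, labels=labels).loss  # MSE for regression
            opt.zero_grad()
            loss.backward()
            opt.step()

@torch.no_grad()
def select_top_k(corpus_texts, k, bs=64):
    """Predict influence over the corpus and keep the k highest-scoring examples."""
    influence_model.eval()
    scores = []
    for i in range(0, len(corpus_texts), bs):
        enc = bert_tok(corpus_texts[i:i + bs], return_tensors="pt", padding=True,
                       truncation=True, max_length=512).to(device)
        scores.append(influence_model(**enc).logits.squeeze(-1))
    scores = torch.cat(scores)
    top = torch.topk(scores, k).indices.tolist()
    return [corpus_texts[j] for j in top]
```

In the full MATES procedure, this probe-fit-select cycle repeats at regular intervals so that data selection keeps tracking the pretraining model's evolving preferences.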

Experimental Results

The authors conducted extensive experiments pretraining Pythia-architecture models (410M and 1B parameters) on the C4 dataset, evaluating downstream tasks in both zero-shot and few-shot settings. Notably, MATES demonstrated superior performance compared to random data selection as well as other existing data selection techniques. Key results include:

  • Pythia models pretrained with MATES achieved an average zero-shot accuracy improvement of 1.3% across various tasks.
  • MATES effectively doubled the performance gains over state-of-the-art data selection methods, while also halving the FLOPs required to reach certain performance milestones.

These results validate the hypothesis that data preferences change dynamically during pretraining and that capturing these preferences can materially improve pretraining efficiency.

Theoretical and Practical Implications

The theoretical implications of this research are multifaceted:

  1. Dynamic Data Preferences: The validation of ever-changing data preferences during LLM pretraining highlights the need for dynamically adaptive data selection methodologies.
  2. Data Influence Models: The effective approximation and utilization of oracle data influence through small, efficient models open new avenues for integrating lightweight adaptive processes within large-scale model training.

From a practical standpoint:

  1. Scaling Efficiency: The reduction in compute requirements for pretraining without sacrificing—and indeed often improving—model performance suggests significant cost savings and efficiency gains.
  2. Model Robustness: Improved data selection appears to enhance the robustness of pretrained models across a variety of downstream tasks, potentially leading to broader applicability and more reliable performance of deployed models.

Future Directions

While the results from MATES are promising, several future research directions are evident:

  1. Combinatorial Data Influence: The current approach relies on individual pointwise influences. Future work may extend this to the combinatorial effects of groups of data points on model performance.
  2. Scalability: Although the paper demonstrates effectiveness at moderate scales (410M/1B parameters), there is a need to explore whether these benefits hold at larger scales typical of production LLMs.
  3. Algorithm Refinement: Further experiments to refine the hyperparameters and the data influence model training process could potentially yield even greater efficiencies.

Conclusion

MATES introduces a novel paradigm for data selection in LLM pretraining, highlighting the transformative potential of dynamically adaptive data influence models. The research showcases substantial improvements in pretraining efficiency and downstream task performance, suggesting that dynamic data selection could play a critical role in the future of scalable LLM pretraining. This work not only paves the way for more efficient utilization of computational resources but also opens up new research avenues into the nuanced understanding of dynamic data preferences in model training.

Authors (3)
  1. Zichun Yu (8 papers)
  2. Spandan Das (6 papers)
  3. Chenyan Xiong (95 papers)