Investigating Data Pruning for Efficient Pretraining of LLMs
This paper explores a critical aspect of large-scale LLM pretraining: the potential benefits of data pruning. The authors challenge the conventional assumption that larger training corpora automatically yield better models by investigating whether strategic pruning can maintain or even enhance model quality.
Context and Motivation
LLMs, such as GPT and BERT, are traditionally trained on vast datasets, often compiled from noisy web sources. While the prevalent assumption has been that more data leads to better models, this paper examines whether intelligently reducing the training dataset can improve efficiency without sacrificing performance.
Methodology
The authors employ perplexity, Error L2-Norm (EL2N), and memorization factors as data quality estimators for pruning pretraining data. These metrics allow ranking dataset examples on a perceived quality scale. By retaining various portions of these ranked datasets (e.g., top, middle, bottom subsets), the paper assesses the impact on the trained LLM's performance.
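As a concrete illustration of the perplexity-based variant of this procedure, the sketch below scores examples with a small reference language model and keeps a fraction of the ranked data. This is not the authors' implementation: the choice of GPT-2 as the reference model, the retention fraction, and the subset-selection logic are illustrative assumptions.

```python
# Minimal sketch of perplexity-based data pruning, not the authors' code.
# The reference model ("gpt2"), retention fraction, and subset choice are
# illustrative assumptions.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def perplexity(text, model, tokenizer, device="cpu"):
    """Perplexity of one example under the reference model (exp of mean NLL)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss  # mean token NLL
    return math.exp(loss.item())


def prune_by_perplexity(texts, model, tokenizer, keep_frac=0.5, subset="middle"):
    """Rank examples by perplexity and keep a bottom/middle/top slice."""
    ranked = sorted(texts, key=lambda t: perplexity(t, model, tokenizer))
    k = int(len(ranked) * keep_frac)
    if subset == "bottom":                 # lowest-perplexity examples
        return ranked[:k]
    if subset == "top":                    # highest-perplexity examples
        return ranked[-k:]
    start = (len(ranked) - k) // 2         # middle of the ranked distribution
    return ranked[start:start + k]


# Illustrative usage: a pretrained GPT-2 stands in for the reference model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
corpus = ["An example web document ...", "Another noisy crawl snippet ..."]
kept = prune_by_perplexity(corpus, model, tokenizer, keep_frac=0.5)
```

The `subset` argument mirrors the top/middle/bottom splits described above; which slice and which retention fraction work best is an empirical question the paper investigates.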
Key Findings
- Perplexity as a Dominant Metric: Surprisingly, the authors find that perplexity, a relatively simple metric, outperforms more computationally intensive measures such as EL2N and memorization (an illustrative EL2N sketch follows this list). Models trained on perplexity-pruned datasets achieve superior performance, with reported gains of up to 2.1% in certain scenarios compared to other methods.
- Retention Rates and Optimal Subsets: Notably, retaining only 30-50% of the original dataset, when scored by perplexity, yields better LLM performance than retaining larger volumes of data. This suggests that a significant portion of the data contributes little to effective language modeling.
- Impact of Reference Model Scale: Larger reference models used to compute perplexity scores lead to better pruning outcomes; in the reported results, a 52B-parameter reference model enables more effective pruning than smaller reference models.
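For comparison with the perplexity score above, here is a hedged sketch of an EL2N-style score for a language-modeling example: the L2 norm of the difference between the model's predicted next-token distribution and the one-hot target, averaged over tokens. The tokenizer, the reference model, and the token-level averaging are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of an EL2N-style score for language modeling: the L2 norm of
# (softmax probabilities - one-hot next-token target), averaged over tokens.
# Tokenizer, reference model, and the averaging choice are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def el2n_score(text, model, tokenizer, device="cpu"):
    enc = tokenizer(text, return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():
        logits = model(**enc).logits                    # (1, seq_len, vocab)
    probs = torch.softmax(logits[:, :-1, :], dim=-1)    # predictions for next tokens
    targets = enc["input_ids"][:, 1:]                   # shifted gold tokens
    one_hot = torch.nn.functional.one_hot(targets, probs.size(-1)).float()
    per_token_error = (probs - one_hot).norm(dim=-1)    # L2 error at each position
    return per_token_error.mean().item()                # higher = harder example


tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
print(el2n_score("A short example document.", model, tokenizer))
```

Scores like this are more expensive to compute than perplexity because they require the full output distribution at every position, which is consistent with the paper's framing of EL2N as a more computationally intensive measure.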
Implications and Future Directions
This paper suggests possible shifts in pretraining strategies for LLMs. By focusing on quality over quantity, researchers and practitioners could curtail the computational cost and environmental impact of AI training regimens. The findings position perplexity as a practical and computationally efficient criterion for data selection, potentially reshaping future approaches to building robust and efficient LLMs.
Future work might refine metric combinations or explore novel metrics for even more precise pruning. Additionally, evaluating the effects across diverse LLM architectures and extending experiments to other natural language processing domains could provide deeper insights.
Overall, this paper argues for reconsidering how data is used in LLM pretraining, advocating strategic, quality-focused data management to drive the next wave of advancements in large language models.