Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models (2405.20541v1)

Published 30 May 2024 in cs.LG and cs.CL

Abstract: In this work, we investigate whether small LLMs can determine high-quality subsets of large-scale text datasets that improve the performance of larger LLMs. While existing work has shown that pruning based on the perplexity of a larger model can yield high-quality data, we investigate whether smaller models can be used for perplexity-based pruning and how pruning is affected by the domain composition of the data being pruned. We demonstrate that for multiple dataset compositions, perplexity-based pruning of pretraining data can \emph{significantly} improve downstream task performance: pruning based on perplexities computed with a 125 million parameter model improves the average performance on downstream tasks of a 3 billion parameter model by up to 2.04 and achieves up to a $1.45\times$ reduction in pretraining steps to reach commensurate baseline performance. Furthermore, we demonstrate that such perplexity-based data pruning also yields downstream performance gains in the over-trained and data-constrained regimes.

Authors (6)
  1. Zachary Ankner (10 papers)
  2. Cody Blakeney (7 papers)
  3. Kartik Sreenivasan (8 papers)
  4. Max Marion (2 papers)
  5. Matthew L. Leavitt (9 papers)
  6. Mansheej Paul (12 papers)
Citations (12)

Summary

Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models

Introduction

Recent advancements in LLMs have motivated the exploration of techniques to improve their efficiency and performance. One promising approach is data pruning, which involves selecting high-quality subsets of large-scale datasets. This paper investigates the feasibility of using small LLMs for perplexity-based data pruning to enhance the pretraining efficiency and downstream performance of larger LLMs.

Methodology

The central hypothesis of this paper is that small LLMs can effectively prune pretraining datasets based on perplexity, which measures the uncertainty of a model in predicting the next token. The process involves training a small reference model on a random subset of the dataset, computing the perplexity for each sample, and then pruning the dataset according to perplexity scores. The pruned dataset is used to train a much larger model. This method is proposed to reduce pretraining steps while maintaining or improving downstream performance.
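
As a concrete illustration, the sketch below scores each sample with a small reference model, where per-sample perplexity is $\exp\!\left(-\tfrac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})\right)$, and keeps a fraction of the corpus by that score. The model name ("gpt2" standing in for a ~125M-parameter reference model), truncation length, and selection rate are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of perplexity scoring with a small reference model.
# Names and hyperparameters are assumptions for illustration only.
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for a ~125M reference model
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Per-sample perplexity: exp of the mean next-token negative log-likelihood."""
    ids = tokenizer(text, return_tensors="pt", truncation=True,
                    max_length=1024).input_ids.to(device)
    loss = model(input_ids=ids, labels=ids).loss  # mean cross-entropy over tokens
    return math.exp(loss.item())

def prune(samples: list[str], keep_frac: float = 0.5) -> list[str]:
    """Keep a fraction of the corpus by perplexity score (here: the highest-perplexity subset)."""
    scored = sorted(samples, key=perplexity)      # ascending perplexity
    k = int(len(scored) * keep_frac)
    return scored[-k:]                            # selection criterion and rate are swept in the paper
```

The larger model is then pretrained on the retained subset in place of the full corpus.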

The authors focus on different dataset compositions by evaluating on both the Pile and Dolma, investigating various perplexity selection criteria and rates, and analyzing the impact of pruning in over-trained and data-constrained regimes.

Results

Perplexity-Based Data Pruning

The experiments reveal significant improvements in downstream task performance for models trained on perplexity-pruned datasets. For instance, using a 125M parameter reference model to prune data for a 3B parameter model yields an average downstream task performance improvement of up to 2.04 points. Furthermore, pruned models achieve comparable performance to baseline models in up to 1.45x fewer pretraining steps, demonstrating enhanced training efficiency.

Pruning in Non-Standard Regimes

The paper expands the analysis to over-trained and data-constrained regimes. In over-trained settings (training durations up to 5x the Chinchilla optimal number of tokens), perplexity pruning still provides substantial gains. For data-constrained scenarios, perplexity-based pruning shows benefits for up to two data repetitions before performance gains diminish.

Sensitivity to Sampling Criteria

A critical finding is the sensitivity of pruning effectiveness to dataset composition and selection criteria. On the Pile, high perplexity samples yield the best results, whereas on Dolma, medium perplexity selection is optimal. This indicates that domain-specific characteristics influence the optimal pruning strategy.
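
A hedged sketch of how such selection bands might be implemented over precomputed perplexity scores is shown below; the cut points and keep rate are assumptions for illustration, not the paper's exact settings.

```python
# Illustrative low / medium / high perplexity selection bands over precomputed scores.
import numpy as np

def select_band(perplexities: np.ndarray, band: str, keep_frac: float = 0.5) -> np.ndarray:
    """Return indices of samples falling in the requested perplexity band."""
    order = np.argsort(perplexities)          # ascending perplexity
    k = int(len(order) * keep_frac)
    if band == "low":
        return order[:k]                      # least "surprising" samples
    if band == "high":
        return order[-k:]                     # most "surprising" samples (best on the Pile)
    start = (len(order) - k) // 2
    return order[start:start + k]             # central band (best on Dolma)
```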

Evaluation Metrics

Interestingly, the paper uncovers that upstream metrics such as perplexity on a held-out test set do not correlate well with downstream task performance when evaluating pruning methods. Models trained on pruned data can exhibit worse test set perplexity yet significantly better downstream performance, suggesting that downstream evaluations provide a more accurate assessment of model quality.

Implications and Future Directions

The practical implications are significant. Efficient pruning with small reference models can substantially reduce the computational resources required to train larger models, since an inexpensive reference model can curate data for models many times its size. Additionally, the sensitivity to dataset composition underlines the need for adaptive pruning strategies tailored to specific data domains.

Theoretically, these results offer insights into the relationship between data quality, as measured by perplexity, and model training dynamics. The finding that upstream perplexity does not track downstream performance suggests that pruning methods should be evaluated on downstream tasks rather than held-out perplexity alone.

Future work could optimize selection criteria for different dataset compositions and explore other neural heuristics for data pruning. Investigating additional non-standard training regimes and their interaction with pruning remains an open avenue, and extending the framework to multimodal datasets spanning text, images, and other modalities could broaden its applicability.

Conclusion

This paper demonstrates that small reference models can effectively prune pretraining data for significantly larger LLMs, leading to improved downstream performance and increased training efficiency. The research findings highlight the importance of adaptive pruning strategies and the need for evaluating models using relevant downstream tasks rather than conventional perplexity metrics. The implications for both practical applications and theoretical understanding are substantial, suggesting new directions for future research in data pruning and model training methodologies.