Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
Introduction
Recent advances in large language models (LLMs) have motivated the exploration of techniques to improve their efficiency and performance. One promising approach is data pruning, which selects high-quality subsets of large-scale datasets. This paper investigates whether small LLMs can be used for perplexity-based data pruning to enhance the pretraining efficiency and downstream performance of much larger LLMs.
Methodology
The central hypothesis of this paper is that small LLMs can effectively prune pretraining datasets based on perplexity, a measure of a model's uncertainty in predicting the next token. The process involves training a small reference model on a random subset of the dataset, computing the perplexity of each sample under that reference model, and pruning the dataset according to the resulting scores. The pruned dataset is then used to train a much larger model. The method aims to reduce the number of pretraining steps required while maintaining or improving downstream performance.
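As a concrete illustration, the scoring-and-pruning step can be sketched as below. This is a minimal sketch, not the authors' implementation: it assumes a Hugging Face causal LM as the small reference model (gpt2 is used here purely as a stand-in for the paper's 125M-parameter reference model), and the keep fraction and keep-highest-perplexity criterion are illustrative choices.

```python
# Sketch: score candidate documents with a small reference model, then keep a
# fraction of them by perplexity. Model name, keep fraction, and the
# keep-highest criterion are illustrative assumptions, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sample_perplexity(model, tokenizer, text):
    """Per-sample perplexity: exp of the mean next-token cross-entropy loss."""
    device = next(model.parameters()).device
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).to(device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss  # mean loss over the sequence
    return torch.exp(loss).item()

def prune_by_perplexity(texts, model, tokenizer, keep_frac=0.5):
    """Score every candidate with the reference model; keep the top keep_frac by perplexity."""
    scores = [sample_perplexity(model, tokenizer, t) for t in texts]
    order = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
    keep = set(order[: int(len(texts) * keep_frac)])
    return [t for i, t in enumerate(texts) if i in keep]

if __name__ == "__main__":
    # A small off-the-shelf model stands in for the trained reference model.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    reference = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    corpus = ["An example pretraining document.", "Another candidate sample to score."]
    pruned = prune_by_perplexity(corpus, reference, tokenizer, keep_frac=0.5)
```

In practice the reference model would first be trained on a random subset of the target corpus, and scoring would be batched and parallelized across the full dataset; the sequential loop above is kept simple for clarity.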
The authors examine different dataset compositions by evaluating on both the Pile and Dolma, investigate various perplexity selection criteria and selection rates, and analyze the impact of pruning in over-trained and data-constrained regimes.
Results
Perplexity-Based Data Pruning
The experiments reveal significant improvements in downstream task performance for models trained on perplexity-pruned datasets. For instance, using a 125M-parameter reference model to prune data for a 3B-parameter model yields an average downstream task improvement of up to 2.04 points. Furthermore, models trained on pruned data reach baseline performance in up to 1.45x fewer pretraining steps, demonstrating improved training efficiency.
Pruning in Non-Standard Regimes
The paper extends the analysis to over-trained and data-constrained regimes. In over-trained settings (training for up to 5x the Chinchilla-optimal number of tokens), perplexity pruning still provides substantial gains. In data-constrained scenarios, perplexity-based pruning remains beneficial for up to two repetitions of the data, after which the gains diminish.
Sensitivity to Sampling Criteria
A critical finding is that pruning effectiveness is sensitive to dataset composition and selection criteria. On the Pile, selecting high-perplexity samples yields the best results, whereas on Dolma, a medium-perplexity selection is optimal (the two criteria are sketched below). This indicates that domain-specific characteristics influence the optimal pruning strategy.
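One way to make the selection criterion concrete is as a percentile window over the perplexity scores: a "high" criterion keeps the upper tail, while a "medium" criterion keeps a band around the median. The window boundaries and the synthetic scores below are illustrative assumptions, not the paper's exact settings.

```python
# Sketch: express a perplexity selection criterion as a percentile window.
# The boundaries (50-100 for "high", 25-75 for "medium") are illustrative.
import numpy as np

def select_by_percentile_window(scores, low_pct, high_pct):
    """Return indices of samples whose perplexity lies within the given percentile window."""
    lo, hi = np.percentile(scores, [low_pct, high_pct])
    return [i for i, s in enumerate(scores) if lo <= s <= hi]

# Stand-in scores; real scores would come from the small reference model.
rng = np.random.default_rng(0)
scores = rng.lognormal(mean=2.0, sigma=0.5, size=10_000)

high_idx = select_by_percentile_window(scores, 50, 100)   # "high" window (best on the Pile)
medium_idx = select_by_percentile_window(scores, 25, 75)  # "medium" window (best on Dolma)
```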
Evaluation Metrics
Interestingly, the paper finds that upstream metrics such as perplexity on a held-out test set correlate poorly with downstream task performance when comparing pruning methods. Models trained on pruned data can exhibit worse test-set perplexity yet significantly better downstream performance, suggesting that downstream evaluations provide a more reliable assessment of model quality.
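For context, the upstream metric in question, corpus-level perplexity on a held-out set, is the exponential of the token-weighted mean next-token loss. The sketch below uses the same hypothetical Hugging Face setup as above; the paper's observation is that this number can worsen even as downstream task accuracy improves.

```python
# Sketch: corpus-level held-out perplexity, i.e. exp of the token-weighted mean loss.
import torch

def heldout_perplexity(model, tokenizer, heldout_texts):
    device = next(model.parameters()).device
    total_loss, total_predictions = 0.0, 0
    for text in heldout_texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).to(device)
        n_pred = enc["input_ids"].numel() - 1  # labels are shifted internally, so n-1 predictions
        with torch.no_grad():
            loss = model(**enc, labels=enc["input_ids"]).loss  # mean loss for this document
        total_loss += loss.item() * n_pred
        total_predictions += n_pred
    return float(torch.exp(torch.tensor(total_loss / total_predictions)))
```

Downstream quality, by contrast, is measured with task accuracy from an evaluation suite, which is the signal the paper recommends when comparing pruning methods.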
Implications and Future Directions
These findings have significant practical implications. Efficient pruning with small reference models can greatly reduce the computational resources required to train larger models, lowering the cost of continued scaling. In addition, the sensitivity to dataset composition underlines the need for pruning strategies adapted to the target data domain.
Theoretically, these results offer insight into the relationship between data quality, as measured by reference-model perplexity, and model training dynamics. The authors' emphasis on downstream evaluation over traditional perplexity metrics can also refine the criteria used to assess pruning methods and pretrained models.
Future work could optimize selection criteria for different dataset compositions and explore other neural heuristics for data pruning. How pruning interacts with other non-standard training regimes also remains an open question. Moreover, extending the framework to multimodal datasets spanning text, images, and other modalities could provide a more holistic approach to data pruning.
Conclusion
This paper demonstrates that small reference models can effectively prune pretraining data for significantly larger LLMs, leading to improved downstream performance and increased training efficiency. The research findings highlight the importance of adaptive pruning strategies and the need for evaluating models using relevant downstream tasks rather than conventional perplexity metrics. The implications for both practical applications and theoretical understanding are substantial, suggesting new directions for future research in data pruning and model training methodologies.