Automatic Pruning of Fine-tuning Datasets for Transformer-based LLMs
The paper "Automatic Pruning of Fine-tuning Datasets for Transformer-based LLMs" investigates an approach to streamline the fine-tuning process of transformer-based LLMs by introducing an automatic dataset pruning method. The primary goal is to mitigate the computational burden while maintaining or even enhancing the model's performance on downstream tasks.
Methodology
The cornerstone of the method is a proposed dataset scoring function, the H-score. The score is derived from the model's success rate in correctly classifying each training data point over multiple fine-tuning runs: data points that are consistently classified correctly receive high H-scores, while those that are repeatedly misclassified receive low ones.
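As a rough illustration of how such a score could be computed, the sketch below assumes the H-score is simply the number of fine-tuning runs (out of N) in which an example ends up classified correctly; the exact formula used in the paper may differ, and the function name and data layout here are hypothetical.

```python
import numpy as np

def h_scores(correctness: np.ndarray) -> np.ndarray:
    """Compute an H-score per training example (illustrative definition).

    correctness: boolean array of shape (num_runs, num_examples), where
    correctness[r, i] is True if example i was classified correctly at the
    end of fine-tuning run r.

    Returns an array of shape (num_examples,) counting how many runs
    classified each example correctly. Consistently correct (easy) points
    get high scores; repeatedly misclassified (hard) points get low scores.
    """
    return correctness.astype(int).sum(axis=0)

# Hypothetical records from 5 fine-tuning runs over 4 training examples.
correctness = np.array([
    [True,  True,  False, True],
    [True,  False, False, True],
    [True,  True,  False, True],
    [True,  True,  False, False],
    [True,  False, False, True],
])
print(h_scores(correctness))  # -> [5 3 0 4]
```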
Given a dataset D and its H-scores, different subsets of the training data can be constructed by excluding the data points with the highest and lowest scores. These extremes mark points that are either too easy or too difficult for the model to classify, providing a principled basis for pruning.
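Building on the previous sketch, the following is a minimal illustration of the pruning step, assuming a subset is defined by keeping only examples whose H-scores fall strictly inside a chosen band; the `low` and `high` cut-offs are illustrative parameters, not values taken from the paper.

```python
import numpy as np

def prune_by_h_score(dataset: list, scores: np.ndarray, low: int, high: int) -> list:
    """Keep only examples whose H-score lies strictly between `low` and `high`.

    Examples with the highest scores (always classified correctly, i.e. too
    easy) and the lowest scores (almost never classified correctly, i.e. too
    hard) are removed; the remainder forms the pruned fine-tuning subset.
    """
    keep = (scores > low) & (scores < high)
    return [example for example, k in zip(dataset, keep) if k]

# Usage with the hypothetical scores from the previous sketch.
dataset = ["ex0", "ex1", "ex2", "ex3"]
scores = np.array([5, 3, 0, 4])
print(prune_by_h_score(dataset, scores, low=0, high=5))  # -> ['ex1', 'ex3']
```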
Empirical Analysis
The paper evaluates the effectiveness of the pruned subsets across five downstream tasks: MNLI, SNLI, SST-2, SQuAD v2, and RACE, using two models: RoBERTa\textsubscript{\textrm{LARGE}} and OPT\textsubscript{\textrm{350M}}. Pruned subsets are created by removing both the easiest and the most difficult data points. The largest of these subsets, termed the Winning Ticket Subset, retains approximately 33% of the original training set. The experimental results show that fine-tuning on these winning ticket subsets yields an average increase of 0.1% in evaluation performance.
Results and Implications
The findings demonstrate that the proposed model-agnostic H-score effectively identifies and prunes counterproductive training data, leading to efficient sub-datasets for fine-tuning. This data-driven approach eliminates the need for manual hyperparameter tuning to determine the subset size, which significantly reduces computational overhead.
On average, the winning ticket subsets yielded performance comparable to full dataset fine-tuning, with some tasks showing performance improvements. For instance, fine-tuning OPT\textsubscript{\textrm{350M}} on the SST-2 task saw a 1.2% increase in accuracy despite using only 21% of the total training data.
Comparative Analysis
The proposed H-score method is compared against an alternative dataset pruning method based on a variability metric derived from the model's probability predictions. The H-score achieved superior or comparable performance, in particular by avoiding over-reliance on high-variability data points, which tended to destabilize fine-tuning.
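For concreteness, one common way such a variability metric is defined (for example, in dataset cartography) is the standard deviation, across training epochs, of the probability the model assigns to the gold label; the sketch below assumes that definition, which may differ in detail from the baseline used in the paper.

```python
import numpy as np

def variability(gold_label_probs: np.ndarray) -> np.ndarray:
    """Per-example variability of the model's gold-label probability.

    gold_label_probs: array of shape (num_epochs, num_examples), where
    gold_label_probs[e, i] is the probability the model assigned to the
    correct label of example i after epoch e.

    Returns the standard deviation across epochs for each example; high
    values mark points the model is unstable on during fine-tuning.
    """
    return gold_label_probs.std(axis=0)

# Hypothetical gold-label probabilities over 3 epochs for 3 examples.
probs = np.array([
    [0.90, 0.2, 0.5],
    [0.95, 0.8, 0.4],
    [0.92, 0.3, 0.6],
])
print(variability(probs).round(3))  # -> [0.021 0.262 0.082]
```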
Theoretical and Practical Contributions
From a theoretical standpoint, this paper contributes to understanding the differential impact of individual data points in fine-tuning transformer models. Practically, it offers a principled approach to dataset pruning, yielding smaller, more manageable datasets without compromising performance.
Future Directions
Future research could extend this approach to a broader array of transformer models and more diverse tasks, including adversarial datasets to evaluate robustness. Additionally, there is potential to integrate this pruning technique into other machine learning workflows such as neural architecture search (NAS), where trimming down dataset size can lead to significant resource savings.
Conclusion
The paper presents a robust method for automatic dataset pruning, leveraging the H-score to reduce dataset size while maintaining high evaluation performance. By aligning the difficulty of the fine-tuning subset with the model's capabilities, this approach offers a cost-effective strategy for optimizing transformer-based LLMs, paving the way for more efficient and resource-conserving AI applications.