
Automatic Pruning of Fine-tuning Datasets for Transformer-based Language Models (2407.08887v1)

Published 11 Jul 2024 in cs.CL and cs.LG

Abstract: Transformer-based LLMs have shown state-of-the-art performance on a variety of natural language understanding tasks. To achieve this performance, these models are first pre-trained on general corpus and then fine-tuned on downstream tasks. Previous work studied the effect of pruning the training set of the downstream tasks on the performance of the model on its evaluation set. In this work, we propose an automatic dataset pruning method for the training set of fine-tuning tasks. Our method is based on the model's success rate in correctly classifying each training data point. Unlike previous work which relies on user feedback to determine subset size, our method automatically extracts training subsets that are adapted for each pair of model and fine-tuning task. Our method provides multiple subsets for use in dataset pruning that navigate the trade-off between subset size and evaluation accuracy. Our largest subset, which we also refer to as the winning ticket subset, is on average $3 \times$ smaller than the original training set of the fine-tuning task. Our experiments on 5 downstream tasks and 2 LLMs show that, on average, fine-tuning on the winning ticket subsets results in a $0.1 \%$ increase in the evaluation performance of the model.

Authors (5)
  1. Mohammadreza Tayaranian (6 papers)
  2. Seyyed Hasan Mozafari (2 papers)
  3. Brett H. Meyer (11 papers)
  4. James J. Clark (32 papers)
  5. Warren J. Gross (75 papers)

Summary

Automatic Pruning of Fine-tuning Datasets for Transformer-based LLMs

The paper "Automatic Pruning of Fine-tuning Datasets for Transformer-based LLMs" investigates an approach to streamline the fine-tuning process of transformer-based LLMs by introducing an automatic dataset pruning method. The primary goal is to mitigate the computational burden while maintaining or even enhancing the model's performance on downstream tasks.

Methodology

The cornerstone of this method is the proposed dataset scoring function, termed the $\mathcal{H}$-score. This score is derived from the model's success rate in correctly classifying each training data point over multiple fine-tuning runs. Specifically, the $\mathcal{H}$-score assigns higher values to data points that are consistently classified correctly and lower values to those that are repeatedly misclassified.
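
A minimal sketch of how such a success-rate score can be computed, assuming predicted labels from several independent fine-tuning runs have already been collected; the function name and exact aggregation here are illustrative, not taken from the paper:

```python
import numpy as np

def h_scores(pred_labels: np.ndarray, true_labels: np.ndarray) -> np.ndarray:
    """Success-rate score per training example.

    pred_labels: (num_runs, num_examples) predicted class ids, one row per
                 independent fine-tuning run.
    true_labels: (num_examples,) gold class ids.
    Returns the fraction of runs in which each example was classified correctly.
    """
    correct = pred_labels == true_labels[None, :]   # (num_runs, num_examples) bool
    return correct.mean(axis=0)                     # score in [0, 1] per example
```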

Given a dataset $\mathcal{D}$ and a set of $\mathcal{H}$-scores, different subsets of the training data can be constructed by excluding data points with the highest and lowest $\mathcal{H}$-scores. These scores demarcate points either too easy or too difficult for the model to classify, thus providing a rational basis for pruning.
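
As a companion to the scoring sketch above, the following illustrates one way to form a pruned subset by dropping score extremes. The quantile thresholds are placeholder assumptions; the paper derives multiple subsets, including the winning ticket subset, from the score distribution rather than from fixed quantiles.

```python
import numpy as np

def prune_by_score(scores: np.ndarray, low_q: float = 0.1, high_q: float = 0.9) -> np.ndarray:
    """Indices of examples whose score lies strictly between two quantiles.

    Examples below the low quantile (consistently misclassified) and above the
    high quantile (consistently easy) are excluded from the pruned subset.
    """
    lo, hi = np.quantile(scores, [low_q, high_q])
    keep = (scores > lo) & (scores < hi)
    return np.nonzero(keep)[0]
```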

Empirical Analysis

The paper evaluates the effectiveness of the pruned subsets across five downstream tasks: MNLI, SNLI, SST-2, SQuAD v2, and RACE, using two models: RoBERTa-Large and OPT-350M. Pruned subsets are created by removing both the most difficult and the easiest data points. The largest of these subsets, termed the winning ticket subset, retains approximately 33% of the original training set. The experimental results show that fine-tuning on these winning ticket subsets leads to a 0.1% increase in evaluation performance on average.

Results and Implications

The findings demonstrate that the proposed $\mathcal{H}$-score effectively identifies and prunes counterproductive training data, yielding efficient sub-datasets for fine-tuning. Because the subsets are derived automatically for each model-task pair, this data-driven approach eliminates the need for manual hyperparameter tuning to determine the subset size, which significantly reduces computational overhead.

On average, the winning ticket subsets yielded performance comparable to full-dataset fine-tuning, with some tasks showing performance improvements. For instance, fine-tuning OPT-350M on the SST-2 task saw a 1.2% increase in accuracy despite using only 21% of the total training data.

Comparative Analysis

The proposed $\mathcal{H}$-score method is compared against an alternative dataset pruning method based on variability metrics derived from the model's probability predictions. The $\mathcal{H}$-score demonstrated superior or comparable performance, particularly by avoiding over-reliance on data points with high variability, which were prone to destabilizing the fine-tuning process.
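
For intuition, a minimal sketch of a variability-style score, here taken as the standard deviation of the gold-label probability across training epochs in the spirit of dataset cartography; the exact metric used by the paper's baseline may differ:

```python
import numpy as np

def variability_scores(gold_probs: np.ndarray) -> np.ndarray:
    """Variability per training example.

    gold_probs: (num_epochs, num_examples) probability the model assigns to the
                gold label at each epoch. A higher output means the model's
                confidence on that example fluctuates more during fine-tuning.
    """
    return gold_probs.std(axis=0)
```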

Theoretical and Practical Contributions

From a theoretical standpoint, this paper contributes to understanding the differential impact of individual data points in fine-tuning transformer models. Practically, it offers a principled approach to dataset pruning, yielding smaller, more manageable datasets without compromising performance.

Future Directions

Future research could extend this approach to a broader array of transformer models and more diverse tasks, including adversarial datasets to evaluate robustness. Additionally, there is potential to integrate this pruning technique into other machine learning workflows such as neural architecture search (NAS), where trimming down dataset size can lead to significant resource savings.

Conclusion

The paper presents a robust method for automatic dataset pruning, leveraging the $\mathcal{H}$-score to dynamically reduce dataset size while maintaining high evaluation performance. By aligning the complexity of the fine-tuning subset with the model's capabilities, this approach offers a cost-effective strategy for optimizing transformer-based LLMs, paving the way for more efficient and resource-conserving AI applications.