Text Quality Evaluation for Efficient LLM Training
Overview of the Paper
This paper introduces a model-agnostic methodology for numerically evaluating text quality in large unlabelled NLP datasets. By assigning a "quality score" to each text instance, it allows low-quality data to be pruned. This makes training of large language models (LLMs) more efficient, using less data and less training time, while maintaining or even improving model accuracy.
Understanding Text Quality Scores
The concept of "text quality" might seem subjective, but the authors propose a measurable and scalable method to evaluate it. Here's how they approach it:
- Weight Calculation: Each piece of text is evaluated against 14 pre-defined heuristics covering properties such as text complexity and syntax. Each heuristic selects a subset of the dataset, and a pre-trained LM measures the perplexity (PPL) of that subset. A heuristic's weight reflects how much its subset lowers the PPL relative to the entire dataset.
- Quality Scoring: Each line of text is then scored by every heuristic, and the per-heuristic scores are combined according to the heuristic weights into a final quality score for the line. Line scores are then aggregated to obtain a score for each document (a minimal sketch follows this list).
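The sketch below illustrates this two-stage procedure under simplified assumptions. The two toy heuristics (mean word length and the share of alphabetic characters), the perplexity numbers, and the exact weighting and aggregation formulas are placeholders standing in for the paper's 14 heuristics and measured PPL values; this is one plausible reading of the description above, not the authors' implementation.

```python
from typing import Callable, Dict

# Hypothetical heuristics: each maps a line of text to a raw score in [0, 1].
# They stand in for the paper's 14 heuristics, which are not reproduced here.
HEURISTICS: Dict[str, Callable[[str], float]] = {
    "mean_word_length": lambda line: min(
        sum(len(w) for w in line.split()) / max(len(line.split()), 1) / 10.0, 1.0
    ),
    "alpha_ratio": lambda line: sum(c.isalpha() for c in line) / max(len(line), 1),
}


def heuristic_weights(ppl_full: float, ppl_per_heuristic: Dict[str, float]) -> Dict[str, float]:
    """Weight each heuristic by how much the subset it selects lowers perplexity
    relative to the full dataset, then normalize the weights to sum to 1."""
    gains = {name: max(ppl_full - ppl, 0.0) for name, ppl in ppl_per_heuristic.items()}
    total = sum(gains.values()) or 1.0
    return {name: gain / total for name, gain in gains.items()}


def line_score(line: str, weights: Dict[str, float]) -> float:
    """Combine all heuristic scores for a single line, weighted per heuristic."""
    return sum(weights[name] * fn(line) for name, fn in HEURISTICS.items())


def document_score(doc: str, weights: Dict[str, float]) -> float:
    """Aggregate line scores into a document-level quality score (simple mean here)."""
    lines = [l for l in doc.splitlines() if l.strip()]
    if not lines:
        return 0.0
    return sum(line_score(l, weights) for l in lines) / len(lines)


# Illustrative perplexities, measured once with a pre-trained LM (numbers are made up).
weights = heuristic_weights(
    ppl_full=32.0,
    ppl_per_heuristic={"mean_word_length": 29.5, "alpha_ratio": 27.8},
)
print(document_score("A short, clean paragraph of text.\nAnother reasonable line.", weights))
```

Because the heuristic weights are computed once against a reference LM, the resulting document scores can be cached and reused across training runs, which is what makes the approach model-agnostic in practice.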
The novelty here lies in the model-agnostic approach: once the text quality scores are determined, they can be reused for any LM, avoiding repeated calculations.
Pruning Lower Quality Data
Using the quality scores, the researchers pruned data that fell below a given quality threshold. Thresholds were set at different percentiles (20%, 40%, 60%, 80%), and models trained on the pruned datasets were compared against models trained on the full, unpruned datasets.
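Below is a minimal sketch of this percentile-based pruning step. The prune_by_quality helper and the random placeholder scores are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np


def prune_by_quality(docs, scores, keep_top_fraction=0.4):
    """Keep only documents whose quality score falls in the top `keep_top_fraction`
    of the score distribution (e.g. 0.4 keeps the top 40%)."""
    threshold = np.percentile(scores, 100 * (1 - keep_top_fraction))
    return [doc for doc, score in zip(docs, scores) if score >= threshold]


# Example sweep over the four retention settings mentioned above.
scores = np.random.rand(1000)          # placeholder quality scores
docs = [f"doc_{i}" for i in range(1000)]
for frac in (0.8, 0.6, 0.4, 0.2):      # keep top 80%, 60%, 40%, 20%
    kept = prune_by_quality(docs, scores, keep_top_fraction=frac)
    print(f"keep top {frac:.0%}: {len(kept)} documents")
```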
Experimental Results and Implications
The experiments showed promising results:
- Training LMs on the pruned datasets (the top 40% of OpenWebText and the top 20% of Wikipedia) reduced training time by up to 42% and used up to 40% less data.
- Despite the reduction in data and training time, the models saw an absolute accuracy improvement of approximately 0.9% on OpenWebText and 0.8% on Wikipedia, averaged over 14 downstream NLP tasks.
These results suggest that a significant portion of large datasets can be low-quality or irrelevant, contributing little to model performance. Eliminating this chaff makes training more resource-efficient without sacrificing effectiveness.
Future Prospects and Research Directions
This research opens several avenues for further exploration. Applying these methods to larger models or more diverse datasets could test the scalability and generalizability of the approach. Additionally, refining the quality assessment to cover more nuanced aspects of text quality, such as bias or semantic coherence, could enhance the robustness of the training data and, by extension, of the models trained on it.
Challenges and Limitations
The current paper, while insightful, has its limitations. It focuses on smaller models and doesn’t extend to heavyweight models with billions of parameters. Also, the generalizability of the findings to datasets significantly larger than those tested remains to be established.
Ethical Considerations
The ability to prune harmful or low-quality content from training datasets is a crucial step toward ethically responsible AI development. However, the definition of "quality" in text remains complex, intertwined with cultural and contextual nuances. Ongoing research must continue to address these ethical challenges to maximize fairness and minimize bias in AI applications.
Conclusion
The paper presents a promising methodology for improving the efficiency of LLM training by introducing a quantitative, model-agnostic framework for evaluating text quality. As AI models grow in complexity and the datasets they train on balloon in size, such innovations are crucial for making AI training more sustainable and effective.