Text Quality-Based Pruning for Efficient Training of Language Models (2405.01582v3)

Published 26 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: In recent times, training Language Models (LMs) has relied on computationally heavy training over massive datasets, which makes the training process extremely laborious. In this paper we propose a novel method for numerically evaluating text quality in large unlabelled NLP datasets in a model-agnostic manner to assign each text instance a "quality score". By proposing the text quality metric, the paper establishes a framework to identify and eliminate low-quality text instances, leading to improved training efficiency for LM models. Experimental results over multiple models and datasets demonstrate the efficacy of this approach, showcasing substantial gains in training effectiveness and highlighting the potential for resource-efficient LM training. For example, we observe an absolute accuracy improvement of 0.9% averaged over 14 downstream evaluation tasks for multiple LM models while using 40% less data and training 42% faster on the OpenWebText dataset, and a 0.8% average absolute accuracy improvement while using 20% less data and training 21% faster on the Wikipedia dataset.

Text Quality Evaluation for Efficient LLM Training

Overview of the Paper

This paper introduces a model-agnostic methodology for numerically evaluating text quality in large unlabelled NLP datasets. By assigning a "quality score" to each text instance, it allows low-quality data to be pruned. This leads to more efficient training of language models (LMs), using less data and less training time while maintaining or even improving model accuracy.

Understanding Text Quality Scores

The concept of "text quality" might seem subjective, but the authors propose a measurable and scalable method to evaluate it. Here's how they approach it:

  • Weight Calculation: Each piece of text is evaluated against 14 pre-defined heuristics, such as text complexity and syntax. Each heuristic selects a subset of the dataset, and a pre-trained LM computes the perplexity (PPL) of that subset. The heuristic is then assigned a weight based on how much its subset improves PPL compared to the entire dataset.
  • Quality Scoring: Each line of text is then scored against all the heuristics; the per-heuristic scores are combined using the heuristic weights to produce a final quality score for the line, and line scores are aggregated into a document-level score.

The novelty lies in the model-agnostic approach: once the text quality scores are computed, they can be reused to train any LM without being recalculated.
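To make the two steps above concrete, here is a minimal, hypothetical sketch in Python. The heuristic set, the `perplexity` callable, and the mean aggregation are illustrative assumptions for this sketch, not the authors' exact implementation.

```python
from typing import Callable, Dict, List

def heuristic_weights(
    lines: List[str],
    heuristics: Dict[str, Callable[[str], bool]],
    perplexity: Callable[[List[str]], float],
) -> Dict[str, float]:
    """Weight each heuristic by how much the subset of lines it selects
    lowers perplexity (PPL) relative to the full dataset."""
    base_ppl = perplexity(lines)
    raw = {}
    for name, passes in heuristics.items():
        subset = [ln for ln in lines if passes(ln)]
        subset_ppl = perplexity(subset) if subset else base_ppl
        # A subset with lower PPL than the full corpus gets a positive weight.
        raw[name] = max(base_ppl - subset_ppl, 0.0)
    total = sum(raw.values()) or 1.0
    return {name: w / total for name, w in raw.items()}

def line_quality(line: str,
                 heuristics: Dict[str, Callable[[str], bool]],
                 weights: Dict[str, float]) -> float:
    """Weighted sum of the heuristics the line satisfies."""
    return sum(weights[name] for name, passes in heuristics.items() if passes(line))

def document_quality(doc_lines: List[str],
                     heuristics: Dict[str, Callable[[str], bool]],
                     weights: Dict[str, float]) -> float:
    """Aggregate line scores into a document score (here, a simple mean)."""
    scores = [line_quality(ln, heuristics, weights) for ln in doc_lines]
    return sum(scores) / max(len(scores), 1)
```

Because the weights and per-document scores depend only on the corpus and a single pre-trained scoring model, they can be computed once and then reused when training any downstream LM.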

Pruning Lower Quality Data

Using the quality scores, the researchers pruned data that fell below a given quality threshold. Thresholds were set at different percentiles (20%, 40%, 60%, 80%), and models trained on the pruned datasets were compared against models trained on the full, unpruned datasets.
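As a rough illustration (not the authors' code), percentile-based pruning can be expressed in a few lines; the `keep_fraction` parameter and the use of numpy's quantile function are assumptions of this sketch:

```python
import numpy as np

def prune_by_quality(docs, quality_scores, keep_fraction=0.4):
    """Keep only documents whose quality score lies in the top
    `keep_fraction` of the corpus (e.g. 0.4 keeps the top 40%)."""
    cutoff = float(np.quantile(quality_scores, 1.0 - keep_fraction))
    return [doc for doc, score in zip(docs, quality_scores) if score >= cutoff]

# Example: keep the top 40% of documents by quality score.
# pruned_docs = prune_by_quality(docs, scores, keep_fraction=0.4)
```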

Experimental Results and Implications

The experiments showed promising results:

  • Training LMs on the pruned datasets (top 40% of OpenWebText and top 20% of Wikipedia) reduced training time by up to 42% and data usage by up to 40%.
  • Despite the reduction in data and training time, the models saw an absolute accuracy improvement of approximately 0.9% on OpenWebText and 0.8% on Wikipedia, averaged over 14 downstream NLP tasks.

These results suggest that a significant portion of large datasets can be low-quality or irrelevant, contributing little to model performance. By eliminating this chaff, training becomes more resource-efficient without sacrificing effectiveness.

Future Prospects and Research Directions

This research opens several avenues for further exploration. Applying these methods to larger models or more diverse datasets could test the scalability and generalizability of the approach. Additionally, refining the quality assessment to cover more nuanced aspects of text quality, such as bias or semantic coherence, could improve the robustness of the training data and, by extension, of the models trained on it.

Challenges and Limitations

The current paper, while insightful, has its limitations. It focuses on smaller models and doesn’t extend to heavyweight models with billions of parameters. Also, the generalizability of the findings to datasets significantly larger than those tested remains to be established.

Ethical Considerations

The ability to prune harmful or low-quality content from training datasets is a crucial step toward ethically responsible AI development. However, the definition of "quality" in text remains complex, intertwined with cultural and contextual nuances. Ongoing research must continue to address these ethical challenges to maximize fairness and minimize bias in AI applications.

Conclusion

The paper presents a promising methodology for improving the efficiency of LLM training by introducing a quantitative, model-agnostic framework for evaluating text quality. As AI models grow in complexity and the datasets they train on balloon in size, such innovations are crucial for making AI training more sustainable and effective.

Authors (11)
  1. Vasu Sharma (31 papers)
  2. Karthik Padthe (4 papers)
  3. Newsha Ardalani (17 papers)
  4. Kushal Tirumala (17 papers)
  5. Russell Howes (6 papers)
  6. Hu Xu (87 papers)
  7. Po-Yao Huang (31 papers)
  8. Shang-Wen Li (55 papers)
  9. Armen Aghajanyan (31 papers)
  10. Gargi Ghosh (30 papers)
  11. Luke Zettlemoyer (225 papers)