Text Quality-Based Pruning for Efficient Training of Language Models (2405.01582v3)
Abstract: In recent times, training language models (LMs) has relied on computationally heavy training over massive datasets, which makes the training process extremely laborious. In this paper we propose a novel method for numerically evaluating text quality in large unlabelled NLP datasets in a model-agnostic manner, assigning each text instance a "quality score". Using this text quality metric, we establish a framework to identify and eliminate low-quality text instances, leading to improved training efficiency for LMs. Experimental results over multiple models and datasets demonstrate the efficacy of this approach, showing substantial gains in training effectiveness and highlighting the potential for resource-efficient LM training. For example, we observe an absolute accuracy improvement of 0.9%, averaged over 14 downstream evaluation tasks, for multiple LMs while using 40% less data and training 42% faster on the OpenWebText dataset, and an average absolute accuracy improvement of 0.8% while using 20% less data and training 21% faster on the Wikipedia dataset.
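The pruning idea described in the abstract can be sketched in a few lines: score each document, then keep only the highest-scoring fraction of the corpus. The sketch below is purely illustrative; `quality_score` is a toy heuristic standing in for the paper's (unspecified here) model-agnostic quality metric, and the names and thresholds are assumptions, not the authors' implementation.

```python
def quality_score(text: str) -> float:
    """Toy proxy for a text quality metric: rewards a high fraction of
    alphabetic characters and reasonable average word length. Stands in
    for the paper's model-agnostic scoring function."""
    if not text:
        return 0.0
    words = text.split()
    alpha_fraction = sum(c.isalpha() for c in text) / len(text)
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
    return alpha_fraction * min(avg_word_len / 5.0, 1.0)

def prune_dataset(docs, keep_fraction):
    """Keep the top `keep_fraction` of documents by quality score;
    e.g. keep_fraction=0.6 discards 40% of the data, as in the
    OpenWebText experiment reported in the abstract."""
    scored = sorted(docs, key=quality_score, reverse=True)
    k = max(1, int(len(scored) * keep_fraction))
    return scored[:k]

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "%%%% @@@ 1234 5678 ####",
    "Language models learn statistical patterns from large text corpora.",
]
kept = prune_dataset(docs, keep_fraction=0.7)  # drops the junk document
```

Because the scoring is model-agnostic, the same pruning loop can be reused across corpora and LM architectures; only the scoring function would change.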
- Falcon-40B: An open large language model with state-of-the-art performance.
- Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373.
- PIQA: Reasoning about physical commonsense in natural language.
- GPT-NeoX-20B: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745.
- All that’s ’human’ is not gold: Evaluating human evaluation of generated text.
- Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge.
- Toxicity in ChatGPT: Analyzing persona-assigned language models.
- The Pile: An 800GB dataset of diverse text for language modeling.
- A framework for few-shot language model evaluation.
- ChatGPT outperforms crowd workers for text-annotation tasks.
- OpenWebText corpus.
- spaCy: Industrial-strength Natural Language Processing in Python.
- Validating large language models with ReLM.
- The Winograd Schema Challenge. AAAI Press.
- G-Eval: NLG evaluation using GPT-4 with better human alignment.
- RoBERTa: A robustly optimized BERT pretraining approach.
- Can a suit of armor conduct electricity? A new dataset for open book question answering.
- The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only.
- Filtering, distillation, and hard negatives for vision-language pre-training. arXiv preprint arXiv:2301.02280.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
- WinoGrande: An adversarial Winograd Schema Challenge at scale.
- Large pre-trained language models contain human-like biases of what is right and wrong to do.
- Story Cloze Task: UW NLP system. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics. Association for Computational Linguistics.
- LLaMA: Open and efficient foundation language models.
- Wikimedia downloads.
- SuperGLUE: A stickier benchmark for general-purpose language understanding systems.
- Transformers: State-of-the-Art Natural Language Processing. pages 38–45. Association for Computational Linguistics.
- HellaSwag: Can a machine really finish your sentence?
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
- LIMA: Less is more for alignment.
- Vasu Sharma
- Karthik Padthe
- Newsha Ardalani
- Kushal Tirumala
- Russell Howes
- Hu Xu
- Po-Yao Huang
- Shang-Wen Li
- Armen Aghajanyan
- Gargi Ghosh
- Luke Zettlemoyer