INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models (2305.06677v2)
Abstract: A salient characteristic of pre-trained language models (PTLMs) is a remarkable improvement in their generalization capability and the emergence of new capabilities with increasing model capacity and pre-training dataset size. Consequently, we are witnessing the development of enormous models pushing the state-of-the-art. It is, however, imperative to realize that this inevitably leads to prohibitively long training times, exorbitant computing costs, and a detrimental environmental impact. Significant efforts are underway to make PTLM training more efficient through innovations in model architectures, training pipelines, and loss function design, with scant attention being paid to optimizing the utility of training data. The key question we ask is whether PTLMs can be trained using only highly informative subsets of the training data while maintaining downstream performance. Building upon recent progress in informative data subset selection, we show how submodular optimization can be employed to select highly representative subsets of the training corpora, and we demonstrate that the proposed framework can be applied to efficiently train multiple PTLMs (BERT, BioBERT, GPT-2) using only a fraction of the data. Further, we perform a rigorous empirical evaluation to show that the resulting models achieve up to $\sim99\%$ of the performance of the fully-trained models. We make our framework publicly available at https://github.com/Efficient-AI/ingenious.
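To make the idea of submodular subset selection concrete, the following is a minimal sketch (not the authors' implementation, which is available in the linked repository): it greedily maximizes a facility-location function over document embeddings to pick a representative fraction of a corpus. The embeddings, sizes, and function names here are hypothetical placeholders.

```python
# Illustrative sketch: greedy facility-location subset selection over embeddings.
# f(S) = sum_i max_{j in S} sim(i, j); greedy selection gives a (1 - 1/e) guarantee.
import numpy as np

def facility_location_greedy(similarity: np.ndarray, budget: int) -> list[int]:
    """Greedily pick `budget` indices maximizing the facility-location objective."""
    n = similarity.shape[0]
    selected: list[int] = []
    best_cover = np.zeros(n)  # best_cover[i] = sim of i to its closest selected point
    for _ in range(budget):
        # Marginal gain of adding each candidate j, given current coverage.
        gains = np.maximum(similarity, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[selected] = -np.inf  # never re-pick an already-selected point
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, similarity[:, j])
    return selected

# Usage (hypothetical data): embed documents with any sentence encoder, build a
# cosine-similarity matrix, and keep ~25% of the corpus for pre-training.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
sim = emb @ emb.T
subset = facility_location_greedy(sim, budget=250)
```

In practice, a dense n-by-n similarity matrix does not scale to pre-training corpora, so the actual framework would need approximations such as partitioning the corpus or sparse/sketched similarity kernels; the sketch above only conveys the selection principle.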
Authors: H S V N S Kowndinya Renduchintala, Krishnateja Killamsetty, Sumit Bhatia, Milan Aggarwal, Ganesh Ramakrishnan, Rishabh Iyer, Balaji Krishnamurthy