Can pruning make Large Language Models more efficient? (2310.04573v1)
Abstract: Transformer models have revolutionized natural language processing with their unparalleled ability to grasp complex contextual relationships. However, the vast number of parameters in these models has raised concerns regarding computational efficiency, environmental impact, and deployability on resource-limited platforms. To address these challenges, this paper investigates the application of weight pruning, a strategic reduction of model parameters based on their significance, as an optimization strategy for Transformer architectures. Through extensive experimentation, we explore various pruning methodologies, highlighting their impact on model performance, size, and computational demands. Our findings suggest that with judicious selection of pruning hyperparameters, significant reductions in model size are attainable without considerable compromise on performance. Moreover, when coupled with post-pruning fine-tuning strategies, some pruned models even exhibit enhanced generalization capabilities. This work seeks to bridge the gap between model efficiency and performance, paving the way for more scalable and environmentally responsible deep learning applications.
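To make the weight-pruning idea concrete, the sketch below applies global magnitude-based (L1) pruning to a Transformer encoder using PyTorch's `torch.nn.utils.prune` utilities. The `bert-base-uncased` checkpoint, the 40% sparsity level, and the choice of global unstructured pruning are illustrative assumptions for this sketch; the abstract does not commit to a specific library, model, or pruning rate.

```python
# Minimal sketch: global magnitude-based weight pruning of a Transformer classifier.
# Assumes PyTorch, the transformers library, and the "bert-base-uncased" checkpoint;
# the sparsity level (40%) is illustrative, not the paper's reported setting.
import torch
from torch.nn.utils import prune
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Collect the weight tensors of all linear layers
# (attention projections and feed-forward sublayers).
to_prune = [
    (module, "weight")
    for module in model.modules()
    if isinstance(module, torch.nn.Linear)
]

# Zero out the 40% of collected weights with the smallest absolute magnitude,
# ranked globally across all layers rather than per layer.
prune.global_unstructured(to_prune, pruning_method=prune.L1Unstructured, amount=0.4)

# Fold the pruning masks into the weights so the sparsity becomes permanent.
for module, name in to_prune:
    prune.remove(module, name)
```

In practice the pruned model would then be fine-tuned on the downstream task, which corresponds to the post-pruning fine-tuning step the abstract credits with recovering, and sometimes improving, generalization.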
Authors: Sia Gholami, Marwan Omar