From PEFT to DEFT: Parameter Efficient Finetuning for Reducing Activation Density in Transformers (2402.01911v2)
Abstract: Pretrained language models (PLMs) have become the de facto starting point for fine-tuning on downstream tasks. However, as model sizes continue to increase, traditional fine-tuning of all parameters becomes challenging. To address this, parameter-efficient fine-tuning (PEFT) methods have gained popularity as a means to adapt PLMs effectively. In parallel, recent studies have revealed the presence of activation sparsity within the intermediate outputs of the multilayer perceptron (MLP) blocks in transformers. Low activation density enables efficient model inference on sparsity-aware hardware. Building upon this insight, in this work we propose a novel density loss that encourages higher activation sparsity (equivalently, lower activation density) in pre-trained models. We demonstrate the effectiveness of our approach using mainstream PEFT techniques, including QLoRA, LoRA, Adapter, and Prompt/Prefix Tuning, to facilitate efficient model adaptation across diverse downstream tasks. Experiments show that our proposed method, \textbf{DEFT} (Density-Efficient Fine-Tuning), consistently reduces activation density by up to \textbf{44.94\%} on RoBERTa$_{\mathrm{Large}}$, and by \textbf{53.19\%} (encoder density) and \textbf{90.60\%} (decoder density) on Flan-T5$_{\mathrm{XXL}}$ (\textbf{11B}) compared to PEFT, on the GLUE and QA (SQuAD) benchmarks respectively. We also introduce \textbf{ADA-DEFT}, an adaptive variant of our DEFT approach, which achieves significant memory and runtime savings during inference. For instance, ADA-DEFT reduces runtime by \textbf{8.79\%} and memory usage by \textbf{17.46\%} on Flan-T5$_{\mathrm{XL}}$, and by \textbf{2.79\%} and \textbf{2.54\%} respectively on Flan-T5$_{\mathrm{XXL}}$. Additionally, we show that DEFT works complementarily with quantized and pruned models.
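The abstract pairs standard PEFT fine-tuning with a density loss on the MLP activations. The snippet below is a minimal illustrative sketch, not the paper's exact formulation: it combines LoRA (via Hugging Face `peft`) with a simple L1-style penalty on the post-GELU intermediate activations of `roberta-base`, captured through forward hooks. The choice of penalty, the hook placement, and the `density_weight` coefficient are assumptions made for illustration.

```python
# Illustrative sketch of a DEFT-style setup (assumed form of the density loss).
import torch
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Capture the post-GELU intermediate outputs of every MLP block with forward hooks.
mlp_activations = []

def save_activation(module, inputs, output):
    mlp_activations.append(output)

for layer in model.roberta.encoder.layer:
    layer.intermediate.register_forward_hook(save_activation)

# Standard LoRA adaptation of the attention projections (the PEFT part).
lora_config = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16,
                         target_modules=["query", "value"])
model = get_peft_model(model, lora_config)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
density_weight = 0.1  # hypothetical trade-off coefficient, not from the paper

batch = tokenizer(["a toy positive example", "a toy negative example"],
                  return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])

for step in range(3):  # toy training loop
    mlp_activations.clear()
    outputs = model(**batch, labels=labels)
    # L1-style penalty: driving the mean absolute activation toward zero
    # encourages sparser (lower-density) MLP activations.
    density_loss = torch.stack([a.abs().mean() for a in mlp_activations]).mean()
    loss = outputs.loss + density_weight * density_loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Only the LoRA and classifier parameters are updated; the density penalty simply shapes the activations that the frozen MLP blocks produce under the adapted model.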
- Losing heads in the lottery: Pruning transformer attention in neural machine translation. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pp. 2664–2674. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.emnlp-main.211. URL https://doi.org/10.18653/v1/2020.emnlp-main.211.
- DSEE: dually sparsity-embedded efficient tuning of pre-trained language models. CoRR, abs/2111.00160, 2021. URL https://arxiv.org/abs/2111.00160.
- Scaling instruction-finetuned language models. ArXiv, abs/2210.11416, 2022. URL https://api.semanticscholar.org/CorpusID:253018554.
- LLM.int8(): 8-bit matrix multiplication for transformers at scale. CoRR, abs/2208.07339, 2022. doi: 10.48550/arXiv.2208.07339. URL https://doi.org/10.48550/arXiv.2208.07339.
- QLoRA: Efficient finetuning of quantized LLMs. CoRR, abs/2305.14314, 2023. doi: 10.48550/arXiv.2305.14314. URL https://doi.org/10.48550/arXiv.2305.14314.
- BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805, 2019.
- An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv, abs/2010.11929, 2020. URL https://api.semanticscholar.org/CorpusID:225039882.
- Gaussian error linear units (GELUs). arXiv preprint, 2016. URL https://api.semanticscholar.org/CorpusID:125617073.
- Parameter-efficient transfer learning for NLP. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 2790–2799. PMLR, 2019. URL http://proceedings.mlr.press/v97/houlsby19a.html.
- Universal language model fine-tuning for text classification. In Gurevych, I. and Miyao, Y. (eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pp. 328–339. Association for Computational Linguistics, 2018. doi: 10.18653/v1/P18-1031. URL https://aclanthology.org/P18-1031/.
- LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
- Adversarial sparsity attacks on deep neural networks. ArXiv, abs/2006.08020, 2020.
- Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- Beyond distillation: Task-level mixture-of-experts for efficient inference. In Moens, M., Huang, X., Specia, L., and Yih, S. W. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021, pp. 3577–3599. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.findings-emnlp.304. URL https://doi.org/10.18653/v1/2021.findings-emnlp.304.
- Inducing and exploiting activation sparsity for fast inference on deep neural networks. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 5533–5543. PMLR, 2020. URL http://proceedings.mlr.press/v119/kurtz20a.html.
- Minimizing energy consumption of deep learning models by energy-aware training. arXiv preprint arXiv:2307.00368, 2023.
- SNIP: Single-shot network pruning based on connection sensitivity. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=B1VZqjAcYX.
- The power of scale for parameter-efficient prompt tuning. In Moens, M., Huang, X., Specia, L., and Yih, S. W. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pp. 3045–3059. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.emnlp-main.243. URL https://doi.org/10.18653/v1/2021.emnlp-main.243.
- Differentiable subset pruning of transformer heads. Trans. Assoc. Comput. Linguistics, 9:1442–1459, 2021. doi: 10.1162/tacl_a_00436. URL https://doi.org/10.1162/tacl_a_00436.
- Prefix-tuning: Optimizing continuous prompts for generation. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pp. 4582–4597. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.acl-long.353. URL https://doi.org/10.18653/v1/2021.acl-long.353.
- Large models are parsimonious learners: Activation sparsity in trained transformers. CoRR, abs/2210.06313, 2022. doi: 10.48550/arXiv.2210.06313. URL https://doi.org/10.48550/arXiv.2210.06313.
- RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692, 2019. URL https://api.semanticscholar.org/CorpusID:198953378.
- Fixing weight decay regularization in Adam. ArXiv, abs/1711.05101, 2017.
- PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
- Are sixteen heads really better than one? In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 14014–14024, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/2c601ad9d2ff9bc8b282670cdd54f69f-Abstract.html.
- AdapterHub: A framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 46–54, Online, October 2020a. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.7. URL https://aclanthology.org/2020.emnlp-demos.7.
- MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7654–7673, Online, November 2020b. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.617. URL https://aclanthology.org/2020.emnlp-main.617.
- Language models are unsupervised multitask learners. 2019. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
- Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
- Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation AI scale. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 18332–18346. PMLR, 2022. URL https://proceedings.mlr.press/v162/rajbhandari22a.html.
- SQuAD: 100,000+ questions for machine comprehension of text. ArXiv, abs/1606.05250, 2016.
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR, abs/1910.01108, 2019. URL http://arxiv.org/abs/1910.01108.
- Sponge examples: Energy-latency attacks on neural networks. In 2021 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 212–231. IEEE, 2021.
- Energy and policy considerations for deep learning in NLP. In Korhonen, A., Traum, D. R., and Màrquez, L. (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pp. 3645–3650. Association for Computational Linguistics, 2019. doi: 10.18653/v1/p19-1355. URL https://doi.org/10.18653/v1/p19-1355.
- A simple and effective pruning approach for large language models. CoRR, abs/2306.11695, 2023. doi: 10.48550/arXiv.2306.11695. URL https://doi.org/10.48550/arXiv.2306.11695.
- Pruning neural networks without any data by iteratively conserving synaptic flow. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/46a4378f835dc8040c8057beb6a2da52-Abstract.html.
- Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Korhonen, A., Traum, D. R., and Màrquez, L. (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pp. 5797–5808. Association for Computational Linguistics, 2019. doi: 10.18653/v1/p19-1580. URL https://doi.org/10.18653/v1/p19-1580.
- GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5446. URL https://aclanthology.org/W18-5446.
- Picking winning tickets before training by preserving gradient flow. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=SkgsACVKPH.
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, Online, October 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. URL https://aclanthology.org/2020.emnlp-demos.6.
- GOBO: quantizing attention-based NLP models for low latency and energy efficient inference. In 53rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2020, Athens, Greece, October 17-21, 2020, pp. 811–824. IEEE, 2020. doi: 10.1109/MICRO50266.2020.00071. URL https://doi.org/10.1109/MICRO50266.2020.00071.
- OPT: Open pre-trained transformer language models. ArXiv, abs/2205.01068, 2022. URL https://api.semanticscholar.org/CorpusID:248496292.
- MoEfication: Transformer feed-forward layers are mixtures of experts. In Findings of the Association for Computational Linguistics: ACL 2022, 2022.
Authors: Bharat Runwal, Tejaswini Pedapati, Pin-Yu Chen