LayerNorm: A key component in parameter-efficient fine-tuning (2403.20284v1)
Abstract: Fine-tuning a pre-trained model such as Bidirectional Encoder Representations from Transformers (BERT) has proven to be an effective way to solve many NLP tasks. However, because state-of-the-art NLP models, including BERT, contain a very large number of parameters, fine-tuning is computationally expensive. One attractive solution is parameter-efficient fine-tuning, which modifies only a small segment of the model while keeping the remainder frozen. Yet it remains unclear which segment of the BERT model is crucial to fine-tune. In this paper, we first analyze the different components of the BERT model to pinpoint which one undergoes the most significant change after fine-tuning. We find that the output LayerNorm changes more than any other component when the model is fine-tuned on different General Language Understanding Evaluation (GLUE) tasks. We then show that fine-tuning only the LayerNorm reaches performance comparable, and in some cases superior, to full fine-tuning and to other parameter-efficient fine-tuning methods. Moreover, we use Fisher information to identify the most critical subset of LayerNorm parameters and demonstrate that many NLP tasks in the GLUE benchmark can be solved by fine-tuning only a small portion of LayerNorm with negligible performance degradation.
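The recipe described in the abstract is straightforward to prototype. The sketch below is not the authors' released code; it assumes the Hugging Face `transformers` library, the `bert-base-uncased` checkpoint, and a dataloader whose batches include a `labels` key so the model returns a loss. It freezes every parameter except the LayerNorm weights and biases (plus the randomly initialized classification head), and adds a hypothetical `layernorm_fisher_scores` helper that approximates the diagonal Fisher information of the LayerNorm parameters by accumulating squared gradients.

```python
# Minimal sketch of LayerNorm-only fine-tuning (assumptions: Hugging Face
# `transformers`, the `bert-base-uncased` checkpoint, and a dataloader
# whose batches include a `labels` key so the model returns a loss).
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Freeze everything, then unfreeze only LayerNorm parameters and the
# task-specific classification head (which is randomly initialized and
# must be trained regardless).
for name, param in model.named_parameters():
    param.requires_grad = ("LayerNorm" in name) or name.startswith("classifier")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} / {total:,} "
      f"({100.0 * trainable / total:.2f}%)")

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)


def layernorm_fisher_scores(model, dataloader, device="cpu"):
    """Hypothetical helper: approximate the diagonal Fisher information of
    LayerNorm parameters by accumulating squared gradients over a few
    labeled batches; higher scores suggest more task-critical entries."""
    scores = {
        name: torch.zeros_like(param)
        for name, param in model.named_parameters()
        if "LayerNorm" in name
    }
    model.to(device).train()
    for batch in dataloader:
        model.zero_grad()
        outputs = model(**{k: v.to(device) for k, v in batch.items()})
        outputs.loss.backward()
        for name, param in model.named_parameters():
            if "LayerNorm" in name and param.grad is not None:
                scores[name] += param.grad.detach() ** 2
    return scores
```

With the mask above, only a small fraction of a percent of BERT-base's roughly 110M parameters remains trainable; how Fisher scores are thresholded to select the "small portion of LayerNorm" is a detail of the paper, not of this sketch.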
Authors: Taha ValizadehAslani, Hualou Liang