A Stability Analysis of Fine-Tuning a Pre-Trained Model (2301.09820v2)
Abstract: Fine-tuning a pre-trained model (such as BERT, ALBERT, RoBERTa, T5, or GPT) has proven to be one of the most promising paradigms in recent NLP research. However, numerous recent works indicate that fine-tuning suffers from an instability problem: tuning the same model under the same setting can yield significantly different performance. Many recent works have proposed methods to address this problem, but there is no theoretical understanding of why and how these methods work. In this paper, we propose a novel theoretical stability analysis of fine-tuning that focuses on two commonly used settings, namely, full fine-tuning and head tuning. We define the stability under each setting and prove the corresponding stability bounds. The theoretical bounds explain why and how several existing methods can stabilize the fine-tuning procedure. Besides explaining most of the empirical observations, our theoretical framework can also help in the design of effective and provable methods. Based on our theory, we propose three novel strategies to stabilize the fine-tuning procedure, namely, Maximal Margin Regularizer (MMR), Multi-Head Loss (MHLoss), and Self Unsupervised Re-Training (SURT). We extensively evaluate our proposed approaches on 11 widely used real-world benchmark datasets, as well as hundreds of synthetic classification datasets. The experimental results show that our proposed methods significantly stabilize the fine-tuning procedure and corroborate our theoretical analysis.
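To make the instability notion above concrete, here is a minimal sketch of how run-to-run variability is commonly quantified. It is not code from the paper: the synthetic data, the linear head standing in for the head-tuning setting, and every function name and hyperparameter below are illustrative assumptions. The head is trained repeatedly on the same frozen features with only the random seed (and hence the initialization and data order) changing, and the standard deviation of test accuracy across seeds serves as a simple instability proxy.

```python
# Illustrative sketch (not from the paper): measure fine-tuning instability as
# the run-to-run spread of test accuracy when only the random seed changes.
# A logistic-regression head on fixed synthetic features mimics head tuning.
import numpy as np

def make_synthetic_data(rng, n=500, d=32):
    """Two Gaussian classes standing in for frozen pre-trained features."""
    X = np.vstack([rng.normal(-0.3, 1.0, size=(n // 2, d)),
                   rng.normal(+0.3, 1.0, size=(n // 2, d))])
    y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])
    return X, y

def train_head(X, y, seed, epochs=20, lr=0.1):
    """SGD on the logistic loss; the seed controls init and data order."""
    rng = np.random.default_rng(seed)
    w, b = rng.normal(scale=0.01, size=X.shape[1]), 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):          # seed-dependent data order
            p = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))
            g = p - y[i]                           # gradient of the logistic loss
            w -= lr * g * X[i]
            b -= lr * g
    return w, b

def accuracy(w, b, X, y):
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

data_rng = np.random.default_rng(0)
X_train, y_train = make_synthetic_data(data_rng)
X_test, y_test = make_synthetic_data(data_rng)

# Same data, same hyperparameters, different seeds: the standard deviation of
# test accuracy is one simple proxy for the instability discussed above.
accs = [accuracy(*train_head(X_train, y_train, seed=s), X_test, y_test)
        for s in range(10)]
print(f"mean acc = {np.mean(accs):.3f}, std across seeds = {np.std(accs):.3f}")
```

Under a proxy of this kind, a stabilization method is judged by whether it shrinks the spread across seeds without hurting the mean; this is the sense in which the proposed MMR, MHLoss, and SURT strategies (defined in the paper body, not sketched here) are evaluated.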
- TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.
- Muppet: Massive multi-task representations with pre-finetuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5799–5811.
- Better fine-tuning by reducing representational collapse. In International Conference on Learning Representations.
- Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning, pages 254–263. PMLR.
- David Barrett and Benoit Dherin. 2021. Implicit gradient regularization. In International Conference on Learning Representations.
- Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854.
- The fifth PASCAL recognizing textual entailment challenge. In TAC.
- Christopher M Bishop and Nasser M Nasrabadi. 2006. Pattern recognition and machine learning, volume 4. Springer.
- Olivier Bousquet and André Elisseeff. 2002. Stability and generalization. The Journal of Machine Learning Research, 2:499–526.
- Convex optimization. Cambridge University Press.
- JAX: Composable transformations of Python+NumPy programs.
- On the usability of transformers-based models for a french question-answering task. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 244–255.
- Zachary Charles and Dimitris Papailiopoulos. 2018. Stability and generalization of learning algorithms that converge to global optima. In International conference on machine learning, pages 745–754. PMLR.
- An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9640–9649.
- BoolQ: Exploring the surprising difficulty of natural yes/no questions. In NAACL-HLT (1).
- The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer.
- The CommitmentBank: Investigating projection in naturally occurring discourse. In Proceedings of Sinn und Bedeutung, volume 23, pages 107–124.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
- Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping.
- Bill Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing (IWP2005).
- Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations.
- On the effectiveness of parameter-efficient fine-tuning. arXiv preprint arXiv:2211.15583.
- The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9.
- Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360.
- Robust transfer learning with pretrained language models through adapters. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 854–861.
- Train faster, generalize better: Stability of stochastic gradient descent. In International conference on machine learning, pages 1225–1234. PMLR.
- On the effectiveness of adapter-based tuning for pretrained language model adaptation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2208–2222.
- Using pre-training can improve model robustness and uncertainty. In International Conference on Machine Learning, pages 2712–2721. PMLR.
- Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. The Journal of Machine Learning Research, 22(1):10882–11005.
- Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR.
- Noise stability regularization for improving BERT fine-tuning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3229–3241.
- Improving transformer optimization through better initialization. In International Conference on Machine Learning, pages 4475–4483. PMLR.
- Do we need zero training loss after achieving zero training error? In Proceedings of the 37th International Conference on Machine Learning, pages 4604–4614.
- Ziwei Ji and Matus Telgarsky. 2019. The implicit bias of gradient descent on nonseparable data. In Conference on Learning Theory, pages 1772–1798. PMLR.
- Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262.
- Jonas Moritz Kohler and Aurelien Lucchi. 2017. Sub-sampled cubic regularization for non-convex optimization. In International Conference on Machine Learning, pages 1895–1904. PMLR.
- Fine-tuning can distort pretrained features and underperform out-of-distribution. In International Conference on Learning Representations.
- Ilja Kuzborskij and Christoph Lampert. 2018. Data-dependent stability of stochastic gradient descent. In International Conference on Machine Learning, pages 2815–2824. PMLR.
- ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.
- Mixout: Effective regularization to finetune large-scale pretrained language models. In International Conference on Learning Representations.
- Yunwen Lei and Yiming Ying. 2020. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning, pages 5809–5819. PMLR.
- The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.
- P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602.
- Towards robust neural networks via random self-ensemble. In Proceedings of the European Conference on Computer Vision (ECCV), pages 369–385.
- RoBERTa: A robustly optimized BERT pretraining approach.
- On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines. In International Conference on Learning Representations.
- Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003.
- Yurii Nesterov et al. 2018. Lectures on convex optimization, volume 137. Springer.
- PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.
- Deep contextualized word representations. In NAACL.
- To tune or not to tune? adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 7–14.
- Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks.
- jiant 2.0: A software toolkit for research on general-purpose text understanding models. http://jiant.info/.
- Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1267–1273.
- Intermediate-task transfer learning with pretrained language models: When and why does it work? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5231–5247.
- Evani Radiya-Dixit and Xin Wang. 2020. How fine can fine-tuning be? learning efficient language models. In International Conference on Artificial Intelligence and Statistics, pages 2435–2443. PMLR.
- Stable rank normalization for improved generalization in neural networks and GANs. In International Conference on Learning Representations.
- Matan Schliserman and Tomer Koren. 2022. Stability vs implicit bias of gradient methods on separable data and beyond. In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 3380–3394. PMLR.
- Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding machine learning: From theory to algorithms. Cambridge University Press.
- Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635–2670.
- Gradient matching for domain generalization. In International Conference on Learning Representations.
- On the origin of implicit regularization in stochastic gradient descent. In International Conference on Learning Representations.
- Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.
- The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878.
- Training neural networks with fixed sparse masks. Advances in Neural Information Processing Systems, 34.
- SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 32.
- GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.
- Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641.
- Why do pretrained language models help in downstream tasks? an analysis of head and prompt tuning. Advances in Neural Information Processing Systems, 34:16158–16170.
- How does sharpness-aware minimization minimize sharpness? In OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop).
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
- Raise a child in large language model: Towards effective and generalizable fine-tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9514–9528.
- Chenghao Yang and Xuezhe Ma. 2022. Improving stability of fine-tuning pretrained language models via component-wise gradient norm clipping. arXiv preprint arXiv:2210.10325.
- BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199.
- Revisiting few-sample BERT fine-tuning. In International Conference on Learning Representations.
- Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning, pages 12697–12706. PMLR.
- FreeLB: Enhanced adversarial training for natural language understanding. In ICLR.
- MoEBERT: From BERT to mixture-of-experts via importance-guided adaptation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1610–1623.
Authors: Zihao Fu, Anthony Man-Cho So, Nigel Collier