Task-Specific Skill Localization in Fine-tuned Language Models (2302.06600v2)

Published 13 Feb 2023 in cs.CL and cs.LG

Abstract: Pre-trained language models can be fine-tuned to solve diverse NLP tasks, including in few-shot settings. Thus fine-tuning allows the model to quickly pick up task-specific "skills," but there has been limited study of where these newly-learnt skills reside inside the massive model. This paper introduces the term skill localization for this problem and proposes a solution. Given the downstream task and a model fine-tuned on that task, a simple optimization is used to identify a very small subset of parameters (~0.01% of model parameters) responsible for (>95%) of the model's performance, in the sense that grafting the fine-tuned values for just this tiny subset onto the pre-trained model gives performance almost as well as the fine-tuned model. While reminiscent of recent works on parameter-efficient fine-tuning, the novel aspects here are that: (i) No further re-training is needed on the subset (unlike, say, with lottery tickets). (ii) Notable improvements are seen over vanilla fine-tuning with respect to calibration of predictions in-distribution (40-90% error reduction) as well as the quality of predictions out-of-distribution (OOD). In models trained on multiple tasks, a stronger notion of skill localization is observed, where the sparse regions corresponding to different tasks are almost disjoint, and their overlap (when it happens) is a proxy for task similarity. Experiments suggest that localization via grafting can assist certain forms of continual learning.

Task-Specific Skill Localization in Fine-tuned Language Models

Overview

The paper "Task-Specific Skill Localization in Fine-tuned LLMs" addresses the phenomenon of skill acquisition in pre-trained LLMs following fine-tuning for NLP tasks. The focus is on identifying the specific subset of parameters responsible for performing well in a fine-tuned model, termed as skill localization. The approach proposed is intriguing, involving an optimization method that identifies these critical parameters, the so-called 'grafting region', which constitutes about 0.01% of the entire model's parameters yet accounts for most of the model's task performance.
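
To make the grafting region concrete, the sketch below ranks parameters by how much fine-tuning moved them and keeps roughly 0.01% of them. This magnitude heuristic is only an illustration of the object being constructed; the paper identifies the region with a small optimization over a mask, and the function name and signature here are hypothetical.

```python
import torch

def topk_movement_mask(pretrained, finetuned, fraction=1e-4):
    """Illustrative heuristic: mark the `fraction` of parameters that moved
    the most during fine-tuning as a candidate grafting region.
    (The paper instead optimizes the mask for task performance; this is a
    simplified stand-in, not the authors' procedure.)"""
    # Absolute movement of every parameter tensor between the two checkpoints.
    movements = {name: (finetuned[name] - pretrained[name]).abs()
                 for name in pretrained}
    # Global threshold so that only ~`fraction` of all parameters are kept.
    all_moves = torch.cat([m.flatten() for m in movements.values()])
    k = max(1, int(fraction * all_moves.numel()))
    threshold = torch.topk(all_moves, k).values.min()
    return {name: (move >= threshold) for name, move in movements.items()}
```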

Methodology

The core contribution is model grafting, which achieves skill localization without any additional re-training. Grafting copies the fine-tuned values of the identified sparse parameters onto the pre-trained model, leaving all other parameters at their pre-trained values. The grafted model performs nearly as well as the fully fine-tuned one, while showing a 40-90% reduction in calibration error, better out-of-distribution predictions, and less susceptibility to catastrophic forgetting. The idea extends to multi-task and continual learning, where nearly disjoint parameter subsets emerge for different tasks, pointing to a way of transferring skills across related tasks.
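
A minimal sketch of the grafting step itself, assuming `pretrained` and `finetuned` are state dicts of the same architecture and `masks` is a dictionary of boolean tensors marking the sparse region (all names are placeholders, not the authors' code):

```python
import torch

@torch.no_grad()
def graft(pretrained, finetuned, masks):
    """Copy fine-tuned values onto the pre-trained checkpoint only inside the
    masked region; every other parameter keeps its pre-trained value."""
    grafted = {}
    for name, theta_pre in pretrained.items():
        gamma = masks[name].to(theta_pre.dtype)
        grafted[name] = theta_pre + gamma * (finetuned[name] - theta_pre)
    return grafted

# Hypothetical usage: the grafted state dict is evaluated directly,
# without any further training on the selected subset.
# model.load_state_dict(graft(pre_sd, ft_sd, masks))
```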

Numerical Results

Key quantitative findings are that sparse graft regions comprising only about 0.01% of model parameters recover over 95% of the performance of the fully fine-tuned model, with additional calibration and generalization benefits. For models fine-tuned on multiple tasks, the regions corresponding to different tasks show minimal overlap, indicating a clear demarcation of task-specific parameters and hinting at compositional skill capabilities; where overlap does occur, its size acts as a proxy for task similarity.
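
One natural way to quantify the overlap between two tasks' regions is an intersection-over-union of their grafting masks; the function below is an assumed, illustrative statistic rather than the exact measure reported in the paper.

```python
def mask_overlap(masks_a, masks_b):
    """Intersection-over-union of two sparse grafting regions, aggregated over
    all parameter tensors; values near 0 indicate nearly disjoint regions."""
    inter = union = 0
    for name in masks_a:
        a, b = masks_a[name], masks_b[name]
        inter += (a & b).sum().item()
        union += (a | b).sum().item()
    return inter / union if union else 0.0
```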

Implications and Future Directions

This research has notable implications for parameter-efficient model deployment, potentially reducing the computational and storage overhead of fine-tuning. The observed calibration improvements may matter in safety-critical NLP applications, where confidence calibration is crucial. The work also opens avenues for transferring learned capabilities through minimal parameter grafting and offers insights into model interpretability and explainability. The identified, largely disjoint task-specific regions provide a new lens on modular task learning and on continual learning frameworks that avoid forgetting previously learned skills. Future research could investigate the underlying mechanisms of skill localization more deeply and extend the method to other architectures and domains beyond NLP, toward generalized solutions for efficient model training and deployment.

Authors (4)
  1. Abhishek Panigrahi
  2. Nikunj Saunshi
  3. Haoyu Zhao
  4. Sanjeev Arora