Unveiling the Generalization Power of Fine-Tuned Large Language Models (2403.09162v1)
Abstract: While large language models (LLMs) have demonstrated exceptional multitasking abilities, fine-tuning these models on downstream, domain-specific datasets is often necessary to achieve superior test-set performance relative to their unmodified counterparts. However, the broader effects of fine-tuning on the LLMs' generalization ability are not fully understood. This paper examines the differences between original, unmodified LLMs and their fine-tuned variants. Our primary investigation centers on whether fine-tuning affects the generalization ability intrinsic to LLMs. To this end, we conduct extensive experiments across five distinct language tasks on various datasets. Our main findings reveal that models fine-tuned on generation and classification tasks exhibit dissimilar behaviors when generalizing to different domains and tasks. Intriguingly, we observe that integrating an in-context learning strategy during fine-tuning on generation tasks can enhance the model's generalization ability. Through this systematic investigation, we aim to contribute valuable insights into the evolving landscape of fine-tuning practices for LLMs.
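One plausible reading of the abstract's key technique, integrating in-context learning into fine-tuning on generation tasks, is to prepend a few demonstration pairs to each training example and compute the language-modeling loss only on the target tokens. The snippet below is a minimal sketch under that assumption, not the paper's actual training code; the model name, prompt template, data, and helper names are illustrative placeholders.

```python
# Minimal sketch (assumed, not the paper's code): fine-tune a causal LM on a
# generation task with in-context demonstrations prepended to each example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumption: any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def build_icl_prompt(demonstrations, query):
    """Prepend (input, output) demonstration pairs before the actual query."""
    parts = [f"Input: {x}\nOutput: {y}\n" for x, y in demonstrations]
    parts.append(f"Input: {query}\nOutput: ")
    return "\n".join(parts)

def training_step(demonstrations, query, target):
    """One gradient step; the loss covers only the target tokens."""
    prompt_ids = tokenizer(build_icl_prompt(demonstrations, query),
                           return_tensors="pt").input_ids
    target_ids = tokenizer(target + tokenizer.eos_token,
                           add_special_tokens=False,
                           return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt/demo positions
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Toy usage on a summarization-style example:
demos = [("A long article about climate policy ...", "Short summary A."),
         ("A long article about a sports match ...", "Short summary B.")]
loss = training_step(demos, "A long article about a new telescope ...",
                     "Short summary C.")
```

Masking the demonstration and prompt positions (the `-100` labels) keeps the objective focused on generating the target, while the demonstrations still shape the context the model conditions on during fine-tuning.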
Authors: Haoran Yang, Yumeng Zhang, Jiaqi Xu, Hongyuan Lu, Pheng Ann Heng, Wai Lam