Balancing Speciality and Versatility: A Coarse to Fine Framework for Mitigating Catastrophic Forgetting in Large Language Models (2404.10306v6)
Abstract: Aligned LLMs showcase remarkable versatility, handling diverse real-world tasks. At the same time, they are also expected to exhibit speciality, excelling in specific applications. However, fine-tuning with extra data, a common practice to gain speciality, often leads to catastrophic forgetting (CF) of previously acquired versatility, hindering the model's performance across diverse tasks. To address this challenge, we propose CoFiTune, a coarse-to-fine framework that aims to strike a balance between speciality and versatility. At the coarse-grained level, an empirical tree-search algorithm pinpoints and updates the specific modules that are crucial for speciality, while keeping other parameters frozen; at the fine-grained level, a soft-masking mechanism regulates updates to the LLM, mitigating the CF issue without harming speciality. In an overall evaluation of both speciality and versatility, CoFiTune consistently outperforms baseline methods across diverse tasks and model scales. Compared with full-parameter SFT, CoFiTune yields about a 14% versatility improvement with only marginal speciality loss on a 13B model. Finally, based on further analysis, we provide a speculative insight into the information-forwarding process in LLMs, which helps explain the effectiveness of the proposed method. The code is available at https://github.com/rattlesnakey/CoFiTune.
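As an illustrative sketch only (not the authors' released implementation), the two levels described in the abstract can be pictured as two hooks in an ordinary fine-tuning loop: first freeze every parameter except the modules selected as speciality-critical, then scale the surviving gradients with a soft mask before the optimizer step. The module list `selected_modules`, the `importance` scores, and the helper names below are hypothetical placeholders chosen for illustration; the paper's actual module selection relies on its empirical tree-search procedure rather than a hand-picked list.

```python
import torch

def freeze_except(model: torch.nn.Module, selected_modules: list[str]) -> None:
    """Coarse-grained step (sketch): train only parameters whose name
    contains one of `selected_modules`; freeze everything else."""
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in selected_modules)

def soft_mask_gradients(model: torch.nn.Module, importance: dict, eps: float = 1e-8) -> None:
    """Fine-grained step (sketch): after backward(), down-weight each
    trainable parameter's gradient with a soft mask in [0, 1] derived from
    an importance score, so units deemed important for versatility receive
    smaller updates. The scoring scheme here is a placeholder."""
    for name, param in model.named_parameters():
        if param.grad is None or name not in importance:
            continue
        score = importance[name]                       # tensor shaped like param
        mask = 1.0 - score / (score.max() + eps)       # high importance -> small update
        param.grad.mul_(mask)

# Hypothetical usage inside a fine-tuning loop:
# freeze_except(model, selected_modules=["layers.20.mlp", "layers.21.mlp"])
# loss = model(**batch).loss
# loss.backward()
# soft_mask_gradients(model, importance)   # importance: dict[str, torch.Tensor]
# optimizer.step(); optimizer.zero_grad()
```

This sketch only conveys the division of labor between the coarse (which modules are trainable) and fine (how strongly they are updated) levels; consult the linked repository for the actual CoFiTune procedure.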