
LM-Cocktail: Resilient Tuning of Language Models via Model Merging

Published 22 Nov 2023 in cs.CL, cs.AI, and cs.IR | (2311.13534v4)

Abstract: The pre-trained LLMs are continually fine-tuned to better support downstream applications. However, this operation may result in significant performance degeneration on general tasks beyond the targeted domain. To overcome this problem, we propose LM-Cocktail, which enables the fine-tuned model to stay resilient on general tasks. Our method is conducted in the form of model merging, where the fine-tuned LLM is merged with the pre-trained base model or peer models from other domains through weighted averaging. Despite its simplicity, LM-Cocktail is surprisingly effective: the resulting model achieves strong empirical performance across the whole scope of general tasks while preserving superior capacity in its targeted domain. We conduct comprehensive experiments with LLaMA and BGE models on popular benchmarks, including FLAN, MMLU, and MTEB, whose results validate the efficacy of our proposed method. The code and checkpoints are available at https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail.


Summary

  • The paper introduces LM-Cocktail, a method that merges fine-tuned and pre-trained models to mitigate catastrophic forgetting.
  • It uses weighted parameter averaging based on few-shot examples to balance performance across general and target tasks.
  • Experiments on both encoder (BGE) and decoder (LLaMA) models demonstrate consistent improvements across multiple benchmarks.

"LM-Cocktail: Resilient Tuning of LLMs via Model Merging"

Abstract and Introduction

The paper "LM-Cocktail: Resilient Tuning of LLMs via Model Merging" (2311.13534) addresses the challenge of fine-tuning pre-trained LLMs for specific tasks without compromising their performance on general tasks. Traditional fine-tuning often results in catastrophic forgetting, where a model loses its ability to perform well on tasks outside its fine-tuning domain. This research proposes LM-Cocktail, a model merging method that preserves general capabilities while enhancing performance on targeted tasks.

The proposed method combines the fine-tuned model with the pre-trained base model and with models fine-tuned on other domains, using a weighted average of their parameters. The approach is simple and effective, and since it applies as a post-processing step after fine-tuning, it remains compatible with existing training pipelines. Empirical evaluations on LLaMA and BGE models demonstrate the efficacy of LM-Cocktail across various benchmarks, indicating broad applicability across model types.

Figure 1: The illustration of LM-Cocktail demonstrating improved accuracy on new target tasks while maintaining performance on other tasks.

LM-Cocktail: Framework and Variations

General Paradigm

The LM-Cocktail approach involves merging models by averaging their parameters based on performance on few-shot examples from the target domain. Given a pre-trained base model and a set of domain-specific fine-tuned models, LM-Cocktail constructs a resilient-tuned model that integrates strengths from multiple models. The merging formula is expressed as:

$$\mathcal{M}_r \leftarrow \alpha\,\mathcal{M}_t + (1-\alpha)\sum_i w_i\,\mathcal{M}_i$$

where $\mathcal{M}_r$ is the resilient-tuned model, $\mathcal{M}_t$ is the model fine-tuned on the target task, the $\mathcal{M}_i$ are the base model and peer models from other domains, $\alpha$ is a hyper-parameter, and the weights $w_i$ are computed from each model's prediction loss on few-shot examples from the target domain.
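As a minimal sketch of this procedure in PyTorch (not the authors' released implementation, which is available in the FlagEmbedding repository linked above): the softmax-over-negative-losses weighting, the temperature, and the `loss_fn` interface are assumptions made for illustration.

```python
import torch

def compute_merge_weights(models, few_shot_batch, loss_fn, temperature=1.0):
    """Weight each candidate model by its loss on few-shot target examples.

    Lower loss -> larger weight. The softmax normalization and temperature
    are illustrative assumptions, not the paper's exact formulation.
    """
    losses = []
    for model in models:
        model.eval()
        with torch.no_grad():
            losses.append(float(loss_fn(model, few_shot_batch)))
    return torch.softmax(-torch.tensor(losses) / temperature, dim=0).tolist()

def lm_cocktail_merge(target_model, peer_models, weights, alpha=0.5):
    """Parameter-wise weighted average:
    M_r <- alpha * M_t + (1 - alpha) * sum_i w_i * M_i.

    All models must share one architecture; non-floating-point state
    (e.g. integer buffers) is copied from the target model unchanged.
    """
    merged = {}
    for name, p in target_model.state_dict().items():
        merged[name] = alpha * p if p.is_floating_point() else p.clone()
    for w, peer in zip(weights, peer_models):
        for name, p in peer.state_dict().items():
            if p.is_floating_point():
                merged[name] = merged[name] + (1.0 - alpha) * w * p
    target_model.load_state_dict(merged)
    return target_model
```

Because the merge happens purely in parameter space, the cost is a single pass over the state dicts: no gradient computation or additional training is involved.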

Variations

LM-Cocktail includes adaptations for constrained scenarios:

  • Mono-Specialist: when no peer models from other domains are available, only the base model and the targeted fine-tuned model are merged (see the reduced formula after this list).
  • Without Fine-tuning: when targeted data is insufficient for fine-tuning, the base model is merged directly with peer models from other general domains.
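In the mono-specialist case, the summation collapses to the pre-trained base model $\mathcal{M}_b$ alone, so the general rule reduces to a two-model interpolation:

$$\mathcal{M}_r \leftarrow \alpha\,\mathcal{M}_t + (1-\alpha)\,\mathcal{M}_b$$

Here $\alpha$ directly controls the balance between target-task specialization and retained general capability.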

Experimental Setup

Decoder-based and Encoder-based Models

Experiments are performed using decoder-based (LLaMA) and encoder-based (BGE) models, across benchmarks like FLAN, MMLU, and MTEB. Fine-tuning involves various tasks and datasets, with evaluations conducted on unseen tasks using LM-Cocktail variants.

Results

LM-Cocktail consistently enhances performance compared to standard fine-tuning:

Figure 2: Performance with different alpha values demonstrates fine-tuning efficacy across tasks.

  • Decoder Models: Retain performance in the target domain and exhibit enhanced general task capabilities. LM-Cocktail$_{2}$ (merging the base and target models) and LM-Cocktail$_{10}$ (additionally including general-domain specialists) outperform fine-tuned models on unrelated tasks.
  • Encoder Models: Provide similar improvements, validating LM-Cocktail's universality across model types.

LM-Cocktail without Fine-tuning

Investigating scenarios where fine-tuning is infeasible, LM-Cocktail leverages models from other domains to enhance performance on new tasks. Results from additional tasks in MMLU show LM-Cocktail surpasses traditional methods, proving effective even with minimal data.

Analysis of Merging Approach

Impact of Weight Alpha

Adjustments to the merging weight $\alpha$ reveal improvements on general tasks with slight trade-offs in target-task performance, emphasizing the method's flexibility to tune model characteristics dynamically.

Figure 3: Performance of encoder-based LMs with different merging weights.
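A straightforward way to explore this trade-off is to sweep $\alpha$ and score the merged model on both the target task and a general-task suite. The sketch below reuses the `lm_cocktail_merge` helper from the earlier snippet; `eval_target` and `eval_general` are hypothetical user-supplied scoring functions.

```python
import copy

def sweep_alpha(target_model, peer_models, weights, eval_target, eval_general,
                alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Record (target score, general score) for each merging weight alpha."""
    results = {}
    for alpha in alphas:
        # Deep-copy so every merge starts from the original fine-tuned weights.
        merged = lm_cocktail_merge(copy.deepcopy(target_model),
                                   peer_models, weights, alpha=alpha)
        results[alpha] = (eval_target(merged), eval_general(merged))
    return results
```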

Example Number Effectiveness

Varying the number of few-shot examples shows that the merging weights, and hence the merged model's performance, stabilize with only a handful of samples, underscoring the efficiency of the merging process.

Conclusion

LM-Cocktail presents a practical solution to the problem of catastrophic forgetting in LLMs. Its ability to merge models efficiently broadens deployment possibilities without significant computational overhead. Future work may explore further optimization strategies for merging weights and extend the approach to other model types.

Overall, LM-Cocktail is a versatile tool in the domain of AI model tuning, offering resilience without sacrificing performance. The provided empirical evidence establishes a foundation for continued exploration and application in diverse AI systems.
