
LM-Cocktail: Resilient Tuning of Language Models via Model Merging (2311.13534v4)

Published 22 Nov 2023 in cs.CL, cs.AI, and cs.IR

Abstract: The pre-trained LLMs are continually fine-tuned to better support downstream applications. However, this operation may result in significant performance degeneration on general tasks beyond the targeted domain. To overcome this problem, we propose LM-Cocktail which enables the fine-tuned model to stay resilient in general perspectives. Our method is conducted in the form of model merging, where the fine-tuned LLM is merged with the pre-trained base model or the peer models from other domains through weighted average. Despite simplicity, LM-Cocktail is surprisingly effective: the resulted model is able to achieve a strong empirical performance in the whole scope of general tasks while preserving a superior capacity in its targeted domain. We conduct comprehensive experiments with LLama and BGE model on popular benchmarks, including FLAN, MMLU, MTEB, whose results validate the efficacy of our proposed method. The code and checkpoints are available at https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail.


Summary

  • The paper introduces LM-Cocktail, a post-refinement method that uses weighted model merging to alleviate catastrophic forgetting and preserve general capabilities.
  • The methodology integrates fine-tuned and base models with minimal computational overhead, fitting smoothly into existing LLM workflows.
  • Empirical results on both decoder and encoder-based models confirm enhanced task-specific accuracy without degrading performance on broader tasks.

Analyzing "LM-Cocktail: Resilient Tuning of LLMs via Model Merging"

The paper "LM-Cocktail: Resilient Tuning of LLMs via Model Merging" by Shitao Xiao et al., presents a methodological advancement for fine-tuning LLMs. This research addresses the prevalent issue of catastrophic forgetting, where fine-tuning LLMs can enhance performance on specific tasks but diminish general capabilities across other tasks.

Methodology Overview

The authors propose LM-Cocktail, a straightforward and effective approach involving model merging through weighted averaging. This technique integrates the fine-tuned model with a pre-trained base model and potentially other domain-specific fine-tuned peer models. The process is designed to bolster task-specific performance without sacrificing the model's ability to perform well on a broad array of tasks.

The methodology functions as a post-refinement step with minimal computational overhead, making it highly compatible with existing workflows. Merging weights are computed from each candidate model's loss on a small set of few-shot examples from the target domain, with lower-loss models receiving larger weights through a softmax-style weighting, as sketched below.
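
For concreteness, the following is a minimal sketch of this kind of loss-weighted merging in PyTorch. It is an illustration under stated assumptions, not the authors' released implementation: `merging_weights` converts few-shot losses into normalized weights via a softmax over negative losses, and `merge_state_dicts` takes the weighted average of parameters across models that share the same architecture.

```python
# Minimal sketch of loss-weighted model merging in the spirit of LM-Cocktail.
# Assumptions: all candidate models share the same architecture, and each
# candidate has already been scored (lower loss = better fit) on a handful
# of examples from the target domain.
import torch


def merging_weights(few_shot_losses, temperature=1.0):
    """Convert per-model few-shot losses into merging weights.

    A softmax over negative losses assigns larger weights to models that fit
    the target-domain examples better; `temperature` controls how sharply the
    weights concentrate on the best candidate.
    """
    losses = torch.tensor(few_shot_losses, dtype=torch.float32)
    return torch.softmax(-losses / temperature, dim=0).tolist()


def merge_state_dicts(state_dicts, weights):
    """Element-wise weighted average of parameters with identical shapes."""
    merged = {}
    for key, ref in state_dicts[0].items():
        if ref.is_floating_point():
            merged[key] = sum(w * sd[key] for sd, w in zip(state_dicts, weights))
        else:
            merged[key] = ref.clone()  # copy integer buffers from the first model
    return merged
```

The merged dictionary can then be loaded into a fresh copy of the shared architecture with `load_state_dict`.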

Experimental Evaluation

The paper conducts comprehensive experiments with Llama and BGE models across benchmarks such as FLAN, MMLU, and MTEB. Notably, the empirical results show strong improvements on the targeted tasks while performance on general tasks is preserved, demonstrating the efficacy of LM-Cocktail.

  • Decoder-based LLMs: The experiments reveal that LM-Cocktail improves both target-task accuracy and performance on other tasks. Merging with the base model alone and with additional fine-tuned peer models (the 2-model and 10-model variants of LM-Cocktail, respectively) consistently improved general capabilities; a usage sketch follows this list.
  • Encoder-based Models: Similar trends were observed, illustrating the versatility of LM-Cocktail across different types of LLMs.
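
As a usage illustration only, the hypothetical snippet below combines the helpers sketched in the methodology section to produce either variant: passing only the fine-tuned model and its base gives the 2-model merge, while appending domain-specific peers gives the 10-model merge. Here `evaluate_few_shot_loss`, the model variables, and `few_shot_examples` are assumed placeholders, not names from the paper's codebase.

```python
import copy

# Hypothetical usage built on merging_weights / merge_state_dicts from the
# earlier sketch. `evaluate_few_shot_loss`, the candidate models, and
# `few_shot_examples` stand in for the user's own evaluation code and checkpoints.
candidates = [finetuned_target, base_model] + peer_models  # empty peer list -> 2-model merge
losses = [evaluate_few_shot_loss(m, few_shot_examples) for m in candidates]
weights = merging_weights(losses)
merged = merge_state_dicts([m.state_dict() for m in candidates], weights)

resilient_model = copy.deepcopy(base_model)  # fresh copy of the shared architecture
resilient_model.load_state_dict(merged)
```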

Implications and Future Directions

The proposed method offers significant practical advantages. It provides an efficient strategy for sustaining an LLM's general capabilities while tailoring it to specific tasks, without requiring extensive retraining. The approach's simplicity makes it broadly applicable, including in scenarios where full fine-tuning is infeasible due to data or resource constraints.

The theoretical contribution lies in a resilient tuning paradigm that harmonizes specialist and generalist model traits. This can be instrumental for evolving LLM applications, especially in diverse multi-task or rapidly changing environments.

Future work could investigate more sophisticated weight-computation methods or extend the approach to other architectures. Additionally, integrating model merging with more complex model-composition frameworks may yield further gains in dynamically tuned AI systems.

The LM-Cocktail approach underscores the potential of model merging innovations, inviting further research on resilient fine-tuning methods in the field of AI.
