
MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning (2407.20999v2)

Published 30 Jul 2024 in cs.LG and cs.AI

Abstract: Recently, LLMs have demonstrated remarkable capabilities in a wide range of tasks. Typically, an LLM is pre-trained on large corpora and subsequently fine-tuned on task-specific datasets. However, during fine-tuning, LLMs may forget the knowledge acquired in the pre-training stage, leading to a decline in general capabilities. To address this issue, we propose a new fine-tuning algorithm termed Momentum-Filtered Optimizer (MoFO). The key idea of MoFO is to iteratively select and update the model parameters with the largest momentum magnitudes. Compared to full-parameter training, MoFO achieves similar fine-tuning performance while keeping parameters closer to the pre-trained model, thereby mitigating knowledge forgetting. Unlike most existing methods for forgetting mitigation, MoFO combines the following two advantages. First, MoFO does not require access to pre-training data. This makes MoFO particularly suitable for fine-tuning scenarios where pre-training data is unavailable, such as fine-tuning checkpoint-only open-source LLMs. Second, MoFO does not alter the original loss function. This could avoid impairing the model performance on the fine-tuning tasks. We validate MoFO through rigorous convergence analysis and extensive experiments, demonstrating its superiority over existing methods in mitigating forgetting and enhancing fine-tuning performance.

Overview of the Paper: "MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning"

Fine-tuning LLMs on task-specific data has become standard practice given their remarkable general capabilities. A pervasive issue in this process, however, is catastrophic forgetting: once fine-tuned on new data, a model tends to lose knowledge acquired during pre-training. The paper "MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning" addresses this challenge by introducing a new fine-tuning algorithm, the Momentum-Filtered Optimizer (MoFO).

Methodology

The key innovation in MoFO is its selective parameter-update mechanism. Unlike traditional full-parameter fine-tuning, which updates every weight, MoFO uses optimizer momentum to decide which parameters to update: at each iteration, only the parameters with the largest momentum magnitudes are updated. This momentum-filtered selection keeps the model closer to its pre-trained state and thereby reduces the risk of knowledge forgetting.
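To make the selection mechanism concrete, the following is a minimal sketch of a momentum-filtered update step in PyTorch. The function name, the `update_fraction` hyperparameter, and the plain-dict bookkeeping are illustrative assumptions rather than the authors' reference implementation; the sketch assumes Adam-style moments are tracked for every entry while only the top fraction of entries by first-moment magnitude within each tensor actually moves.

```python
import torch


@torch.no_grad()
def mofo_style_step(params, grads, state, lr=1e-5, betas=(0.9, 0.999),
                    eps=1e-8, update_fraction=0.15):
    """Hypothetical momentum-filtered update (illustrative sketch, not the
    authors' code). Moments are maintained for all entries, but only the
    `update_fraction` of entries with the largest first-moment magnitude in
    each tensor is moved; the rest keep their current values, staying close
    to the pre-trained point."""
    beta1, beta2 = betas
    state["step"] = state.get("step", 0) + 1
    t = state["step"]

    for name, p in params.items():
        g = grads[name]
        m = state.setdefault(f"{name}.m", torch.zeros_like(p))
        v = state.setdefault(f"{name}.v", torch.zeros_like(p))

        # Standard Adam moment updates for every entry.
        m.mul_(beta1).add_(g, alpha=1 - beta1)
        v.mul_(beta2).addcmul_(g, g, value=1 - beta2)

        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)

        # Momentum filter: keep only the top-k entries of this tensor by |momentum|.
        k = max(1, int(update_fraction * p.numel()))
        threshold = m.abs().flatten().kthvalue(p.numel() - k + 1).values
        mask = (m.abs() >= threshold).to(p.dtype)

        # Masked Adam step; unselected entries are left untouched.
        p.add_(-lr * mask * m_hat / (v_hat.sqrt() + eps))
```

The per-tensor masking above is one plausible way to realize the partition-wise selection described in the paper; where each partition boundary is drawn and how large the update fraction should be are tunable choices.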

MoFO distinguishes itself by not requiring access to pre-training data—a significant advantage given that many open-source LLMs do not fully disclose their pre-training datasets. Moreover, MoFO does not alter the original loss function, thus avoiding any potential degradation in model performance due to modifications in the optimization objective.
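For contrast, a typical regularization-based alternative (the kind of objective modification MoFO avoids) adds an $L_2$ penalty that pulls the weights back toward their pre-trained values. The helper below is a generic sketch of such a baseline, with illustrative names and an assumed coefficient `lam`; it is shown only to highlight the difference in approach.

```python
import torch


def loss_with_l2_to_init(task_loss, params, init_params, lam=1e-3):
    """Regularization-based baseline for contrast: the fine-tuning objective
    itself is changed by a pull-back penalty, whereas MoFO leaves the task
    loss untouched and instead filters which parameters get updated."""
    penalty = sum(((p - p0) ** 2).sum() for p, p0 in zip(params, init_params))
    return task_loss + lam * penalty
```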

Analytical and Empirical Validation

The paper rigorously evaluates MoFO through both theoretical and empirical lenses:

  1. Convergence Analysis: A theoretical analysis of a simplified variant of MoFO establishes its convergence, supporting the soundness and reliability of the proposed method.
  2. Empirical Performance: Extensive experiments across tasks validate MoFO's effectiveness. The results show that MoFO mitigates forgetting better than existing baselines while matching the fine-tuning performance of full-parameter training.

Experimental Results

The experimental setup involves evaluating MoFO on tasks derived from datasets like MetaMathQA and Code-Alpaca, using LLMs such as Llama-2-7B and TinyLlama-1.1B. Key findings from these experiments include:

  • Fine-Tuning Performance: MoFO shows competitive performance on task-specific datasets compared to full fine-tuning and other baseline methods such as $L_1$-regularization and $L_2$-regularization.
  • Preservation of General Capabilities: MoFO demonstrates a significant reduction in the degradation of general capabilities, as evidenced by metrics on various benchmarks such as MMLU, Commonsense, GSM8K, and HumanEval.
  • Continual Fine-Tuning: In continual fine-tuning on the TRACE benchmark, MoFO outperforms conventional methods in overall accuracy and backward transfer (see the sketch after this list for how these metrics are typically computed).
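Overall accuracy and backward transfer (BWT) are standard continual-learning metrics computed from an accuracy matrix R, where R[i][j] is the accuracy on task j after sequentially fine-tuning through task i: overall accuracy averages the final row, and BWT averages how much earlier tasks change after the full sequence (negative values indicate forgetting). The snippet below is a generic sketch of these conventional definitions, not code from the paper.

```python
from typing import List, Tuple


def overall_accuracy_and_bwt(R: List[List[float]]) -> Tuple[float, float]:
    """Compute overall accuracy and backward transfer from an accuracy
    matrix R, where R[i][j] is accuracy on task j after training through
    task i (standard continual-learning definitions)."""
    T = len(R)
    overall = sum(R[T - 1]) / T
    bwt = sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)
    return overall, bwt


# Toy example: two tasks, slight forgetting of task 0 after learning task 1.
print(overall_accuracy_and_bwt([[0.80, 0.10],
                                [0.75, 0.70]]))  # approx. (0.725, -0.05)
```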

Implications and Future Work

The practical implications of MoFO are profound. By mitigating the issue of forgetting, MoFO extends the utility of LLMs in applications requiring incremental learning and adaptation to new tasks without sacrificing previously learned knowledge. Theoretically, it also opens new avenues for understanding the dynamics of fine-tuning in deep learning models.

Future developments could focus on refining the selection criteria for parameter updates and exploring the integration of MoFO with other optimization and regularization strategies. Additionally, extensions of MoFO to multi-modal LLMs could provide a broader scope of application and enhance the robustness of the approach.

Conclusion

In summary, "MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning" presents a novel and efficient solution to a critical problem in the field of LLM fine-tuning. By leveraging momentum to selectively update parameters, MoFO achieves a balance between retaining pre-trained knowledge and optimizing for new tasks. This paper contributes a significant step forward in the sustainable development of LLMs, ensuring their adaptability and efficacy across diverse tasks and domains.

Authors (7)
  1. YuPeng Chen
  2. Senmiao Wang
  3. Zhihang Lin
  4. Zeyu Qin
  5. Yushun Zhang
  6. Tian Ding
  7. Ruoyu Sun