
MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning (2407.20999v2)

Published 30 Jul 2024 in cs.LG and cs.AI

Abstract: Recently, LLMs have demonstrated remarkable capabilities in a wide range of tasks. Typically, an LLM is pre-trained on large corpora and subsequently fine-tuned on task-specific datasets. However, during fine-tuning, LLMs may forget the knowledge acquired in the pre-training stage, leading to a decline in general capabilities. To address this issue, we propose a new fine-tuning algorithm termed Momentum-Filtered Optimizer (MoFO). The key idea of MoFO is to iteratively select and update the model parameters with the largest momentum magnitudes. Compared to full-parameter training, MoFO achieves similar fine-tuning performance while keeping parameters closer to the pre-trained model, thereby mitigating knowledge forgetting. Unlike most existing methods for forgetting mitigation, MoFO combines the following two advantages. First, MoFO does not require access to pre-training data. This makes MoFO particularly suitable for fine-tuning scenarios where pre-training data is unavailable, such as fine-tuning checkpoint-only open-source LLMs. Second, MoFO does not alter the original loss function. This could avoid impairing the model performance on the fine-tuning tasks. We validate MoFO through rigorous convergence analysis and extensive experiments, demonstrating its superiority over existing methods in mitigating forgetting and enhancing fine-tuning performance.

Overview of the Paper: "MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning"

Fine-tuning LLMs on task-specific data has become standard practice given their remarkable general capabilities. A pervasive issue in this process, however, is catastrophic forgetting: once fine-tuned on new data, a model tends to lose knowledge acquired during pre-training. The paper "MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning" addresses this challenge by introducing a new fine-tuning algorithm, the Momentum-Filtered Optimizer (MoFO).

Methodology

The key innovation in MoFO is its selective parameter-update mechanism. Unlike traditional full-parameter fine-tuning, which updates every weight, MoFO uses optimizer momentum to decide which parameters to update: at each iteration, only the parameters with the largest momentum magnitudes are updated. This momentum-filtered selection keeps the model closer to its pre-trained state and thereby reduces the risk of knowledge forgetting.
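To make the selection mechanism concrete, the following is a minimal sketch of a momentum-filtered update step in PyTorch. The function name, the `update_fraction` hyperparameter, and the plain-dict bookkeeping are illustrative assumptions rather than the authors' reference implementation; the sketch assumes Adam-style moments are tracked for every entry while only the top fraction of entries by first-moment magnitude within each tensor actually moves.

```python
import torch


@torch.no_grad()
def mofo_style_step(params, grads, state, lr=1e-5, betas=(0.9, 0.999),
                    eps=1e-8, update_fraction=0.15):
    """Hypothetical momentum-filtered update (illustrative sketch, not the
    authors' code). Moments are maintained for all entries, but only the
    `update_fraction` of entries with the largest first-moment magnitude in
    each tensor is moved; the rest keep their current values, staying close
    to the pre-trained point."""
    beta1, beta2 = betas
    state["step"] = state.get("step", 0) + 1
    t = state["step"]

    for name, p in params.items():
        g = grads[name]
        m = state.setdefault(f"{name}.m", torch.zeros_like(p))
        v = state.setdefault(f"{name}.v", torch.zeros_like(p))

        # Standard Adam moment updates for every entry.
        m.mul_(beta1).add_(g, alpha=1 - beta1)
        v.mul_(beta2).addcmul_(g, g, value=1 - beta2)

        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)

        # Momentum filter: keep only the top-k entries of this tensor by |momentum|.
        k = max(1, int(update_fraction * p.numel()))
        threshold = m.abs().flatten().kthvalue(p.numel() - k + 1).values
        mask = (m.abs() >= threshold).to(p.dtype)

        # Masked Adam step; unselected entries are left untouched.
        p.add_(-lr * mask * m_hat / (v_hat.sqrt() + eps))
```

The per-tensor masking above is one plausible way to realize the partition-wise selection described in the paper; where each partition boundary is drawn and how large the update fraction should be are tunable choices.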

MoFO distinguishes itself by not requiring access to pre-training data—a significant advantage given that many open-source LLMs do not fully disclose their pre-training datasets. Moreover, MoFO does not alter the original loss function, thus avoiding any potential degradation in model performance due to modifications in the optimization objective.
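For contrast, a typical regularization-based alternative (the kind of objective modification MoFO avoids) adds an $L_2$ penalty that pulls the weights back toward their pre-trained values. The helper below is a generic sketch of such a baseline, with illustrative names and an assumed coefficient `lam`; it is shown only to highlight the difference in approach.

```python
import torch


def loss_with_l2_to_init(task_loss, params, init_params, lam=1e-3):
    """Regularization-based baseline for contrast: the fine-tuning objective
    itself is changed by a pull-back penalty, whereas MoFO leaves the task
    loss untouched and instead filters which parameters get updated."""
    penalty = sum(((p - p0) ** 2).sum() for p, p0 in zip(params, init_params))
    return task_loss + lam * penalty
```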

Analytical and Empirical Validation

The paper rigorously evaluates MoFO through both theoretical and empirical lenses:

  1. Convergence Analysis: A theoretical analysis of a simplified variant of MoFO establishes its convergence, supporting the soundness and reliability of the proposed method.
  2. Empirical Performance: Extensive experiments across tasks validate MoFO's effectiveness. The results show that MoFO mitigates forgetting better than existing baselines while matching the fine-tuning performance of full-parameter training.

Experimental Results

The experimental setup involves evaluating MoFO on tasks derived from datasets like MetaMathQA and Code-Alpaca, using LLMs such as Llama-2-7B and TinyLlama-1.1B. Key findings from these experiments include:

  • Fine-Tuning Performance: MoFO shows competitive performance on task-specific datasets compared to full fine-tuning and other baseline methods such as $L_1$-regularization and $L_2$-regularization.
  • Preservation of General Capabilities: MoFO demonstrates a significant reduction in the degradation of general capabilities, as evidenced by metrics on various benchmarks such as MMLU, Commonsense, GSM8K, and HumanEval.
  • Continual Fine-Tuning: In continual fine-tuning on the TRACE benchmark, MoFO outperforms conventional methods in overall accuracy and backward transfer (see the sketch after this list for how these metrics are typically computed).
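Overall accuracy and backward transfer (BWT) are standard continual-learning metrics computed from an accuracy matrix R, where R[i][j] is the accuracy on task j after sequentially fine-tuning through task i: overall accuracy averages the final row, and BWT averages how much earlier tasks change after the full sequence (negative values indicate forgetting). The snippet below is a generic sketch of these conventional definitions, not code from the paper.

```python
from typing import List, Tuple


def overall_accuracy_and_bwt(R: List[List[float]]) -> Tuple[float, float]:
    """Compute overall accuracy and backward transfer from an accuracy
    matrix R, where R[i][j] is accuracy on task j after training through
    task i (standard continual-learning definitions)."""
    T = len(R)
    overall = sum(R[T - 1]) / T
    bwt = sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)
    return overall, bwt


# Toy example: two tasks, slight forgetting of task 0 after learning task 1.
print(overall_accuracy_and_bwt([[0.80, 0.10],
                                [0.75, 0.70]]))  # approx. (0.725, -0.05)
```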

Implications and Future Work

The practical implications of MoFO are profound. By mitigating the issue of forgetting, MoFO extends the utility of LLMs in applications requiring incremental learning and adaptation to new tasks without sacrificing previously learned knowledge. Theoretically, it also opens new avenues for understanding the dynamics of fine-tuning in deep learning models.

Future developments could focus on refining the selection criteria for parameter updates and exploring the integration of MoFO with other optimization and regularization strategies. Additionally, extensions of MoFO to multi-modal LLMs could provide a broader scope of application and enhance the robustness of the approach.

Conclusion

In summary, "MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning" presents a novel and efficient solution to a critical problem in the field of LLM fine-tuning. By leveraging momentum to selectively update parameters, MoFO achieves a balance between retaining pre-trained knowledge and optimizing for new tasks. This paper contributes a significant step forward in the sustainable development of LLMs, ensuring their adaptability and efficacy across diverse tasks and domains.

Authors (7)
  1. YuPeng Chen
  2. Senmiao Wang
  3. Zhihang Lin
  4. Zeyu Qin
  5. Yushun Zhang
  6. Tian Ding
  7. Ruoyu Sun