An Empirical Study of Catastrophic Forgetting in LLMs During Continual Fine-tuning
The paper presents a rigorous empirical analysis of catastrophic forgetting (CF) in large language models (LLMs) during continual instruction fine-tuning. It is motivated by the growing use of LLMs across diverse applications, where models are continually refined on new data for specific tasks. Continual learning, though beneficial, poses a significant challenge in the form of CF: while acquiring new knowledge, a model may lose information ingrained during previous training. The paper stands out by evaluating CF across varying model architectures and scales, specifically using models such as BLOOMZ, mT0, LLaMA, and Alpaca.
The research addresses three primary questions concerning CF in LLMs: whether general knowledge within LLMs is forgotten during continual fine-tuning, which types of knowledge are most susceptible to forgetting, and how model scale, architecture, and prior general instruction tuning affect the problem. Experiments were performed across models of different scales (from 1B to 7.1B parameters) and types (decoder-only and encoder-decoder architectures) to assess their susceptibility to CF.
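To make the setting concrete, the sketch below illustrates what continual instruction fine-tuning means operationally: a single model is fine-tuned on a stream of instruction tasks one after another, and its general knowledge is probed after each stage. This is a minimal illustration assuming a Hugging Face causal-LM checkpoint and placeholder data files, not the authors' exact pipeline.

```python
# Minimal sketch of sequential (continual) instruction fine-tuning.
# The checkpoint name and task files are placeholders, not the paper's exact setup.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL_NAME = "bigscience/bloomz-1b1"            # any causal-LM checkpoint
TASK_STREAM = ["task_a.jsonl", "task_b.jsonl"]  # hypothetical instruction datasets

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # labels = inputs for LM loss

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

for step, task_file in enumerate(TASK_STREAM):
    # Fine-tune on the next task, starting from the weights left by the previous task.
    ds = load_dataset("json", data_files=task_file)["train"]
    ds = ds.map(tokenize, batched=True, remove_columns=ds.column_names)
    args = TrainingArguments(output_dir=f"ckpt_task{step}", num_train_epochs=1,
                             per_device_train_batch_size=4, report_to=[])
    Trainer(model=model, args=args, train_dataset=ds, data_collator=collator).train()
    # After each task, re-run the general-knowledge probes to track how much is forgotten.
```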
The authors designed an evaluation suite spanning multiple domains and tasks to assess general knowledge retention: domain knowledge, reasoning ability, reading comprehension, and model bias, measured with benchmark datasets such as MMLU, HellaSwag, BoolQ, and CrowS-Pairs.
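Most of these benchmarks are multiple-choice, and a common way to score them zero-shot is to pick the answer option to which the model assigns the highest log-likelihood. The sketch below illustrates that scoring scheme; the function names are ours, and this is not necessarily the exact harness the authors used.

```python
# Minimal sketch of zero-shot multiple-choice scoring by answer-option log-likelihood,
# approximating how benchmarks like MMLU or HellaSwag are commonly evaluated.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-1b1")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-1b1").eval()

@torch.no_grad()
def option_loglikelihood(prompt: str, option: str) -> float:
    """Sum of log-probs the model assigns to the option tokens, conditioned on the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    logprobs = model(full_ids).logits.log_softmax(dim=-1)
    # Logits at position t predict token t+1, so score option tokens with a one-step shift.
    targets = full_ids[0, prompt_len:]
    return logprobs[0, prompt_len - 1:-1].gather(1, targets.unsqueeze(1)).sum().item()

def pick_answer(question: str, options: list[str]) -> int:
    """Index of the most likely option; compare against the gold index to compute accuracy."""
    scores = [option_loglikelihood(question + "\nAnswer: ", opt) for opt in options]
    return max(range(len(options)), key=scores.__getitem__)
```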
Key findings indicate that CF is a prevalent issue across all tested LLMs. Intriguingly, the severity of forgetting escalates as model size increases, with larger models (such as BLOOMZ-7.1B) showing more substantial performance drops than their smaller counterparts. Comparing architectures, the decoder-only model BLOOMZ consistently exhibited less CF than the encoder-decoder model mT0, suggesting the decoder-only architecture may be more robust at retaining knowledge during fine-tuning.
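A simple way to quantify that severity is the per-benchmark accuracy drop between the original checkpoint and the model at the end of the fine-tuning stream, averaged across benchmarks. The numbers below are placeholders for illustration only, not results from the paper.

```python
# Hypothetical accuracies before and after the continual fine-tuning stream (NOT the paper's numbers).
before = {"mmlu": 0.42, "hellaswag": 0.55, "boolq": 0.68}
after  = {"mmlu": 0.35, "hellaswag": 0.49, "boolq": 0.60}

# Forgetting per benchmark = accuracy before - accuracy after; then average across benchmarks.
forgetting = {name: before[name] - after[name] for name in before}
avg_forgetting = sum(forgetting.values()) / len(forgetting)
print(forgetting, f"average forgetting: {avg_forgetting:.3f}")
```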
A notable contribution of this paper is the observation that instruction-tuned models like Alpaca are less prone to CF during subsequent fine-tuning than base models without prior instruction tuning, such as LLaMA. This implies that general instruction tuning may bolster resilience to CF in LLMs, highlighting an area for potential optimization in model training strategies.
Finally, the paper reports an unexpected but insightful side effect: continual instruction tuning reduced inherent biases in the models, such as gender and racial bias, offering an additional benefit for broader ethical and fairness considerations in AI.
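CrowS-Pairs quantifies bias by checking, over pairs of minimally different sentences, how often the model assigns higher likelihood to the more stereotyping one; a rate closer to 50% indicates less bias. The sketch below shows a common adaptation of that comparison for a causal LM, with an invented example pair rather than data from CrowS-Pairs itself; it is illustrative, not the paper's exact protocol.

```python
# Minimal sketch of CrowS-Pairs-style bias scoring for a causal LM: compare the
# likelihoods of a stereotyping sentence and a minimally different alternative.
# The example pair is invented for illustration, not taken from CrowS-Pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-1b1")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-1b1").eval()

@torch.no_grad()
def sentence_loglikelihood(sentence: str) -> float:
    """Total log-prob of the full sentence under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    logprobs = model(ids).logits.log_softmax(dim=-1)
    return logprobs[0, :-1].gather(1, ids[0, 1:].unsqueeze(1)).sum().item()

pairs = [("She is bad at math because she is a woman.",
          "She is bad at math because she is a man.")]  # (stereotyping, alternative)
stereo_preferred = sum(sentence_loglikelihood(s) > sentence_loglikelihood(a) for s, a in pairs)
print(f"stereotype preference rate: {stereo_preferred / len(pairs):.2f}")  # 0.5 is the unbiased ideal
```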
These findings bear substantial implications for both theoretical understanding and practical application. They suggest that addressing CF in the continual fine-tuning process is critical for maintaining LLM performance. Moving forward, researchers are encouraged to develop more sophisticated training methodologies and architectures that mitigate CF to better leverage the full potential of LLMs. This paper not only enriches the existing understanding of CF in LLMs but also paves the way for further explorations into designing AI systems capable of sustaining comprehensive knowledge over sequential learning tasks.