Mitigating Catastrophic Forgetting in Language Transfer via Model Merging (2407.08699v2)

Published 11 Jul 2024 in cs.LG

Abstract: As open-weight LLMs achieve ever more impressive performances across a wide range of tasks in English, practitioners aim to adapt these models to different languages. However, such language adaptation is often accompanied by catastrophic forgetting of the base model's capabilities, severely limiting the usefulness of the resulting model. We address this issue by proposing Branch-and-Merge (BaM), a new adaptation method based on iteratively merging multiple models, fine-tuned on a subset of the available training data. BaM is based on the insight that this yields lower magnitude but higher quality weight changes, reducing forgetting of the source domain while maintaining learning on the target domain. We demonstrate in an extensive empirical study on Bulgarian and German that BaM can significantly reduce forgetting while matching or even improving target domain performance compared to both standard continued pretraining and instruction finetuning across different model architectures.

Mitigating Catastrophic Forgetting in Language Transfer via Model Merging

The paper presents a novel approach called Branch-and-Merge (BaM) to address the critical issue of catastrophic forgetting when adapting LLMs to new languages. Catastrophic forgetting severely limits the usefulness of adapted models, since they tend to lose previously acquired capabilities during adaptation. The primary contribution of this work is BaM itself, which iteratively merges models fine-tuned on subsets of the training data, yielding lower-magnitude but higher-quality weight changes and thereby mitigating catastrophic forgetting.

Methodology

The BaM method is inspired by principles from continual learning, leveraging lower-magnitude but higher-quality weight changes to balance learning on the new domain with retention of knowledge from the source domain. The approach partitions the training data into multiple slices and iteratively trains multiple models on these slices in parallel; the resulting models are then merged to form the base model for the next iteration. This process repeats until all data slices have been used. BaM employs model merging techniques such as linear interpolation and spherical linear interpolation (SLERP), with SLERP showing slightly better performance.
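To make the procedure concrete, the following is a minimal sketch of the BaM loop. It is not the authors' released implementation and rests on a few assumptions: models are represented as plain PyTorch state dicts, `train_on_slice` is a hypothetical helper that continues pretraining a copy of the current base model on one data slice, and merging of more than two branches is done by a simple pairwise SLERP reduction.

```python
# Hedged sketch of Branch-and-Merge (BaM); not the paper's official code.
import copy
import torch


def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float = 0.5,
          eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors of the same shape."""
    a, b = w_a.flatten().float(), w_b.flatten().float()
    cos_theta = torch.dot(a, b) / (a.norm() * b.norm() + eps)
    theta = torch.acos(cos_theta.clamp(-1.0, 1.0))
    if theta.abs() < eps:
        # Nearly parallel weights: fall back to plain linear interpolation.
        merged = (1 - t) * a + t * b
    else:
        merged = (torch.sin((1 - t) * theta) * a
                  + torch.sin(t * theta) * b) / torch.sin(theta)
    return merged.view_as(w_a).to(w_a.dtype)


def merge_models(models: list[dict], t: float = 0.5) -> dict:
    """Merge a list of state dicts, key by key, via pairwise SLERP reduction."""
    merged = models[0]
    for other in models[1:]:
        merged = {k: slerp(merged[k], other[k], t) for k in merged}
    return merged


def branch_and_merge(base: dict, data_slices: list, k: int, train_on_slice) -> dict:
    """One BaM run: branch k copies of the current base, train each copy on its own
    data slice (in parallel in the paper; sequentially in this sketch), then merge
    the branches into the base model for the next iteration."""
    for i in range(0, len(data_slices), k):
        branches = [
            train_on_slice(copy.deepcopy(base), s)  # independent fine-tune per slice
            for s in data_slices[i:i + k]
        ]
        base = merge_models(branches)  # merged model seeds the next iteration
    return base
```

The key design point, as the paper argues, is that each branch sees only a fraction of the data, so its weight delta stays small; averaging the branches then keeps the update direction that the slices agree on while damping slice-specific drift away from the source domain.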

Results

The experimental evaluation demonstrates the effectiveness of BaM across various scenarios, including continued pretraining (CPT) and instruction finetuning (IFT), in both an alphabet-sharing language (German) and a non-alphabet-sharing language (Bulgarian). Key findings include:

  • Bulgarian CPT: BaM outperforms standard continued pretraining in both retention of source-language knowledge (English) and adaptation to the target language (Bulgarian). For instance, BaM achieved 1.4% higher average English benchmark performance while maintaining comparable or better Bulgarian performance than standard CPT.
  • German CPT: Similar trends were observed, with BaM improving performance on German benchmarks by 0.7% and on English benchmarks by 1.0% compared to standard CPT.
  • Instruction Tuning: BaM showed significant improvements in mitigating forgetting during instruction finetuning, outperforming standard methods. For example, BaM instruction tuning of the Llama-3 8B model for Bulgarian outperformed the native Llama-3 8B-Instruct model by 10.9% in Bulgarian and 1.3% in English.

Implications and Future Directions

The implications of this research are both practical and theoretical. Practically, BaM provides a robust methodology for adapting LLMs to low-resource languages without significant loss of capabilities in the source language, enhancing the utility of these models in diverse linguistic contexts. Theoretically, the approach opens avenues for further exploration in model merging and continual learning, particularly in refining the balance between learning new tasks and retaining prior knowledge.

Future work could extend this approach to other domains beyond language adaptation, potentially generalizing BaM to broader scenarios of domain adaptation. Additionally, exploring the combination of BaM with other regularization techniques and conducting more extensive evaluations across different languages and tasks will be valuable.

Conclusion

The Branch-and-Merge framework presents an effective solution to the persistent problem of catastrophic forgetting in language transfer, offering a way to adapt LLMs to new languages while preserving their original capabilities. By focusing on iterative model merging and high-quality weight changes, BaM improves both learning and retention, making it a promising approach for future advancements in LLM adaptation and beyond.

Authors (6)
  1. Anton Alexandrov
  2. Veselin Raychev
  3. Ce Zhang
  4. Martin Vechev
  5. Kristina Toutanova
  6. Mark Niklas Müller