Mitigating Catastrophic Forgetting in Language Transfer via Model Merging
The paper presents a novel approach called Branch-and-Merge (BaM) to address the critical issue of catastrophic forgetting when adapting LLMs to new languages. Catastrophic forgetting severely limits the usefulness of adapted models, since they tend to lose their previously acquired capabilities during the adaptation process. The primary contribution of this work is BaM, which iteratively merges models fine-tuned on subsets of the training data, yielding lower-magnitude but higher-quality weight changes and thereby mitigating catastrophic forgetting.
Methodology
The BaM method is inspired by principles from continual learning, leveraging lower-magnitude but higher-quality weight changes to balance learning on the new domain with retaining knowledge from the source domain. The approach partitions the training data into multiple slices and, in each iteration, trains multiple models in parallel on different slices. The resulting models are then merged to form the base model for the next iteration, and this process repeats until all data slices have been used. BaM employs model merging techniques such as linear interpolation and spherical linear interpolation (Slerp), with Slerp showing slightly better performance.
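To make the iteration concrete, the sketch below outlines one BaM run under simplifying assumptions: models are represented as plain PyTorch state dicts, branch training is delegated to a hypothetical train_on_slice callback, and the merge uses either uniform averaging or the standard Slerp formula. This is a minimal illustration of the idea, not the authors' implementation.

```python
# Minimal sketch of a Branch-and-Merge (BaM) run. Assumptions: models are state
# dicts of equal-shaped tensors; `train_on_slice` is a hypothetical callback that
# fine-tunes a copy of the current model on one data slice and returns its weights.
import torch


def slerp(theta_a: torch.Tensor, theta_b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors of the same shape."""
    a, b = theta_a.flatten(), theta_b.flatten()
    # Angle between the two weight vectors, computed on normalized copies.
    cos_omega = torch.clamp((a / (a.norm() + eps)) @ (b / (b.norm() + eps)), -1.0, 1.0)
    omega = torch.acos(cos_omega)
    so = torch.sin(omega)
    if so.abs() < eps:  # nearly parallel weights: fall back to linear interpolation
        merged = (1 - t) * a + t * b
    else:
        merged = (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
    return merged.view_as(theta_a)


def merge_models(models: list[dict], use_slerp: bool = True) -> dict:
    """Merge branch state dicts: running pairwise Slerp, or uniform (linear) averaging."""
    if not use_slerp:
        return {k: torch.stack([m[k] for m in models]).mean(dim=0) for k in models[0]}
    merged = models[0]
    for i, m in enumerate(models[1:], start=2):
        # t = 1/i keeps all branches equally weighted in the running merge.
        merged = {k: slerp(merged[k], m[k], t=1.0 / i) for k in merged}
    return merged


def branch_and_merge(base: dict, data_slices: list, k: int, train_on_slice) -> dict:
    """Per iteration: branch into k models trained on disjoint slices, then merge."""
    model = base
    for group in [data_slices[i:i + k] for i in range(0, len(data_slices), k)]:
        branches = [train_on_slice(model, s) for s in group]  # trained in parallel in practice
        model = merge_models(branches)                        # merged model seeds the next iteration
    return model
```

Because each branch sees only a slice of the data, its weights stay close to the shared base, and merging the branches averages out slice-specific drift while keeping the changes that the branches agree on.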
Results
The experimental evaluation demonstrates the effectiveness of BaM across various scenarios, including continued pretraining (CPT) and instruction tuning (IFT) in both alphabet-sharing (German) and non-alphabet-sharing (Bulgarian) languages. Key findings include:
- Bulgarian CPT: BaM outperforms standard continued pretraining in both retention of source-language (English) knowledge and adaptation to the target language (Bulgarian). For instance, BaM achieves higher average English benchmark performance while maintaining comparable or better Bulgarian performance than standard CPT.
- German CPT: Similar trends hold, with BaM improving performance on both German and English benchmarks relative to standard CPT.
- Instruction Tuning: BaM showed significant improvements in mitigating forgetting during instruction finetuning, outperforming standard methods. For example, BaM instruction tuning of the \llama[8B] model for Bulgarian outperformed the native \llama[8B-Instruct] model on both Bulgarian and English benchmarks.
Implications and Future Directions
The implications of this research are both practical and theoretical. Practically, BaM provides a robust methodology for adapting LLMs to low-resource languages without significant loss of capabilities in the source language, enhancing the utility of these models in diverse linguistic contexts. Theoretically, the approach opens avenues for further exploration in model merging and continual learning, particularly in refining the balance between learning new tasks and retaining prior knowledge.
Future work could extend this approach to other domains beyond language adaptation, potentially generalizing BaM to broader scenarios of domain adaptation. Additionally, exploring the combination of BaM with other regularization techniques and conducting more extensive evaluations across different languages and tasks will be valuable.
Conclusion
The Branch-and-Merge framework presents an effective solution to the persistent problem of catastrophic forgetting in language transfer, offering a way to adapt LLMs to new languages while preserving their original capabilities. By focusing on iterative model merging and high-quality weight changes, BaM improves both learning and retention, making it a promising approach for future advancements in LLM adaptation and beyond.