- The paper introduces a novel fine-tuning transfer method that reuses diff vectors to update LLMs without full retraining.
- It demonstrates significant benchmark improvements, including a 10.7% accuracy boost on GPQA and notable gains in multilingual settings.
- The approach provides a robust initialization for further tuning, enabling faster convergence and efficient iterative model upgrades.
This paper addresses the challenge of efficiently updating LLMs. When a new version of a base pretrained LLM is released, existing fine-tuned models (e.g., for specific tasks, domains, or languages) typically need to be retrained on the new base model, which is computationally expensive and time-consuming. The authors propose a method called "fine-tuning transfer" to mitigate this cost.
The core idea is to reuse the changes learned during fine-tuning on a source model version $s$ and apply them to a target model version $t$. This is done by computing a "diff vector" $\Delta_s$, the element-wise difference between the fine-tuned source model weights $m'_s$ and the base source model weights $m_s$:

$$\Delta_s = m'_s - m_s$$

This diff vector $\Delta_s$ is then added to the base target model weights $m_t$ to create a new, potentially improved model without requiring gradient-based training on the target version:

$$m'_t \approx m_t + \Delta_s$$
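In weight terms, the whole method is two element-wise tensor operations. Below is a minimal sketch using PyTorch and Hugging Face `transformers`; the model identifiers are placeholders (not checkpoints from the paper), and it assumes the source and target versions share the same architecture and parameter names, as the Llama 3.0/3.1 pairs studied here do.

```python
# Minimal sketch of fine-tuning transfer via a diff vector.
# Model names below are placeholders, not checkpoints from the paper.
import torch
from transformers import AutoModelForCausalLM

def load_state_dict(name: str) -> dict:
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
    return model.state_dict()

m_s = load_state_dict("org/source-base")          # m_s: source base weights
m_s_ft = load_state_dict("org/source-finetuned")  # m'_s: fine-tuned source weights
m_t = load_state_dict("org/target-base")          # m_t: target base weights

# Diff vector: element-wise difference between fine-tuned and base source weights.
delta_s = {k: m_s_ft[k] - m_s[k] for k in m_s}

# Transfer: add the diff vector to the target base weights (no gradient training).
m_t_approx = {k: m_t[k] + delta_s[k] for k in m_t}

target = AutoModelForCausalLM.from_pretrained("org/target-base", torch_dtype=torch.bfloat16)
target.load_state_dict(m_t_approx)
target.save_pretrained("target-transferred")
```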
The authors explore two scenarios:
- Recycling: Transferring updates from an older source version to a newer target version (e.g., Llama 3.0 -> Llama 3.1), avoiding the need to re-run alignment procedures such as instruction tuning.
- Backporting: Transferring updates from a newer source version to an older target version (e.g., Llama 3.1 -> Llama 3.0) when the older base model might be better optimized for a specific use case but could benefit from newer fine-tuning improvements.
Key Findings and Experiments:
- Feasibility: Experiments across different versions of Llama, OLMo, and Tülu models show that simply adding the diff vector $\Delta_s$ to the target base model $m_t$ significantly improves performance on various benchmarks (MMLU, GSM8K, MATH, ARC-C, GPQA, IFEval). In many cases, the performance of $m_t + \Delta_s$ is comparable to that of the actually fine-tuned target model $m'_t$. For instance, recycling the Llama 3.0 8B Instruct updates to the Llama 3.1 8B base model improved GPQA accuracy by 10.7%, surpassing the official Llama 3.1 8B Instruct model without any training. The method successfully transfers instruction-following and reasoning capabilities.
- Multilingual Application: The paper demonstrates the effectiveness of recycling in a multilingual setting. By fine-tuning Llama 3.0 8B Instruct for Malagasy and Turkish and transferring the diff vectors to Llama 3.1 8B Instruct, the authors achieved significant improvements (absolute gains of 4.7% and 15.5% on Global MMLU, respectively) over the base Llama 3.1 Instruct model without retraining. This is particularly useful for low-resource languages, where frequent retraining is costly. The effectiveness depends on whether the fine-tuned source model still outperforms the target base model in the target language.
- Conditions for Effectiveness: Using intermediate checkpoints of OLMo 2 7B as different "versions," controlled experiments revealed that fine-tuning transfer works best when the source model $m_s$ and target model $m_t$ are close in parameter space and exhibit linear mode connectivity (see the interpolation sketch after this list). Furthermore, the target base model $m_t$ needs a certain level of capability to leverage the transferred updates effectively; weaker base models showed less benefit.
- Starting Point for Further Fine-tuning: The merged model $m_t + \Delta_s$ serves as a strong initialization for further fine-tuning. This "transferring-then-finetuning" approach leads to faster convergence and often achieves higher final performance than fine-tuning the target base model $m_t$ from scratch. It also generalizes well to unseen tasks, suggesting it does not cause overfitting.
- Iterative Development: For scenarios with continuous model releases, the paper proposes an "iterative recycling-then-finetuning" algorithm: rather than applying only the latest diff vector, it accumulates diff vectors across versions (see the second sketch below). This iterative approach improved both performance and training efficiency compared to directly applying only the immediately preceding version's diff vector or standard fine-tuning.
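A simple way to probe the linear mode connectivity condition from the "Conditions for Effectiveness" finding is to evaluate the loss along the straight line between the two versions' weights: a flat curve with no loss barrier suggests the versions are linearly mode connected. The sketch below reuses the state-dict convention from the earlier example; `eval_loss` is a hypothetical callable that scores a set of weights on held-out data, not an API from the paper.

```python
# Sketch of a linear mode connectivity probe between two model versions.
# eval_loss is a hypothetical function: weights dict -> held-out loss.
def interpolate(m_a: dict, m_b: dict, alpha: float) -> dict:
    """Weights on the straight line (1 - alpha) * m_a + alpha * m_b."""
    return {k: (1 - alpha) * m_a[k] + alpha * m_b[k] for k in m_a}

def connectivity_curve(m_s: dict, m_t: dict, eval_loss, steps: int = 11) -> list:
    """Loss along the linear path; a flat curve (no barrier) suggests connectivity."""
    return [eval_loss(interpolate(m_s, m_t, i / (steps - 1))) for i in range(steps)]
```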
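The iterative procedure can be sketched as follows, under the same state-dict convention and same-architecture assumption. Each step is exactly the transferring-then-finetuning recipe above: recycle the accumulated diff vector into the newly released base, then fine-tune from that initialization. `fine_tune` is a hypothetical training routine standing in for any standard fine-tuning loop.

```python
# Sketch of iterative recycling-then-finetuning over successive base releases.
# fine_tune is a hypothetical routine: (init weights, data) -> fine-tuned weights.
def iterative_recycle(base_versions: list, train_data, fine_tune) -> dict:
    prev_base = base_versions[0]
    prev_ft = fine_tune(prev_base, train_data)  # fine-tune the first release directly
    for base in base_versions[1:]:
        # Accumulated diff vector: all fine-tuning applied on top of the previous base.
        delta = {k: prev_ft[k] - prev_base[k] for k in prev_base}
        init = {k: base[k] + delta[k] for k in base}  # recycle into the new release
        prev_base, prev_ft = base, fine_tune(init, train_data)  # then fine-tune
    return prev_ft
```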
Contributions:
- Introduces fine-tuning transfer via diff vectors between model versions.
- Demonstrates its effectiveness in reducing training costs while maintaining performance.
- Validates the approach for efficient multilingual model development.
- Identifies linear mode connectivity and sufficient target model capability as conditions for success.
- Proposes "transferring-then-finetuning" and "iterative recycling-then-finetuning" strategies for enhanced performance and efficiency.
In conclusion, the paper presents fine-tuning transfer as a practical and computationally efficient method to handle frequent updates in LLM development, allowing developers to leverage improvements from new base models without the full cost of retraining fine-tuned versions.