- The paper introduces a novel fine-tuning transfer method that reuses diff vectors to update LLMs without full retraining.
- It demonstrates significant benchmark improvements, including a 10.7% accuracy boost on GPQA and notable gains in multilingual settings.
- The approach provides a robust initialization for further tuning, enabling faster convergence and efficient iterative model upgrades.
This paper addresses the challenge of efficiently updating LLMs. When a new version of a base pretrained LLM is released, existing fine-tuned models (e.g., for specific tasks, domains, or languages) typically need to be retrained on the new base model, which is computationally expensive and time-consuming. The authors propose a method called "fine-tuning transfer" to mitigate this cost.
The core idea is to reuse the changes learned during fine-tuning on a source model version $s$ and apply them to a target model version $t$. This is done by computing a "diff vector" $\Delta_s$, the element-wise difference between the fine-tuned source model weights $m'_s$ and the base source model weights $m_s$:

$$\Delta_s = m'_s - m_s$$

This diff vector $\Delta_s$ is then added to the base target model weights $m_t$ to create a new, potentially improved model without requiring gradient-based training on the target version:

$$m'_t \approx m_t + \Delta_s$$
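In weight terms, the whole method is two element-wise tensor operations. Below is a minimal sketch using PyTorch and Hugging Face `transformers`; the model identifiers are placeholders (not checkpoints from the paper), and it assumes the source and target versions share the same architecture and parameter names, as the Llama 3.0/3.1 pairs studied here do.

```python
# Minimal sketch of fine-tuning transfer via a diff vector.
# Model names below are placeholders, not checkpoints from the paper.
import torch
from transformers import AutoModelForCausalLM

def load_state_dict(name: str) -> dict:
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
    return model.state_dict()

m_s = load_state_dict("org/source-base")          # m_s: source base weights
m_s_ft = load_state_dict("org/source-finetuned")  # m'_s: fine-tuned source weights
m_t = load_state_dict("org/target-base")          # m_t: target base weights

# Diff vector: element-wise difference between fine-tuned and base source weights.
delta_s = {k: m_s_ft[k] - m_s[k] for k in m_s}

# Transfer: add the diff vector to the target base weights (no gradient training).
m_t_approx = {k: m_t[k] + delta_s[k] for k in m_t}

target = AutoModelForCausalLM.from_pretrained("org/target-base", torch_dtype=torch.bfloat16)
target.load_state_dict(m_t_approx)
target.save_pretrained("target-transferred")
```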
The authors explore two scenarios:
- Recycling: Transferring updates from an older source version to a newer target version (e.g., Llama 3.0 -> Llama 3.1), avoiding the need to re-run alignment procedures such as instruction tuning.
- Backporting: Transferring updates from a newer source version to an older target version (e.g., Llama 3.1 -> Llama 3.0) when the older base model might be better optimized for a specific use case but could benefit from newer fine-tuning improvements.
Key Findings and Experiments:
- Feasibility: Experiments across different versions of Llama, OLMo, and Tülu models show that simply adding the diff vector $\Delta_s$ to the target base model $m_t$ significantly improves performance on various benchmarks (MMLU, GSM8K, MATH, ARC-C, GPQA, IFEval). In many cases, the performance of $m_t + \Delta_s$ is comparable to that of the actually fine-tuned target model $m'_t$. For instance, recycling the Llama 3.0 8B Instruct updates to the Llama 3.1 8B base model improved GPQA accuracy by 10.7%, surpassing the official Llama 3.1 8B Instruct model without any training. The method successfully transfers instruction-following and reasoning capabilities.
- Multilingual Application: The paper demonstrates the effectiveness of recycling in a multilingual setting. By fine-tuning Llama 3.0 8B Instruct for Malagasy and Turkish and transferring the diff vectors to Llama 3.1 8B Instruct, the authors achieved significant improvements (absolute gains of 4.7% and 15.5% on Global MMLU, respectively) over the base Llama 3.1 Instruct model without retraining. This is particularly useful for low-resource languages, where frequent retraining is costly. The effectiveness depends on whether the fine-tuned source model still outperforms the target base model in the target language.
- Conditions for Effectiveness: Using intermediate checkpoints of OLMo 2 7B as different "versions," controlled experiments revealed that fine-tuning transfer works best when the source model $m_s$ and target model $m_t$ are close in parameter space and exhibit linear mode connectivity (see the interpolation sketch after this list). Furthermore, the target base model $m_t$ needs a certain level of capability to leverage the transferred updates effectively; weaker base models showed less benefit.
- Starting Point for Further Fine-tuning: The merged model $m_t + \Delta_s$ serves as a strong initialization for further fine-tuning. This "transferring-then-finetuning" approach leads to faster convergence and often achieves higher final performance than fine-tuning the target base model $m_t$ from scratch. It also generalizes well to unseen tasks, suggesting it does not cause overfitting.
- Iterative Development: For scenarios with continuous model releases, the paper proposes an "iterative recycling-then-finetuning" algorithm: rather than applying only the latest diff vector, it accumulates diff vectors across versions (see the second sketch below). This iterative approach improved both performance and training efficiency compared to directly applying only the immediately preceding version's diff vector or standard fine-tuning.
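A simple way to probe the linear mode connectivity condition from the "Conditions for Effectiveness" finding is to evaluate the loss along the straight line between the two versions' weights: a flat curve with no loss barrier suggests the versions are linearly mode connected. The sketch below reuses the state-dict convention from the earlier example; `eval_loss` is a hypothetical callable that scores a set of weights on held-out data, not an API from the paper.

```python
# Sketch of a linear mode connectivity probe between two model versions.
# eval_loss is a hypothetical function: weights dict -> held-out loss.
def interpolate(m_a: dict, m_b: dict, alpha: float) -> dict:
    """Weights on the straight line (1 - alpha) * m_a + alpha * m_b."""
    return {k: (1 - alpha) * m_a[k] + alpha * m_b[k] for k in m_a}

def connectivity_curve(m_s: dict, m_t: dict, eval_loss, steps: int = 11) -> list:
    """Loss along the linear path; a flat curve (no barrier) suggests connectivity."""
    return [eval_loss(interpolate(m_s, m_t, i / (steps - 1))) for i in range(steps)]
```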
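The iterative procedure can be sketched as follows, under the same state-dict convention and same-architecture assumption. Each step is exactly the transferring-then-finetuning recipe above: recycle the accumulated diff vector into the newly released base, then fine-tune from that initialization. `fine_tune` is a hypothetical training routine standing in for any standard fine-tuning loop.

```python
# Sketch of iterative recycling-then-finetuning over successive base releases.
# fine_tune is a hypothetical routine: (init weights, data) -> fine-tuned weights.
def iterative_recycle(base_versions: list, train_data, fine_tune) -> dict:
    prev_base = base_versions[0]
    prev_ft = fine_tune(prev_base, train_data)  # fine-tune the first release directly
    for base in base_versions[1:]:
        # Accumulated diff vector: all fine-tuning applied on top of the previous base.
        delta = {k: prev_ft[k] - prev_base[k] for k in prev_base}
        init = {k: base[k] + delta[k] for k in base}  # recycle into the new release
        prev_base, prev_ft = base, fine_tune(init, train_data)  # then fine-tune
    return prev_ft
```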
Contributions:
- Introduces fine-tuning transfer via diff vectors between model versions.
- Demonstrates its effectiveness in reducing training costs while maintaining performance.
- Validates the approach for efficient multilingual model development.
- Identifies linear mode connectivity and sufficient target model capability as conditions for success.
- Proposes "transferring-then-finetuning" and "iterative recycling-then-finetuning" strategies for enhanced performance and efficiency.
In conclusion, the paper presents fine-tuning transfer as a practical and computationally efficient method to handle frequent updates in LLM development, allowing developers to leverage improvements from new base models without the full cost of retraining fine-tuned versions.