
Shadow-FT: Transfer Finetuning for LLMs

Updated 19 March 2026
  • Shadow-FT is a transfer-based finetuning methodology that leverages the near-equivalence in weight space between Base and Instruct models to improve performance.
  • It transfers task-specific updates learned on a Base model to an Instruct model, yielding consistent gains of 2–4 points across diverse benchmarks.
  • The approach adds no parameters or extra training passes, supports methods like LoRA, and reports reproducible results across varied domain adaptations.

Shadow-FT is a transfer-based finetuning methodology for LLMs that leverages the near-equivalence in weight space between unaligned base models (“Base”) and their corresponding instruction-tuned variants (“Instruct”). Rather than directly tuning Instruct models—a process that often yields marginal gains or even degrades performance—Shadow-FT exploits the “weight-shadowing” phenomenon: task adaptations learned on Base models can be transferred to Instruct models, consistently yielding superior results on a broad array of language benchmarks. This approach introduces no additional parameters or specialized objectives and supports integration with both full-parameter and parameter-efficient finetuning algorithms such as LoRA (Wu et al., 19 May 2025).

1. Motivations and Empirical Background

Empirical studies on mainstream LLMs (including Qwen, Llama, Gemma, Falcon, Yi) reveal that directly finetuning Instruct models (e.g., via supervised finetuning or LoRA) yields at best marginal improvements and often degrades performance across diverse language, code, and math tasks. For instance, tuning Qwen-3-4B-Instruct with vanilla LoRA on a mixture of math, code, and reasoning examples lowers Math-7 (73.8 → 71.2), Code-3 (66.4 → 62.5), and Reasoning-9 (63.7 → 61.1). This phenomenon persists across different model sizes and finetuning regimes.

The underlying basis for Shadow-FT is the observed negligible difference between weights of paired Base and Instruct models. The relative gap ratio $\sigma$ between corresponding weights $W_B$ (Base) and $W_I$ (Instruct), defined as

$$\sigma = \frac{\sum |W_B - W_I|}{\sum |W_B| + \sum |W_I|},$$

yields $\sigma \approx 0.016$ (less than 2%) for Llama-3.1-8B, and always $\sigma < 0.05$ for all commonly released model pairs. This close similarity enables seamless transfer of parameter updates between Base and Instruct, motivating the Shadow-FT methodology (Wu et al., 19 May 2025).
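As a rough illustration of the gap ratio defined above, the sketch below evaluates $\sigma$ on toy weights. It assumes the sums run over every parameter of the model (the text does not say whether the ratio is aggregated per layer or globally), and it uses plain Python lists as a stand-in for real tensors. The function name `gap_ratio` and the toy values are hypothetical.

```python
from typing import Mapping, Sequence


def gap_ratio(base: Mapping[str, Sequence[float]],
              instruct: Mapping[str, Sequence[float]]) -> float:
    """Relative gap: sum|W_B - W_I| / (sum|W_B| + sum|W_I|), over all parameters."""
    if base.keys() != instruct.keys():
        raise ValueError("Base and Instruct must expose the same parameter names")
    diff = 0.0
    norm = 0.0
    for name, w_b in base.items():
        w_i = instruct[name]
        if len(w_b) != len(w_i):
            raise ValueError(f"size mismatch for {name}")
        diff += sum(abs(b - i) for b, i in zip(w_b, w_i))
        norm += sum(abs(b) for b in w_b) + sum(abs(i) for i in w_i)
    return diff / norm


# Toy weights (hypothetical values, flattened per parameter).
toy_base = {"layer.weight": [1.0, -2.0, 3.0], "layer.bias": [0.5]}
toy_instruct = {"layer.weight": [1.1, -2.0, 2.9], "layer.bias": [0.5]}
sigma = gap_ratio(toy_base, toy_instruct)  # 0.2 / 13, about 0.0154
```

Identical weights give $\sigma = 0$, and the toy pair lands near the 2% scale quoted for Llama-3.1-8B only because the values were chosen that way.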

2. Methodological Formulation

The core Shadow-FT procedure is as follows:

  1. Initialize with paired Base ($W_B$) and Instruct ($W_I$) models of identical architecture.
  2. Fine-tune the Base model $W_B$ on the target task dataset $D$ using the desired tuning algorithm (e.g., SFT, LoRA) to obtain $W_B^{+}$.
  3. Extract the task update: $\Delta W = W_B^{+} - W_B$.
  4. Apply (graft) the task update to the Instruct model: $W_I^{+} = W_I + \Delta W$.
  5. Perform inference with $W_I^{+}$.

This pipeline is compatible with full-parameter finetuning and PEFT. For low-rank adaptation (LoRA), $\Delta W$ is simply the learned low-rank matrix (e.g., $AB$), directly grafted onto Instruct. No extra parameters, passes, or specialized losses are introduced.
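For the LoRA case, the graft amounts to adding the product of the two low-rank factors to the Instruct weight. The following is a minimal sketch on a single toy matrix, not the authors' implementation: `matmul` and `graft_lora` are hypothetical helpers, the factor values are made up, and `scale` is an assumed stand-in for whatever scaling factor a given LoRA setup applies to the product.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for a toy example."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    return [[sum(row[k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))] for row in a]


def graft_lora(w_instruct: Matrix, a: Matrix, b: Matrix, scale: float = 1.0) -> Matrix:
    """Return W_I + scale * (A @ B) without modifying w_instruct."""
    delta = matmul(a, b)
    if len(delta) != len(w_instruct) or len(delta[0]) != len(w_instruct[0]):
        raise ValueError("low-rank update does not match the weight shape")
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_instruct, delta)]


# Rank-1 factors standing in for adapters trained on Base (hypothetical values).
A = [[1.0], [2.0]]          # shape 2 x 1
B = [[0.5, -0.5, 0.0]]      # shape 1 x 3
W_I = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0]]     # shape 2 x 3
W_I_plus = graft_lora(W_I, A, B)
```

Because the adapter is trained against Base but merged into Instruct, only the factors need to be stored per task; the Instruct weights themselves are never trained.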

Pseudocode:

procedure SHADOW_FT(BaseWeights W_B, InstructWeights W_I, Data D, TuningMethod Tune)
    W_base = copy(W_B)
    W_base_plus = Tune(W_base, D)
    ΔW = W_base_plus - W_B
    W_instruct_plus = W_I + ΔW
    return W_instruct_plus
end procedure
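The pseudocode above can be made runnable in a few lines. This is a sketch under stated assumptions: weights are modelled as a dictionary from parameter name to a plain float, and `toy_tune` is a hypothetical stand-in for a real tuning run (SFT or LoRA on a dataset), so the dataset argument is folded into the `tune` callable.

```python
from typing import Callable, Dict

Weights = Dict[str, float]  # toy stand-in: one scalar per parameter name


def shadow_ft(base: Weights, instruct: Weights,
              tune: Callable[[Weights], Weights]) -> Weights:
    """Tune a copy of Base, then add the resulting delta to Instruct."""
    if base.keys() != instruct.keys():
        raise ValueError("Base and Instruct must share parameter names")
    tuned = tune(dict(base))                       # W_B^+ (the copy protects `base`)
    delta = {k: tuned[k] - base[k] for k in base}  # delta W = W_B^+ - W_B
    return {k: instruct[k] + delta[k] for k in instruct}  # W_I^+ = W_I + delta W


def toy_tune(weights: Weights) -> Weights:
    """Hypothetical tuning step: shifts every parameter by 0.25."""
    return {k: v + 0.25 for k, v in weights.items()}


base = {"w": 1.0, "b": -0.5}
instruct = {"w": 1.5, "b": -0.5}
instruct_plus = shadow_ft(base, instruct, toy_tune)
```

The same arithmetic applies unchanged to any value type that supports `+` and `-`, which is why the sketch keeps the tuning routine as an opaque callable.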

3. Experimental Results and Benchmarking

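Extensive evaluation compares Shadow-FT against direct finetuning of the Instruct model, both full-parameter (FT) and LoRA, across a variety of models (Qwen 3, Llama 3 series) and 19 benchmarks (7 math, 3 code, 9 reasoning). "Vanilla" denotes the Instruct model without further tuning.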

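| Model | Method | Math-7 | Code-3 | Reasoning-9 | Avg. |
|---|---|---|---|---|---|
| Qwen-3-4B | Vanilla | 73.8 | 66.4 | 63.7 | 68.0 |
| Qwen-3-4B | FT (full) | 72.9 | 66.4 | 62.9 | 66.2 |
| Qwen-3-4B | FT (LoRA) | 71.2 | 62.5 | 61.1 | 65.0 |
| Qwen-3-4B | Shadow-FT (full) | 73.7 | 67.4 | 64.9 | 68.7 |
| Qwen-3-4B | Shadow-FT (LoRA) | 75.9 | 70.5 | 65.0 | 70.5 |
| Qwen-3-14B | Vanilla | 75.8 | 76.8 | 71.2 | 74.6 |
| Qwen-3-14B | FT (full) | 75.2 | 76.2 | 70.6 | 73.4 |
| Qwen-3-14B | FT (LoRA) | 73.3 | 74.4 | 70.4 | 72.7 |
| Qwen-3-14B | Shadow-FT (full) | 78.9 | 77.0 | 71.4 | 75.8 |
| Qwen-3-14B | Shadow-FT (LoRA) | 78.6 | 77.8 | 71.5 | 75.9 |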

Shadow-FT achieves consistent average gains of roughly 2–4 points over conventional tuning. On domain adaptation tasks (medical, code, math), Shadow-FT outperforms both the untuned Instruct model and direct LoRA tuning by 1–7 points depending on model and task. A LoRA rank ablation shows monotonic improvement for Shadow-FT (LoRA) as the rank increases, whereas direct LoRA tuning plateaus and then degrades.
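To make the size of the gains concrete, the short script below recomputes the difference in the Avg. column between Shadow-FT and direct tuning for each model and method. The numbers are copied from the table above; the dictionary layout and variable names are illustrative only.

```python
# Avg. scores from the table above: (direct tuning, Shadow-FT).
avg_scores = {
    ("Qwen-3-4B", "full"): (66.2, 68.7),
    ("Qwen-3-4B", "LoRA"): (65.0, 70.5),
    ("Qwen-3-14B", "full"): (73.4, 75.8),
    ("Qwen-3-14B", "LoRA"): (72.7, 75.9),
}

# Round to one decimal to match the precision of the reported scores.
gains = {key: round(shadow - direct, 1)
         for key, (direct, shadow) in avg_scores.items()}

for (model, method), gain in gains.items():
    print(f"{model} {method}: +{gain} average points over direct tuning")
```

For these four rows the differences are +2.5, +5.5, +2.4, and +3.2 points respectively.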

4. Extensions: Preference Optimization and Multimodal LLMs

Shadow-FT is compatible with preference optimization techniques such as Direct Preference Optimization (DPO). In Shadow-DPO, DPO is applied to the Base model, and the learned preference updates are transferred to the Instruct model. For Llama-3.1-8B, Shadow-DPO with LoRA ($r=128$) marginally outperforms standard DPO (avg. 55.39 vs. 54.62).

The methodology also extends to multimodal models (MLLMs). Experiments in the ChartQA domain show that Shadow-FT (LoRA, $r=128$) on Gemma-3-27B improves average accuracy from 60.28 (vanilla LoRA) to 63.80. Similar or superior results are observed with Llama-3.2-Vision and Qwen-3-Vision models.

5. Implementation and Reproducibility

Shadow-FT requires only access to a matched pair of Base and Instruct weights. The approach is implemented on top of LLaMA-Factory and HuggingFace Transformers, with all task deltas and evaluation scripts publicly available. Typical training runs for a single epoch with a grid search over learning rates, batch size 2 with 16 gradient accumulation steps (effective batch size 32), a sequence cutoff of 4096 tokens, and LoRA rank $r=128$. All primary results are reproducible in under four hours on an 8×A100 setup. Users must ensure architectural parity between Base and Instruct models.
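The architectural-parity requirement can be checked before any training is launched. The sketch below compares parameter names and shapes of the two checkpoints; it is an assumed pre-flight check rather than part of the released tooling, and `parity_problems` together with the toy shape tables is hypothetical.

```python
from typing import Dict, List, Tuple

Shapes = Dict[str, Tuple[int, ...]]  # parameter name -> tensor shape


def parity_problems(base: Shapes, instruct: Shapes) -> List[str]:
    """List every mismatch that would make W_I + delta W ill-defined."""
    problems = []
    for name in sorted(base.keys() - instruct.keys()):
        problems.append(f"only in Base: {name}")
    for name in sorted(instruct.keys() - base.keys()):
        problems.append(f"only in Instruct: {name}")
    for name in sorted(base.keys() & instruct.keys()):
        if base[name] != instruct[name]:
            problems.append(
                f"shape mismatch for {name}: {base[name]} vs {instruct[name]}")
    return problems


# Hypothetical shape tables; the Instruct embedding has two extra rows.
base_shapes = {"embed.weight": (1000, 64), "mlp.weight": (64, 256)}
instruct_shapes = {"embed.weight": (1002, 64), "mlp.weight": (64, 256)}
issues = parity_problems(base_shapes, instruct_shapes)
```

With PyTorch models, a shape table of this form could be built as `{name: tuple(t.shape) for name, t in model.state_dict().items()}`; an empty result means the element-wise graft is well defined for every parameter.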

When only an Instruct checkpoint is released and no Base is available (e.g., Qwen-3-32B), Shadow-FT cannot be applied directly. De-tuning procedures or “proxy shadow” models are proposed as possible workarounds for future work (Wu et al., 19 May 2025).

6. Analysis, Limitations, and Implications

Shadow-FT consistently prevents the performance "degeneration" frequently observed when fine-tuning Instruct models directly. Empirically, updates learned in Base models generalize better, possibly due to preserved instruction alignment axes in the target model. The approach introduces zero additional parameters, requires no extra forward/backward passes, and integrates seamlessly with both SFT and PEFT paradigms.
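One algebraic consequence of the grafting rule may help in reading the remark about preserved alignment. It follows directly from the definitions $\Delta W = W_B^{+} - W_B$ and $W_I^{+} = W_I + \Delta W$, and is offered as an observation rather than as the paper's explanation of why Base-learned updates generalize better:

```latex
\begin{aligned}
W_I^{+} &= W_I + \Delta W = W_I + \left(W_B^{+} - W_B\right), \\
W_I^{+} - W_B^{+} &= W_I - W_B, \\
W_I^{+} - W_I &= W_B^{+} - W_B .
\end{aligned}
```

In words, the grafted model sits at exactly the same offset from the tuned Base as the original Instruct model did from the original Base, so the Instruct-minus-Base difference is carried over unchanged while the task update is added on top.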

A key limitation is the requirement for a paired Base checkpoint with identical architecture; if such a pairing does not exist, the method is not readily applicable. The theoretical reason for the superior generalization of Base-learned updates remains an open subject for future research (Wu et al., 19 May 2025).

7. Summary and Practical Guidance

Shadow-FT provides a robust, minimal-overhead approach for task adaptation in instruction-tuned LLMs. Researchers seeking to adapt Instruct models should:

  1. Obtain Base and Instruct model weights.
  2. Apply their standard finetuning method on Base, extract the update $\Delta W$, and transfer it to Instruct.
  3. Evaluate the adapted Instruct model on downstream tasks.

This procedure outperforms direct finetuning of the Instruct model, with either full-parameter tuning or LoRA, across diverse model families and benchmarks, and extends to preference optimization and multimodal tasks. Further developments may address the absence of accessible Base checkpoints for certain model families (Wu et al., 19 May 2025).
