
Fast Fisher Grafting: Efficient Model Merging

Updated 21 September 2025
  • Fast Fisher Grafting (FFG) is a curvature-informed technique that selects and sparsifies task-specific weight updates using second-moment estimates from adaptive optimizers.
  • It induces extremely low-rank masks concentrating on early attention and token embedding layers to localize and preserve essential task knowledge during model merging.
  • FFG employs memory-light curvature compression and curvature-aware aggregation methods to scale efficiently to large models while minimizing negative transfer.

Fast Fisher Grafting (FFG) is a curvature-informed parameter selection and merging methodology central to recent advances in efficient model merging for LLMs fine-tuned on distinct tasks. FFG exploits per-parameter second-moment estimates from adaptive optimizers to curate and sparsify task-relevant weight updates, thereby reducing destructive interference when composing multiple specialized models. The method induces extremely low-rank masks that localize knowledge crucial for each capability, especially in sensitive model subcomponents such as early attention and embedding layers. FFG is implemented together with curvature-aware merging strategies and is compatible with highly compressed curvature storage, allowing scalable application to large models while preserving the empirical effectiveness and diagnostic power of the approach (Mahdavinia et al., 14 Sep 2025).

1. Mathematical Foundation of Fast Fisher Grafting

Fast Fisher Grafting operates on the parameter updates obtained from supervised fine-tuning (SFT). Let each expert model start from a common base parameter vector $\mathbf{w}_0$ and optimize to a task-specific solution $\mathbf{w}^*$, yielding a task vector

$$\Delta \mathbf{w} = \mathbf{w}^* - \mathbf{w}_0.$$

The saliency of each parameter update is scored using a second-order Taylor approximation of the loss,

$$\Delta \mathcal{L} \approx \frac{1}{2} \sum_i H_{ii} \,(\Delta w_i)^2,$$

where $H_{ii}$ is the $i$-th diagonal element of the Hessian. Direct computation of the Hessian is intractable in large models, so FFG employs the optimizer's second-moment estimate $v_i$ (e.g., from Adam) as a diagonal curvature proxy. The per-parameter saliency becomes

$$s_i = (\Delta w_i)^2 \, v_i.$$

FFG selects the top-$k$ parameters, or a fixed fraction $\rho$, with the highest saliency to form a binary mask $\mathbf{m}$. The pruned (or "grafted") task vector is then

$$\Delta \mathbf{w}' = \mathbf{m} \circ \Delta \mathbf{w},$$

where $\circ$ denotes elementwise multiplication. Low-saliency coordinates revert to the base weights, concentrating the grafted knowledge in the most influential subset of parameters.
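The following minimal sketch illustrates the saliency scoring and masking step for a single weight tensor, assuming PyTorch tensors for the base weights, expert weights, and Adam second moments; the function name ffg_graft and the fixed keep-fraction rho are illustrative choices, not part of the published implementation.

```python
import torch

def ffg_graft(w_base: torch.Tensor, w_expert: torch.Tensor,
              v: torch.Tensor, rho: float = 0.10) -> torch.Tensor:
    """Keep the top-rho fraction of curvature-weighted task-vector coordinates;
    all remaining coordinates revert to the base weights."""
    delta = w_expert - w_base              # task vector Δw
    saliency = delta.pow(2) * v            # s_i = (Δw_i)^2 * v_i
    k = max(1, int(rho * delta.numel()))
    idx = torch.topk(saliency.flatten(), k).indices   # most salient coordinates
    mask = torch.zeros_like(delta).flatten()
    mask[idx] = 1.0
    mask = mask.view_as(delta)             # binary mask m
    return w_base + mask * delta           # grafted weights w_0 + m ∘ Δw
```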

2. Curvature-Driven Task Localization and Model Merging

FFG’s curvature-driven selection ensures that retained updates reside in highly task-salient and sensitive regions of the parameter space. This results in several empirically observed outcomes:

  • In LLMs spanning multiple expert SFT checkpoints, FFG-induced masks are extremely low-rank, concentrating in early attention query/key projections and token embeddings.
  • Mask structures frequently zero out entire rows or columns, reflecting the localization of task adaptation.
  • FFG serves as a denoiser, removing low-saliency or conflicting updates that would otherwise contribute to cross-task interference during model merging.

When merging multiple experts, FFG is integrated into a more general curvature-aware aggregation:

$$\mathbf{w}_{\text{merged}} = \mathbf{w}_0 + \left( \sum_{\tau=1}^{T} \mathbf{P}^*_\tau \right)^{-1} \left( \sum_{\tau=1}^{T} \mathbf{P}^*_\tau \,(\mathbf{m}_\tau \circ \Delta \mathbf{w}_\tau) \right),$$

where each expert's diagonal preconditioner $\mathbf{P}^*_\tau = \operatorname{Diag}\!\left(\sqrt{v_\tau + \epsilon}\right)$ emphasizes updates along directions of significant curvature. By retaining only the salient updates, FFG reduces destructive aggregation across unrelated tasks and enables robust model merging.
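A minimal sketch of this aggregation for a single weight tensor, assuming each expert's masked task vector and second-moment tensor are already available; ota_merge is an illustrative name, and the elementwise preconditioner follows the $\operatorname{Diag}(\sqrt{v_\tau + \epsilon})$ form given above.

```python
import torch

def ota_merge(w_base: torch.Tensor,
              masked_deltas: list,     # per-expert m_tau ∘ Δw_tau
              second_moments: list,    # per-expert v_tau
              eps: float = 1e-8) -> torch.Tensor:
    """Curvature-aware weighted average: each coordinate of each expert's
    masked update is weighted by its preconditioner sqrt(v_tau + eps)."""
    precond = [torch.sqrt(v + eps) for v in second_moments]
    numerator = sum(p * d for p, d in zip(precond, masked_deltas))
    denominator = sum(precond)
    return w_base + numerator / denominator
```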

A crucial empirical finding is the substantial curvature overlap between fine-tuned experts; the dominant regions of high curvature tend to coincide, facilitating effective parameter sharing and supporting the select-and-merge paradigm of FFG.

3. Memory-Light Compression of Second Moments

Full second-moment statistics for every parameter dramatically increase storage requirements. To address this, FFG uses a memory-light, AdaFactor-inspired compression. For each $m \times n$ weight matrix, only the row sums $\mathbf{r} \in \mathbb{R}^m$ and column sums $\mathbf{c} \in \mathbb{R}^n$ of the second-moment matrix are stored. The matrix is then reconstructed as

$$\hat{\mathbf{v}} = \frac{\mathbf{r}\,\mathbf{c}^\top}{\mathbf{1}_m^\top \mathbf{r}},$$

where $\mathbf{1}_m$ is the $m$-dimensional vector of ones. Empirically, the stable rank of these second-moment matrices is very low (often below 1.3), which justifies the accuracy of this rank-one approximation. The scheme reduces memory overhead by several orders of magnitude, and benchmarks confirm that the qualitative and quantitative behavior of FFG, including downstream merging quality, is preserved under the low-rank curvature approximation.
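A sketch of this compression for a single $m \times n$ second-moment matrix, assuming it is available as a PyTorch tensor; the function names are illustrative.

```python
import torch

def compress_second_moment(V: torch.Tensor):
    """Store only the row sums r (length m) and column sums c (length n)
    of the m x n second-moment matrix V."""
    return V.sum(dim=1), V.sum(dim=0)

def reconstruct_second_moment(r: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """Rank-one reconstruction: v_hat = r c^T / (1_m^T r)."""
    return torch.outer(r, c) / r.sum()
```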

4. Saliency Mask Structure and Curvature Overlap

The FFG mask structure has important implications:

  • Empirically, the saliency masks exhibit extremely low rank, indicating concentrated, coordinated parameter adaptation.
  • The masks are most prominent in token embeddings and the first attention query/key projection matrices.
  • Rows and columns are often pruned in contiguous blocks, underscoring structured task localization.

Curvature overlap between models, quantified by the overlap of high-curvature regions in $v_i$, accounts for the effectiveness of linear and curvature-aware merging even when tasks are only weakly related. FFG leverages this overlap: only directions commonly salient across experts are likely to be retained after masking, reducing the risk of destructive interference. This provides a new perspective on why simple linear merging strategies can perform surprisingly well in practice.
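One simple way to measure this overlap (an illustrative metric, not necessarily the one used in the paper) is the Jaccard overlap between the top-$\rho$ high-curvature coordinate sets of two experts:

```python
import torch

def curvature_overlap(v_a: torch.Tensor, v_b: torch.Tensor, rho: float = 0.10) -> float:
    """Jaccard overlap of the top-rho highest-curvature coordinates of two experts."""
    k = max(1, int(rho * v_a.numel()))
    top_a = set(torch.topk(v_a.flatten(), k).indices.tolist())
    top_b = set(torch.topk(v_b.flatten(), k).indices.tolist())
    return len(top_a & top_b) / len(top_a | top_b)
```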

5. Comparison to Previous Approaches and Role in Modern Model Merging

Previous model merging approaches typically relied on direct averaging of task vectors or orthogonal projection-based approaches to resolve conflicts. These do not account for the per-parameter curvature structure, leading to interference in under-constrained or high-conflict cases.

FFG differs by directly incorporating optimizer curvature as a saliency modulator, guiding both sparsification and task-vector selection. In conjunction with curvature-aware aggregation (Optimization Trajectory Aware Merging, or OTA), FFG outperforms strong weight-space baselines for multi-expert LLM merging by significantly reducing negative transfer (Mahdavinia et al., 14 Sep 2025).

Ablations confirm that without the FFG step, merged models deteriorate in composite capabilities and exhibit increased interference. With FFG and compressed curvature, merging performance is nearly unchanged relative to using full second-moment statistics, highlighting the robustness of the overall approach.

6. Practical Workflow and Benefits

The FFG methodology is fully compatible with large-scale transformer models and modern deep learning libraries. It requires only:

  • Base and expert model weights,
  • Optimizer second-moment statistics (direct from Adam or reconstructed via AdaFactor-style compression),
  • Standard tensor operations for saliency scoring, ranking, and mask application.

The workflow for merging with FFG is as follows:

  1. For each expert model, compute the task vector $\Delta \mathbf{w}$, obtain or reconstruct $v_i$, and score the saliency $s_i$.
  2. Apply a mask $\mathbf{m}$ that retains only the highest-saliency coordinates.
  3. Aggregate all pruned task vectors with a curvature-aware weighted average, as in the sketch after this list.
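Putting these steps together, a compact end-to-end sketch for one weight tensor might look as follows (tensor shapes and the fixed keep-fraction rho are illustrative assumptions; a full implementation would iterate over layers and use the compressed curvature statistics described in Section 3):

```python
import torch

def merge_experts(w_base, expert_weights, second_moments, rho=0.10, eps=1e-8):
    """FFG masking followed by curvature-aware aggregation for one weight tensor."""
    precond, weighted = [], []
    for w_exp, v in zip(expert_weights, second_moments):
        delta = w_exp - w_base                        # step 1: task vector
        saliency = delta.pow(2) * v                   # step 1: saliency scores
        k = max(1, int(rho * delta.numel()))
        idx = torch.topk(saliency.flatten(), k).indices
        mask = torch.zeros_like(delta).flatten()
        mask[idx] = 1.0
        mask = mask.view_as(delta)                    # step 2: binary mask
        p = torch.sqrt(v + eps)                       # preconditioner
        precond.append(p)
        weighted.append(p * (mask * delta))
    return w_base + sum(weighted) / sum(precond)      # step 3: weighted average
```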

The practical benefits include:

  • Scalability to multi-billion parameter models via compressed curvature statistics.
  • Drastic reduction in cross-task negative transfer.
  • The ability to diagnose and visualize task localization through saliency mask patterns.
  • Empirical robustness of merged model performance across a wide range of sparsity levels and tasks.

7. Impact and Future Directions

FFG establishes a curvature-driven standard for model merging, leveraging optimizer byproducts to inform principled sparse selection of task edits. Its empirical effectiveness and storage efficiency, together with the framework’s published open-source implementation, offer a blueprint for scalable LLM composition without joint retraining or manually engineered task adapters (Mahdavinia et al., 14 Sep 2025).

A plausible implication is that further extensions could generalize FFG to cross-architecture merges, automated per-layer sparsity schedules, or settings with heterogeneous optimizer statistics. Continued investigation into the limits of curvature overlap and its regularization effects on implicit task disentanglement may also reveal deeper connections to neural network modularity and transferability.

References (1)

  1. Mahdavinia et al., 14 Sep 2025.
