SAMerging Method: Advanced Merging Strategy
- SAMerging Method is a framework that merges multiple expert models and structured summaries to preserve key features and boost multitask performance.
- It employs PAC-Bayes theory and multi-teacher distillation to optimize merge coefficients without requiring joint retraining.
- Practical implementations include layerwise merging, sensitivity-guided balancing, and zero-order optimization, yielding notable gains in vision, language, and segmentation tasks.
The SAMerging Method encompasses a family of frameworks centered on the principled combination—“merging”—of multiple complex objects, often neural network models or structured summaries, such that the resulting merged object preserves salient properties or enables enhanced performance across heterogeneous tasks. Within recent literature, the term most frequently denotes advanced model-merging strategies that blend the parameter spaces, predictions, or summary statistics of expert models, enabling consolidated multitask inference or robust analytics without retraining. The following explication synthesizes the core methodologies, theoretical underpinnings, and empirical results for SAMerging and related paradigms as presented in recent work.
1. Post-hoc Model Merging: Problem Setting and Motivation
Post-hoc model merging seeks to consolidate a set of independently fine-tuned expert models—each trained on different (and often inaccessible) data domains—into a single neural network that maintains or even improves performance on all constituent tasks, without joint training or direct multitask data (Dalili et al., 24 Dec 2025). This approach is necessitated by scenarios in which retraining is computationally prohibitive or data privacy precludes re-accessing original corpora. A canonical instantiation expresses the merged model parameters as a convex combination of expert deviations from a shared base:
$$\theta_{\text{merged}} \;=\; \theta_0 + \sum_{t=1}^{T} \lambda_t\,\big(\theta_t - \theta_0\big),$$
where $\theta_0$ is the pretrained backbone, $\theta_t$ are the expert weights, and $\lambda_t$ are task coefficients subject to $\lambda_t \ge 0$ and $\sum_t \lambda_t = 1$.
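A minimal sketch of this parameterization in PyTorch, assuming the backbone and experts share an architecture and are available as state dicts; the helper name `merge_state_dicts` is illustrative, not from the cited work:

```python
import torch

def merge_state_dicts(base, experts, coeffs):
    """Merge experts as theta_0 + sum_t lambda_t * (theta_t - theta_0).

    base:    dict of parameter tensors for the pretrained backbone (theta_0)
    experts: list of dicts with the same keys (theta_t)
    coeffs:  list of task coefficients (lambda_t), one per expert
    """
    assert len(experts) == len(coeffs)
    merged = {}
    for name, theta_0 in base.items():
        delta = torch.zeros_like(theta_0)
        for lam, expert in zip(coeffs, experts):
            delta += lam * (expert[name] - theta_0)  # weighted task vector
        merged[name] = theta_0 + delta
    return merged

# Usage: model.load_state_dict(merge_state_dicts(base_sd, [sd_a, sd_b], [0.5, 0.5]))
```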
Challenges include:
- Brittleness to coefficient scaling: Small changes in $\lambda_t$ can cause sharp degradation in multi-task accuracy.
- Lack of ground-truth objective: The correct weighting cannot be deduced without explicit access to joint data, leading naive approaches to underperform.
SAMerging addresses these by introducing both theoretically founded criteria for optimality and data-efficient procedures for selecting merge coefficients (Dalili et al., 24 Dec 2025).
2. Theoretical Foundations: PAC-Bayes Flatness and Knowledge Distillation
SAMerging pioneers a generalization bound tailored for model merging by unifying PAC-Bayes theory with flatness-aware risk control (Dalili et al., 24 Dec 2025). Each expert $t$ is associated with a Gaussian “posterior” $Q_t$ centered at its weights $\theta_t$; a mixture $Q = \sum_t \lambda_t Q_t$ is constructed, and the expected risk of $Q$ decomposes into $\lambda$-weighted per-expert risks plus cross-task heterogeneity terms $\varepsilon_{t,s}$,
where $\varepsilon_{t,s}$ measures how poorly expert $t$ generalizes to a non-native task $s$.
A flatness-aware PAC-Bayes generalization bound establishes that generalization error is minimized by:
- Seeking flat minima (small gradient-norm penalty terms)
- Aligning the analysis mixture weights $\lambda_t$ with the evaluation task weights
- Minimizing kernel-weighted dispersion between merged and expert solutions
This analysis formally motivates learning the coefficients $\lambda$ via knowledge distillation: minimizing the student–teacher Kullback-Leibler divergence over a small, unlabeled calibration set tightens the provable upper bound on excess risk.
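For orientation, the classical PAC-Bayes template that such flatness-aware bounds refine takes the following standard form (shown here generically; it is not the paper's exact statement, whose constants and penalty terms differ):

```latex
% Generic PAC-Bayes template (Maurer/McAllester-style), not the paper's exact bound:
% with probability at least 1 - \delta over an i.i.d. sample of size n,
% simultaneously for all posteriors Q, given a data-independent prior P,
R(Q) \;\le\; \widehat{R}(Q) \;+\; \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
```

In the merging setting, $Q$ is the $\lambda$-weighted mixture of expert posteriors, so the choice of $\lambda$ enters both the empirical term and the complexity term; this is the sense in which distillation over a small calibration set can tighten the bound.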
3. SAMerging Algorithm: Layerwise, Flatness-Aware Merge via SAM Optimization
SAMerging’s operational pipeline (Dalili et al., 24 Dec 2025):
- Merge parameterization: For each layer $\ell$, build the merged weights $\theta^{(\ell)}_{\text{merged}} = \theta^{(\ell)}_0 + \sum_t \lambda^{(\ell)}_t\,\big(\theta^{(\ell)}_t - \theta^{(\ell)}_0\big)$.
- Coefficient optimization (multi-teacher distillation): For a calibration batch, obtain teacher soft labels from each expert and student predictions from the merged model, then minimize the summed student–teacher Kullback-Leibler divergence with respect to $\lambda$.
- Flatness promotion (Sharpness-Aware Minimization, SAM): Update $\lambda$ not on the direct gradient of the distillation loss, but by first maximizing the loss w.r.t. a perturbation in parameter space (within an $\ell_2$-ball of radius $\rho$), then descending on the worst-case loss surface.
Typical settings use small calibration batches ($16$–$32$ samples per task) and often tie coefficients across layers for computational efficiency; a sketch of this loop is given below.
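The following sketch assumes the merged model's logits are a differentiable function of $\lambda$; the function names, KL direction, and temperature are illustrative choices, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits_list, T=2.0):
    """Multi-teacher distillation: sum of KL(teacher || student) at temperature T."""
    log_q = F.log_softmax(student_logits / T, dim=-1)
    loss = 0.0
    for teacher_logits in teacher_logits_list:
        p = F.softmax(teacher_logits / T, dim=-1)
        loss = loss + F.kl_div(log_q, p, reduction="batchmean")
    return loss

def sam_step(lmbda, loss_fn, lr=0.05, rho=0.05):
    """One sharpness-aware update of the merge coefficients `lmbda`.

    loss_fn(lmbda) must re-merge the weights from `lmbda`, run the calibration
    batch, and return the distillation loss as a differentiable function of lmbda.
    """
    # 1) Ascent: move lmbda to the worst case inside an L2 ball of radius rho.
    grad, = torch.autograd.grad(loss_fn(lmbda), lmbda)
    eps = rho * grad / (grad.norm() + 1e-12)
    # 2) Descent: take the gradient at the perturbed point, apply it at the original point.
    loss_adv = loss_fn(lmbda + eps)
    grad_adv, = torch.autograd.grad(loss_adv, lmbda)
    with torch.no_grad():
        lmbda -= lr * grad_adv
    return loss_adv.item()
```

Here `lmbda` is a leaf tensor with `requires_grad=True`; in practice `loss_fn` would rebuild the merged layers from it (as in the earlier merge sketch), run the $16$–$32$-sample calibration batch through the merged model and each expert, and return `distill_loss`.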
4. Extensions: Sensitivity-Guided, Self-Enhanced, and Zero-Order Variants
Sens-Merging: Sensitivity-Guided Balancing
Sens-Merging introduces per-layer, per-task scaling coefficients obtained by combining parameter sensitivity within tasks and cross-task transferability (Liu et al., 18 Feb 2025). For each expert $t$ and layer $\ell$:
- Task sensitivity: reflects the in-layer importance of the expert's parameters (gradient- or ablation-based).
- Cross-task scaling: measures the average similarity of the model's outputs to those of the other experts.
- Optimal coefficients: derived by passing the combined sensitivity and transferability scores through a temperature-controlled softmax over experts.
This approach improves performance on multitask code generation and general language-understanding benchmarks, surpassing uniform-coefficient approaches; a sketch of the coefficient computation follows.
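The schematic below forms per-layer coefficients from these two signals; the product-then-softmax combination is an illustrative choice and not necessarily the paper's exact formula:

```python
import torch

def sens_merging_coeffs(sensitivity, transferability, temperature=1.0):
    """Per-layer merge coefficients from task sensitivity and cross-task transferability.

    sensitivity, transferability: tensors of shape [num_experts, num_layers]
    Returns coefficients of the same shape; for each layer, the coefficients
    over experts sum to 1 via a temperature-controlled softmax.
    """
    scores = sensitivity * transferability              # combine the two signals (illustrative)
    return torch.softmax(scores / temperature, dim=0)   # softmax over experts, per layer

# Example: 3 experts, 12 layers
coeffs = sens_merging_coeffs(torch.rand(3, 12), torch.rand(3, 12), temperature=0.5)
```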
SE-Merging: Dynamic, Inference-Time Auto-Adaptive Merge
SE-Merging dynamically adapts merge weights per sample using deep-layer representations (Chen et al., 22 Jun 2025). For an input $x$, at a chosen representation layer $\ell$, construct
- the representation of $x$ under the merged model, and
- the corresponding representation of $x$ under each expert.
Compute the distances between the merged and expert representations, map them to normalized similarity scores, then rescale the coefficients via a softmax to enhance the most “expert-aligned” direction on a per-input basis. This enables inference-time dynamic merging with no extra training; a per-sample sketch follows.
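The sketch below assumes layer-$\ell$ representations from the merged model and each expert have already been extracted; the Euclidean distance and the softmax rescaling are illustrative choices:

```python
import torch

def se_merging_weights(h_merged, expert_reps, temperature=1.0):
    """Per-sample merge weights from representation distances at one layer.

    h_merged:    [d] representation of the input under the merged model
    expert_reps: list of [d] representations of the same input under each expert
    Returns one weight per expert; smaller distance means larger weight.
    """
    dists = torch.stack([torch.norm(h_merged - h_t) for h_t in expert_reps])
    sims = -dists / (dists.max() + 1e-12)          # normalized similarity: closer -> higher
    return torch.softmax(sims / temperature, dim=0)

# The resulting weights rescale the static coefficients lambda_t for this input only.
```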
MedSAMix: Zero-Order Search for Domain-Balanced Segmentation
In the medical segmentation regime, MedSAMix uses layer-wise convex interpolation between a generalist (e.g., SAM) and a specialist (MedSAM), searching for optimal coefficients via zero-order stochastic optimization (e.g., SMAC with random forest surrogate and EI acquisition) (Yang et al., 14 Aug 2025). Both single-task and multi-objective (Pareto) formulations are addressed, with the objective being Dice or cross-entropy loss on small calibration sets. This method yields consistent gains across 25 clinical segmentation tasks, mitigating overfitting and domain bias.
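A sketch of the layer-wise interpolation together with a plain random search standing in for the zero-order optimizer; MedSAMix itself uses a surrogate-based optimizer (SMAC), and the helper names here are illustrative:

```python
import random

def interpolate_layers(generalist, specialist, alphas):
    """Layer-wise convex interpolation: (1 - alpha_l) * generalist_l + alpha_l * specialist_l."""
    return {name: (1.0 - alphas[name]) * g + alphas[name] * specialist[name]
            for name, g in generalist.items()}

def zero_order_search(objective, layer_names, n_trials=50, seed=0):
    """Zero-order search over per-layer coefficients in [0, 1].

    objective(alphas) should merge the models, evaluate the small calibration set
    (e.g., Dice or cross-entropy loss), and return a scalar; lower is better.
    A surrogate-based optimizer such as SMAC would replace this loop in practice.
    """
    rng = random.Random(seed)
    best_alphas, best_loss = None, float("inf")
    for _ in range(n_trials):
        alphas = {name: rng.random() for name in layer_names}
        loss = objective(alphas)
        if loss < best_loss:
            best_alphas, best_loss = alphas, loss
    return best_alphas, best_loss
```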
5. SAMerging in Structured Data Summarization
Beyond deep models, the “SAMerging” framework encompasses “exactly mergeable summaries” for scalable analytics (Batagelj, 2023). Here, summaries (e.g., running statistics, histograms, top-$k$ lists) of disjoint data blocks can be recursively combined via a deterministic, associative operator $\oplus$ such that $S(A \cup B) = S(A) \oplus S(B)$.
This foundation enables lossless, one-pass, distributed statistics under strict space constraints, supporting streaming, OLAP, and parallel analytics.
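A minimal sketch of an exactly mergeable summary over numeric data (count, sum, sum of squares, extrema) with an associative merge operator; the class name is illustrative:

```python
import math
from dataclasses import dataclass

@dataclass
class Summary:
    """Exactly mergeable running summary: count, sum, sum of squares, min, max."""
    n: int = 0
    s: float = 0.0
    s2: float = 0.0
    lo: float = math.inf
    hi: float = -math.inf

    @classmethod
    def of(cls, values):
        out = cls()
        for x in values:
            out.n += 1
            out.s += x
            out.s2 += x * x
            out.lo = min(out.lo, x)
            out.hi = max(out.hi, x)
        return out

    def merge(self, other):
        """Associative, deterministic combine: Summary(A) ⊕ Summary(B) == Summary(A ∪ B)."""
        return Summary(self.n + other.n, self.s + other.s, self.s2 + other.s2,
                       min(self.lo, other.lo), max(self.hi, other.hi))

# One-pass, distributed use: summarize disjoint blocks independently, then merge.
a, b = Summary.of([1.0, 2.0]), Summary.of([3.0, 4.0, 5.0])
assert a.merge(b) == Summary.of([1.0, 2.0, 3.0, 4.0, 5.0])
```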
6. Empirical Results
Experiments consistently demonstrate the advantage of theoretically grounded merging strategies over naive parameter averaging:
| Method | Vision (ViT-B/32, 8 tasks) | Language (GPT-2, 7 tasks) | Code Generation (LLaMA2-7B, MBPP@1) |
|---|---|---|---|
| Best static baseline | 81.1% (AdaMerging++) | ~70% | 13.5% |
| SAMerging | 84.96% | 76.86% | 33.1% |
| Sens-Merging+TA | — | — | 34.78% |
Ablations corroborate that incorporating sensitivity, task-transfer, or per-sample representation matching yields significant performance gains, especially in highly specialized or out-of-distribution settings (Dalili et al., 24 Dec 2025, Liu et al., 18 Feb 2025, Chen et al., 22 Jun 2025).
7. Practical Considerations and Limitations
- Data efficiency: Few-shot unlabeled calibration sets suffice for robust coefficient optimization (Dalili et al., 24 Dec 2025).
- Scalability: Methods are scalable to large models (ViT, LLaMA2, Mistral), but zero-order search remains computationally intensive for high-dimensional layerwise merging (Yang et al., 14 Aug 2025).
- Extensibility: Some strategies are plug-and-play for new merging backbones (SE-Merging, Sens-Merging), whereas others require dedicated calibration data or model-specific variants.
- Theory gaps: Both weight disentanglement and the representation auto-adaptation hypotheses remain open for rigorous characterization.
- Limitations: Techniques do not yet fully address merging of heterogeneous architectures, generative models, or the optimal selection of representation layers.
8. Conclusion
SAMerging refers to a spectrum of post-hoc merging methods underpinned by rigorous theory and high empirical efficacy. By replacing heuristic or manual coefficient choices with data-driven, flatness-promoting, and sensitivity-aware procedures, these frameworks enable state-of-the-art multitask performance in neural networks and lossless distributed analytics in structured data contexts. The central mechanisms—PAC-Bayes generalization, multi-teacher distillation, layerwise and per-sample adaptation, and zero-order optimization—collectively define the modern landscape of advanced model and summary merging for scalable artificial intelligence (Dalili et al., 24 Dec 2025, Liu et al., 18 Feb 2025, Chen et al., 22 Jun 2025, Yang et al., 14 Aug 2025, Batagelj, 2023).