
AdaMerging: Adaptive Model Fusion

Updated 4 January 2026
  • AdaMerging is a family of adaptive model merging techniques that fuses multiple fine-tuned models by dynamically optimizing coefficients to balance task-specific strengths and reduce interference.
  • It employs unsupervised entropy minimization along with extensions like AWD, AdaRank, and TADrop to refine merging processes across homogeneous and heterogeneous architectures.
  • Empirical evaluations reveal significant performance gains over traditional approaches, making AdaMerging a pivotal paradigm in data-free multi-task learning and multimodal fusion.

AdaMerging Technique refers to a family of adaptive model merging methodologies developed to systematically fuse multiple fine-tuned models into a single unified model without access to the original training datasets. Central to these frameworks is the dynamic, data-driven determination of merging coefficients—either globally, per-task, per-layer, or even per-column—to balance the task-specific strengths and mitigate inter-task interference typical in naïve linear parameter averaging. Recent variants also address heterogeneous multimodal architectures, high-rank interference, sparsification heterogeneity, and memory-aware merging, positioning AdaMerging as a versatile paradigm in data-free multi-task learning.

1. Foundations of Adaptive Model Merging

Early model merging approaches, such as Task Arithmetic, directly add task-specific "delta" vectors (differences between fine-tuned and pretrained weights) to create a multi-task model. Formally, given pretrained weights $\theta_\text{pre}$ and fine-tuned models $\theta_k$ for $K$ tasks, delta vectors $T_k = \theta_k - \theta_\text{pre}$ are summed:

$$\theta_\text{MTL} = \theta_\text{pre} + \lambda \sum_{k=1}^{K} T_k$$

where $\lambda$ is a scalar. However, naïve averaging often causes severe performance degradation due to conflicting parameter updates and lacks mechanisms to resolve inter-task conflicts (Yang et al., 2023).

AdaMerging techniques introduce learnable coefficients $\alpha_k$ (task-wise) or $\alpha_k^l$ (layer-wise) and formulate the merged parameters as:

$$\theta_\text{MTL} = \theta_\text{pre} + \sum_{k=1}^{K} \alpha_k T_k$$

or

$$\theta_\text{MTL}^l = \theta_\text{pre}^l + \sum_{k=1}^{K} \alpha_k^l T_k^l$$

Variants such as AdaMerging++ further pre-process the task vectors, using $\Phi(T_k)$ with improved sign alignment and delta trimming (as in TIES-Merging) (Yang et al., 2023).
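The two merging schemes above can be contrasted in a minimal pure-Python sketch over dictionaries of scalar stand-ins for parameter tensors; the function and variable names are illustrative, not from the papers:

```python
def task_arithmetic(theta_pre, deltas, lam):
    """Task Arithmetic: theta_MTL = theta_pre + lam * sum_k T_k,
    with one global scalar lam shared by every task and layer."""
    return {name: w + lam * sum(d[name] for d in deltas)
            for name, w in theta_pre.items()}

def adamerging_layerwise(theta_pre, deltas, alpha):
    """Layer-wise AdaMerging: each task k and layer (parameter) name
    gets its own learnable coefficient alpha[k][name]."""
    return {name: w + sum(alpha[k][name] * d[name]
                          for k, d in enumerate(deltas))
            for name, w in theta_pre.items()}
```

In practice each dictionary value would be a weight tensor rather than a scalar, and the `alpha` coefficients are the quantities optimized by the entropy objective described next.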

2. Entropy Minimization and Unsupervised Coefficient Learning

AdaMerging's central innovation is the use of unsupervised entropy minimization to select and refine merging coefficients. Instead of supervised tuning, the method minimizes the entropy of model outputs on unlabeled test samples, relying on the empirical correlation between output entropy and true classification error. The surrogate objective is:

$$\min_\alpha \sum_{k=1}^{K} \sum_{x_i \in B_k} H\big(f_{\theta_\text{MTL}(\alpha)}(x_i)\big)$$

where $B_k$ is a batch of task-$k$ samples, and $H(\cdot)$ denotes the per-sample Shannon entropy of the model's prediction. Gradients $\partial H / \partial \alpha$ are computed via backpropagation, allowing efficient optimization with standard optimizers such as Adam (Yang et al., 2023).

Ablation studies demonstrate that layer-wise AdaMerging achieves substantial gains (+11 pp over Task Arithmetic and +8.7 pp over TIES-Merging on ViT-B/32), with deeper layers often acquiring larger coefficients, reflecting their role in modeling task-specialized features (Yang et al., 2023).
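A minimal sketch of the surrogate objective, assuming softmax classifier outputs; a coarse grid search over a single shared coefficient stands in here for the paper's gradient-based Adam updates on per-task or per-layer coefficients:

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    """Per-sample Shannon entropy H(p) = -sum_c p_c log p_c."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(alpha, logits_fn, batch):
    """Surrogate objective: average output entropy over unlabeled samples,
    where logits_fn(alpha, x) plays the role of f_{theta_MTL(alpha)}(x)."""
    return sum(entropy(softmax(logits_fn(alpha, x))) for x in batch) / len(batch)

def search_alpha(logits_fn, batch, grid):
    """Pick the coefficient whose merged model is most confident overall."""
    return min(grid, key=lambda a: mean_entropy(a, logits_fn, batch))
```

With a toy `logits_fn` whose predictions sharpen as `alpha` grows, the search selects the grid value giving the lowest mean entropy; in the real method `alpha` is a vector of coefficients updated by Adam via $\partial H / \partial \alpha$.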

3. Extensions to Heterogeneous and Multimodal Model Merging

Standard AdaMerging presumes homogeneous model architectures. AdaMMS extends adaptive merging to heterogeneous multimodal LLMs (MLLMs), including vision-language transformers with asymmetric and non-overlapping parameter spaces (Du et al., 31 Mar 2025).

Key procedure:

  • Parameter Mapping: Aligns each weight tensor in source model $M_1$ to a compatible tensor in $M_2$, or leaves it unchanged when no counterpart exists.
  • Weight Merging: Interpolates task vectors via

$$\theta_\text{out} = \theta_0 + (1 - \alpha)\,\tau_1 + \alpha\,\tau_2$$

or, piecewise, $\theta_\text{out}^i = \theta_1^i$ if parameter $i$ is unmapped, and $(1-\alpha)\,\theta_1^i + \alpha\, f(\theta_1^i)$ otherwise.

  • Unsupervised Coefficient Search: Selects the $\alpha^*$ that minimizes adjacent differences in output responses, $D(\alpha)$, over unlabeled samples and a grid $\alpha \in [0, 0.6]$; the search is robust to small validation sets and insensitive to the specific difference metric (Du et al., 31 Mar 2025).
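The procedure can be sketched as follows, assuming a hypothetical `generate(alpha, sample)` callable that returns the merged model's response for a candidate coefficient; the neighbour-difference criterion is a simplified stand-in for the paper's $D(\alpha)$:

```python
def adamms_merge(theta0, tau1, tau2, alpha):
    """Interpolated merge: theta_out = theta_0 + (1 - alpha)*tau_1 + alpha*tau_2."""
    return {n: theta0[n] + (1 - alpha) * tau1[n] + alpha * tau2[n]
            for n in theta0}

def response_difference(resp_a, resp_b):
    """Fraction of samples whose responses differ (assumed difference metric)."""
    return sum(a != b for a, b in zip(resp_a, resp_b)) / len(resp_a)

def adamms_alpha_search(generate, samples, step=0.1, hi=0.6):
    """Grid-search alpha in [0, hi]: prefer the point whose responses differ
    least from its grid neighbours, i.e. the flattest region of D(alpha)."""
    n = int(round(hi / step))
    grid = [round(i * step, 2) for i in range(n + 1)]
    responses = {a: [generate(a, x) for x in samples] for a in grid}
    best, best_d = None, float("inf")
    for i, a in enumerate(grid):
        neighbours = [grid[j] for j in (i - 1, i + 1) if 0 <= j < len(grid)]
        d = sum(response_difference(responses[a], responses[nb])
                for nb in neighbours) / len(neighbours)
        if d < best_d:
            best, best_d = a, d
    return best
```

The intuition, per the bullet above, is that a good coefficient sits in a region where small perturbations of $\alpha$ barely change the merged model's outputs, which requires no labels at all.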

Experiments reveal AdaMMS achieves strong gains, e.g., +26.84 and +31.23 on SUM metrics versus non-adaptive baselines for 7B-parameter MLLMs across diverse benchmarks (Du et al., 31 Mar 2025).

4. Advanced Variants: Orthogonalization, Rank Pruning, and Sparsification

Recent works augment AdaMerging with additional modules to maximize merging efficacy:

  • Adaptive Weight Disentanglement (AWD): Theoretically justifies orthogonalization of task vectors to minimize interference ($G_i \approx k_i \sum_{j \ne i} \lambda_j \langle \tau_i, \tau_j \rangle$, which vanishes when $\langle \tau_i, \tau_j \rangle = 0$) (Xiong et al., 2024). By minimizing the average pairwise cosine similarity and controlling a redundant vector $\delta$ via an $\ell_2$-norm constraint, AWD extracts near-orthogonal task vectors $\hat{\tau}_i = \tau_i - \delta$, empirically yielding consistent accuracy gains of 1–3 points across benchmarks.
  • AdaRank: Applies adaptive rank pruning over singular value decompositions of the delta tensors. Binary masks $B_i^l \in \{0,1\}^R$ select or prune singular directions at test time, with mask optimization via entropy minimization. AdaRank outperforms fixed top-$k$ truncation, lowering multi-task interference and achieving nearly the performance of individually fine-tuned models (Lee et al., 28 Mar 2025).
  • Tensor-Wise Adaptive Drop (TADrop): Addresses intra-model heterogeneity by assigning a quantile-ratio-based sparsification rate to each tensor, preserving critical heavy-tailed parameters and aggressively pruning redundant ones. TADrop's norm-preserving scaling keeps task-vector magnitudes consistent, delivering up to +2.0 points of improvement in vision, language, and multimodal merging tasks, with gains scaling positively with the number of tasks merged (Luo et al., 8 Aug 2025).
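The sparsification-plus-rescaling idea behind TADrop can be illustrated on a single flattened tensor: drop the smallest-magnitude entries at some per-tensor rate, then rescale survivors so the tensor's $\ell_2$ norm is preserved. The fixed `drop_ratio` argument here is an assumption standing in for TADrop's quantile-derived per-tensor rates:

```python
import math

def tadrop_tensor(values, drop_ratio):
    """Zero out the smallest-magnitude entries of one task-vector tensor,
    then rescale survivors so the tensor's L2 norm is unchanged."""
    n_drop = int(len(values) * drop_ratio)
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    dropped = set(order[:n_drop])  # indices of the smallest-|v| entries
    sparse = [0.0 if i in dropped else v for i, v in enumerate(values)]
    norm_before = math.sqrt(sum(v * v for v in values))
    norm_after = math.sqrt(sum(v * v for v in sparse))
    scale = norm_before / norm_after if norm_after > 0 else 0.0
    return [v * scale for v in sparse]
```

Because heavy-tailed tensors concentrate their norm in a few large entries, the rescaling factor stays close to 1 for them, while near-uniform (redundant) tensors can absorb aggressive pruning.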

5. Practical Algorithmic Frameworks and Implementation

The AdaMerging workflow typically involves the following steps:

  1. Model Preparation: Assemble task vectors from fine-tuned models and a common initialization.
  2. (Optional) Preprocessing: Apply trimming, sign-correction (TIES), sparsification (TADrop), or orthogonalization (AWD).
  3. Coefficient Optimization: Minimize surrogate entropy (or other data-free objectives) over unlabeled samples to adapt per-task or per-layer coefficients.
  4. Fusion: Synthesize merged weights using the optimized coefficients and preprocessed task vectors.
  5. Deployment: Evaluate on target tasks for accuracy, robustness, distributional generalization.

The core algorithm is efficiently implementable via mini-batch updates and standard deep model toolkits. Integration with SVD-based decompositions (AdaRank) or tensor-wise quantile computation (TADrop) incurs minimal additional computational cost.
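The five steps above can be strung together as a thin driver; everything here (names, the uniform default coefficients) is illustrative rather than taken from any specific paper:

```python
def merge_pipeline(theta_pre, finetuned, preprocess=None, optimize_alpha=None):
    """Data-free merging workflow: build task vectors, optionally
    preprocess them, fit per-task coefficients, and fuse."""
    # 1. Task vectors T_k = theta_k - theta_pre
    deltas = [{n: th[n] - theta_pre[n] for n in theta_pre} for th in finetuned]
    # 2. Optional preprocessing (trimming/sign-correction, TADrop, AWD, ...)
    if preprocess is not None:
        deltas = [preprocess(d) for d in deltas]
    # 3. Coefficient optimization (e.g. entropy minimization on unlabeled
    #    data); falls back to uniform Task-Arithmetic-style weights.
    alphas = optimize_alpha(deltas) if optimize_alpha else [0.3] * len(deltas)
    # 4. Fusion with the optimized coefficients
    return {n: theta_pre[n] + sum(a * d[n] for a, d in zip(alphas, deltas))
            for n in theta_pre}
```

Step 5 (deployment and evaluation) happens outside this function, on the returned merged weights.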

6. Quantitative Performance and Robustness

Empirical results underscore AdaMerging’s efficacy across image classification, NLP, and multimodal benchmarks. Notable metrics include:

| Method | Vision (ViT-B/32) | Language (RoBERTa-Base) |
|---|---|---|
| Task Arithmetic | 69.1% | 67.8% |
| TIES-Merging | 72.4% | 64.7% |
| AdaMerging++ (layer-wise) | 81.1% | — |
| AWD + AdaMerging | 82.9% | — |
| AdaRank (CART + AR) | 89.2% | 74.17% |
| TADrop + EMR (8 vision tasks) | 90.7% | — |

Performance gains are sustained as the number of tasks increases, with AdaMerging variants consistently less sensitive to scaling coefficients and task count than traditional merging schemes. Layer- or tensor-wise adaptivity is critical to robust generalization and handling distribution shifts, as confirmed by test-time corruptions and unseen task ablations (Yang et al., 2023, Xiong et al., 2024, Luo et al., 8 Aug 2025, Lee et al., 28 Mar 2025).

7. Limitations, Extensions, and Future Directions

AdaMerging frameworks make certain assumptions: architectures are aligned (except AdaMMS), tasks are not drastically divergent, and representative unlabeled data are available. Limitations include incomplete mitigation of extreme cross-task performance gaps, reliance on convexity in search landscapes, and unexplored regimes in non-transformer architectures (Du et al., 31 Mar 2025).

Promising future extensions involve:

  • Automated mapping for fully heterogeneous model spaces.
  • Incorporation of non-linear synergies via higher-order interpolation.
  • Hierarchical and per-element adaptation (AdaRank variants).
  • Theoretical links between sparsification ratios (TADrop) and information-theoretic redundancy.
  • Dynamic, sample- or domain-specific schedules for merging coefficients.

These directions aim to further close the gap between data-free merging and supervised multi-task learning, scaling versatility to increasingly heterogeneous model repositories.
