Differentiable Adaptive Merging (DAM)
- DAM is a framework that adaptively merges model parameters using end-to-end differentiable, gradient-based methods for task-aware integration, applicable in multi-task learning, language model merging, and regression.
- It optimizes merging coefficients through surrogate losses like entropy minimization and KL divergence, enabling fine control over contributions from expert models while maintaining computational efficiency.
- Experimental validations in vision, NLP, and regression demonstrate DAM’s practical impact, showing improved accuracy, robustness, and reduced compute costs compared to traditional merging methods.
Differentiable Adaptive Merging (DAM) refers to a class of model composition frameworks in which the parameters or outputs of several locally trained or expert models are merged into a global model through an end-to-end differentiable pipeline, optimizing fine-grained merging coefficients via gradient-based methods. The key characteristic of DAM is its ability to learn the merging recipe in a task-aware, data-driven manner using differentiable surrogate objectives, thereby offering fine control over the contributions from source models at various granularities while maintaining computational efficiency. The term encompasses seminal approaches in multi-task learning, LLM merging, regression on heterogeneous data, and even multi-concept model immunization, all sharing the principle of adaptive, differentiable, and automatic model combination.
1. Mathematical Foundations and Variants
DAM unifies several algorithmic formulations, tailored to the specifics of the problem domain:
- Model Weight Merging: In deep networks, given $N$ source models with weights $W_i^{(l)}$ at layer $l$, DAM introduces trainable, per-column scaling matrices $\mathrm{diag}(c_i^{(l)})$, yielding merged weights $W^{(l)} = \sum_{i=1}^{N} W_i^{(l)} \, \mathrm{diag}(c_i^{(l)})$ (Gauthier-Caron et al., 2024).
- Task Vector Arithmetic: For models $\theta_k$ fine-tuned on different tasks (with $\theta_0$ as backbone), DAM learns task-wise ($\lambda_k$) or layer-wise ($\lambda_k^{l}$) scaling for each task vector $\tau_k = \theta_k - \theta_0$, composing $\theta_{\mathrm{merged}} = \theta_0 + \sum_k \lambda_k \tau_k$ (Yang et al., 2023).
- Partition of Unity (PU) for Regression: In regression over heterogeneous domains, DAM denotes constructions where local predictors $f_j$ on regions $\Omega_j$ are blended globally by non-negative, locally supported partition-of-unity weights $\phi_j$, ensuring $\sum_j \phi_j(x) = 1$ and global differentiability: $f(x) = \sum_j \phi_j(x) \, f_j(x)$ (Han et al., 2023).
- Differentiable Constraint-based Merging: For multi-concept model immunization, DAM defines a constrained quadratic problem (often with closed-form solution) over a subset of parameters (e.g., attention projections), optimizing to align target concept embeddings while regularizing benign behavior, with gradients propagated through the merge layer (Zheng et al., 2024).
The granularity of the merging coefficients is central, ranging from a global scalar (Task Arithmetic) to per-task, per-layer, per-column (in DNNs), or even per-feature scaling, depending on the domain and computational tradeoffs.
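As a concrete illustration of the merge rules above, the following NumPy sketch composes a backbone with two task vectors using task-wise scalars, plus a per-column variant; all shapes, seeds, and coefficient values are toy choices for illustration, not taken from any published implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Backbone weights and two task-specific fine-tunes (toy 4x4 layer).
theta_0 = rng.standard_normal((4, 4))
theta_1 = theta_0 + 0.1 * rng.standard_normal((4, 4))  # fine-tune on task 1
theta_2 = theta_0 + 0.1 * rng.standard_normal((4, 4))  # fine-tune on task 2

# Task vectors: differences from the backbone.
tau = [theta_1 - theta_0, theta_2 - theta_0]

# Task-wise scalar coefficients (learned by DAM in practice; fixed here).
lam = np.array([0.4, 0.6])

# Merged weights: theta = theta_0 + sum_k lam_k * tau_k.
theta_merged = theta_0 + sum(l * t for l, t in zip(lam, tau))

# Per-column variant: one coefficient per column per source model,
# broadcast across rows (equivalent to right-multiplying by diag(c_k)).
c = rng.uniform(0.0, 1.0, size=(2, 4))
theta_col = theta_0 + sum(t * c_k for t, c_k in zip(tau, c))
```

The per-column form subsumes the scalar form (set every column coefficient of a task to the same value), which is why finer granularities strictly enlarge the search space at the cost of more coefficients to fit.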
2. Optimization Objectives and Surrogate Losses
DAM algorithms employ differentiable objectives to adaptively learn merging coefficients, either with or without labeled data:
- Entropy Minimization: In unsupervised multi-task contexts, DAM minimizes the mean Shannon entropy of predictions on unlabeled, mixed-task batches, using it as a proxy for confidence and task alignment: $\min_{\lambda} \frac{1}{|B|} \sum_{x \in B} H\big(p_{\theta(\lambda)}(\cdot \mid x)\big)$, where $H(p) = -\sum_c p_c \log p_c$ (Yang et al., 2023).
- Kullback-Leibler Divergence (KL): In LLM merging, DAM often minimizes the average KL divergence between each source model's output distribution and that of the merged model on a corresponding representative dataset: $\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{x \sim D_i}\big[\mathrm{KL}\big(p_i(\cdot \mid x) \,\|\, p_{\mathrm{merged}}(\cdot \mid x)\big)\big]$ (Gauthier-Caron et al., 2024).
- Auxiliary Penalties: Several implementations add cosine-similarity penalties across merge-coefficient vectors to encourage orthogonality, together with $L_1$/$L_2$ regularization for numerical stability and coefficient sparsity (Gauthier-Caron et al., 2024). Hard constraints on the coefficients are typically avoided in favor of such regularization.
- Constrained Optimization: For model immunization and certain adaptive merging regimes, the merge step solves a differentiable constrained quadratic program, often regularizing the merged weights for benign tasks while enforcing concept-specific orthogonality (Zheng et al., 2024).
Optimization proceeds via gradient descent (often Adam or SGD) over the vector of merging coefficients, with gradients efficiently propagated through both the merge operation and, when applicable, the source models.
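The entropy-minimization objective can be made concrete with a toy sketch: a linear classifier's weights are merged from two task vectors, and the merging coefficients are updated by gradient descent on the mean prediction entropy of unlabeled inputs. Finite differences stand in for the automatic differentiation used in practice, and all shapes and hyperparameters are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy_loss(lam, theta_0, tau, x):
    """Mean Shannon entropy of the merged model's predictions on unlabeled x."""
    theta = theta_0 + sum(l * t for l, t in zip(lam, tau))
    p = softmax(x @ theta)
    return -(p * np.log(p + 1e-12)).sum(axis=-1).mean()

rng = np.random.default_rng(1)
theta_0 = rng.standard_normal((8, 3))                      # backbone classifier
tau = [0.1 * rng.standard_normal((8, 3)) for _ in range(2)]  # two task vectors
x = rng.standard_normal((32, 8))                           # unlabeled mixed batch

# Gradient descent on the merging coefficients only (backbone stays frozen);
# central finite differences approximate the autograd gradient.
lam, lr, eps = np.ones(2), 0.2, 1e-5
for _ in range(100):
    grad = np.array([
        (entropy_loss(lam + eps * np.eye(2)[k], theta_0, tau, x)
         - entropy_loss(lam - eps * np.eye(2)[k], theta_0, tau, x)) / (2 * eps)
        for k in range(2)
    ])
    lam = lam - lr * grad

final = entropy_loss(lam, theta_0, tau, x)
```

Because only the coefficient vector is optimized while all model weights stay fixed, each step costs a handful of forward passes, which is the source of DAM's efficiency relative to re-training.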
3. Algorithmic and Implementation Procedures
The general DAM workflow can be summarized as follows (abstracting over application settings):
- Initialization: Set the merging coefficients (scalars, vectors, or matrices) to an initial value (commonly unity).
- Data Preparation: For data-driven variants, assemble a small representative dataset per source model or task. For data-free variants, use unlabeled samples or model predictions.
- Iterative Optimization:
  - Generate merged weights via the parameterized merge rule.
  - Compute the surrogate loss (entropy, KL, or other composite objectives) on appropriate samples.
  - Backpropagate gradients through both the merge map and, if needed, the network itself, updating the merging coefficients.
  - Optionally, apply coefficient regularization or explicit projection.
- Termination: Stop when a loss convergence criterion is met (e.g., $|\mathcal{L}_t - \mathcal{L}_{t-1}| < \epsilon$).
- Deployment: Use the merged model parameters for downstream inference or adaptation.
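A minimal end-to-end sketch of this workflow, assuming a KL-style surrogate over two toy linear "source models" and finite-difference gradients in place of autograd (the models, data, and hyperparameters are invented for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    """Mean KL(p || q) over rows of categorical distributions."""
    return (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1).mean()

rng = np.random.default_rng(2)
W = [rng.standard_normal((6, 4)) for _ in range(2)]   # two source "models"
X = [rng.standard_normal((64, 6)) for _ in range(2)]  # one small dataset each

def merged(c):
    # Parameterized merge rule: coefficient-weighted combination of weights.
    return c[0] * W[0] + c[1] * W[1]

def loss(c):
    # Average KL from each source model's outputs to the merged model's,
    # each evaluated on that source's representative data.
    return np.mean([kl(softmax(X[i] @ W[i]), softmax(X[i] @ merged(c)))
                    for i in range(2)])

c, lr, eps = np.ones(2), 0.02, 1e-5   # initialize coefficients to unity
prev = np.inf
for _ in range(500):
    g = np.array([
        (loss(c + eps * np.eye(2)[j]) - loss(c - eps * np.eye(2)[j])) / (2 * eps)
        for j in range(2)
    ])
    c = c - lr * g                     # update merging coefficients
    cur = loss(c)
    if abs(prev - cur) < 1e-8:         # convergence criterion
        break
    prev = cur

final_loss = loss(c)                   # merged model ready for deployment
```

The loop mirrors the listed steps: coefficients start at unity, the merge rule is re-applied every iteration, and optimization halts once the loss change falls below a tolerance.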
DAM’s computational cost is dominated by forward and backward passes for small mini-batches, and, due to freezing base model parameters, remains orders-of-magnitude lower than full evolutionary search or re-training (Gauthier-Caron et al., 2024).
4. Experimental Validation and Results
DAM frameworks have been validated across vision, NLP, regression, and diffusion models:
- Vision Multi-task Merging (Yang et al., 2023, Wei et al., 2 Jan 2025): On CLIP-ViT backbones, layer-wise AdaMerging achieves 80–90% mean test accuracy (ViT-B/32 and ViT-L/14), yielding 7–11 percentage point gains over fixed-coefficient merging. Unsupervised entropy minimization enables strong generalization to unobserved tasks (+8.3% on MNIST/EuroSAT) and enhanced robustness under common corruptions (e.g., +11.2% under motion blur).
- LLM Merging (Gauthier-Caron et al., 2024): DAM, with per-column scaling and KL-based loss, matches or surpasses evolutionary merging on benchmark suites (e.g., JP LM, Okapi, SQL-Eval) at a tenth of the compute cost. Model Soups remain competitive when source models are highly similar.
- Local Regression with PU Stitching (Han et al., 2023): DAM significantly outperforms KRR, SVM, XGBoost, and neural networks on synthetic, PDE, environmental, and combustion benchmarks. Its theoretically accelerated convergence stems from polynomial reproduction in the local models and the global differentiability conferred by partition-of-unity stitching.
- Model Immunization (Zheng et al., 2024): In diffusion models, DAM-based MIMA yields larger similarity and adaptation gap ratios on multi-concept adversarial immunization (e.g., +3.75–6.36% mean similarity gap over joint training baselines), while preserving benign functionality.
In all evaluations, DAM closes much of the gap to joint multi-task training or independently fine-tuned experts, while requiring no access to the original training data.
5. Comparison with Related Merging Techniques
DAM stands in contrast to, and interpolates between, multiple prior merging approaches:
| Method | Data Requirement | Granularity | Optimization |
|---|---|---|---|
| Model Soups | None | Global/Layers | Simple Average |
| TIES-Merging | None | Layer/Feature | Sign/Mag Pruning |
| AdaMerging | Unlabeled Test Data | Task/Layer | Entropy Minimization |
| Evolutionary Merging | Target Datasets | Weight/Layer/Feature | Black-Box Evolution |
| DAM | Logits or Small Datasets | Per-Column, Layer | Gradient-Based |
| PU-KRR (Regression) | Full Training Set | Local Regions | Block Solve, Smooth |
DAM offers end-to-end differentiability and avoids black-box or manual hyperparameter tuning, affording fine-grained control and robustness, but does require access to model outputs or small labeled/unlabeled datasets. When model similarity is high, data-free methods (simple averaging) remain competitive (Gauthier-Caron et al., 2024).
6. Applications, Extensions, and Limitations
DAM has demonstrated impact in diverse domains:
- Multi-task Learning: Merging expert models without original data, achieving state-of-the-art multi-task performance (Yang et al., 2023).
- LLM Integration: Efficient, robust merging across domains/languages with orders-of-magnitude reduced compute (Gauthier-Caron et al., 2024).
- Adaptive/Personalized Regression: Handling data with variable local complexity and density (Han et al., 2023).
- Model Immunization: Preventing adaptation to harmful concepts via adversarially optimized, differentiable merge layers (Zheng et al., 2024).
Limitations include possible ill-conditioning in constrained merging for highly collinear concepts, the need for small datasets or logits (except in entirely unsupervised variants), and computational or memory costs that scale with the merge coefficient space or target model size in some regimes. In model immunization, strong guarantees require careful regularization and choice of constraints (Zheng et al., 2024).
A plausible implication is that DAM frameworks will continue to provide a principled backbone for model composition, task transfer, and robustification in future multi-purpose and defense-oriented AI systems, particularly as diverse expert model repositories proliferate.