
Differentiable Adaptive Merging (DAM)

Updated 7 March 2026
  • DAM is a framework that adaptively merges model parameters using end-to-end differentiable, gradient-based methods for task-aware integration, applicable in multi-task learning, language model merging, and regression.
  • It optimizes merging coefficients through surrogate losses like entropy minimization and KL divergence, enabling fine control over contributions from expert models while maintaining computational efficiency.
  • Experimental validations in vision, NLP, and regression demonstrate DAM’s practical impact, showing improved accuracy, robustness, and reduced compute costs compared to traditional merging methods.

Differentiable Adaptive Merging (DAM) refers to a class of model composition frameworks in which the parameters or outputs of several locally trained or expert models are merged into a global model through an end-to-end differentiable pipeline, optimizing fine-grained merging coefficients via gradient-based methods. The key characteristic of DAM is its ability to learn the merging recipe in a task-aware, data-driven manner using differentiable surrogate objectives, thereby offering fine control over the contributions from source models at various granularities while maintaining computational efficiency. The term encompasses seminal approaches in multi-task learning, LLM merging, regression on heterogeneous data, and even multi-concept model immunization, sharing the principle of adaptive, differentiable, and automatic model combination.

1. Mathematical Foundations and Variants

DAM unifies several algorithmic formulations, tailored to the specifics of the problem domain:

  • Model Weight Merging: In deep networks, given $N$ source models with weights $W_i^l \in \mathbb{R}^{M \times d}$ at layer $l$, DAM introduces trainable, per-column scaling matrices $C_i^l = \mathrm{diag}(c_{i1}^l, \ldots, c_{id}^l)$, yielding merged weights $W^l = \sum_{i=1}^N W_i^l C_i^l$ (Gauthier-Caron et al., 2024).
  • Task Vector Arithmetic: For $K$ fine-tuned models parametrized by $\{\theta_k\}$ on different tasks (with $\theta_\text{pre}$ as backbone), DAM learns task-wise ($\lambda_k$) or layer-wise ($\lambda_k^l$) scaling for each task vector $T_k^l = \theta_k^l - \theta_\text{pre}^l$, composing $\theta_{MTL}^l = \theta_\text{pre}^l + \sum_{k=1}^K \lambda_k^l T_k^l$ (Yang et al., 2023).
  • Partition of Unity (PU) for Regression: In regression over heterogeneous domains, DAM denotes constructions where local predictors $\{f_j(x)\}$ in regions $B_j$ are blended globally by non-negative, locally supported partition-of-unity weights $w_j(x)$, ensuring $\sum_j w_j(x) = 1$ and global differentiability: $\hat{f}(x) = \sum_{j=1}^m w_j(x) f_j(x)$ (Han et al., 2023).
  • Differentiable Constraint-based Merging: For multi-concept model immunization, DAM defines a constrained quadratic problem (often with closed-form solution) over a subset of parameters (e.g., attention projections), optimizing to align target concept embeddings while regularizing benign behavior, with gradients propagated through the merge layer (Zheng et al., 2024).
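
The model-weight-merging variant above can be sketched in a few lines of NumPy (the array shapes, the two-model setup, and the 1/N initialization are illustrative assumptions, not the exact configuration of the cited work):

```python
import numpy as np

def merge_layer(source_weights, column_coeffs):
    """Merge N source weight matrices of shape (M, d) into one.

    Each source i contributes W_i @ diag(c_i), where c_i is a length-d
    vector of trainable per-column scaling coefficients (the C_i^l above).
    """
    merged = np.zeros_like(source_weights[0])
    for W_i, c_i in zip(source_weights, column_coeffs):
        merged += W_i * c_i  # row-wise broadcast equals W_i @ diag(c_i)
    return merged

# Two toy "expert" layers with M = 3 rows and d = 4 columns each.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))

# A common starting point: every coefficient initialized to 1/N.
coeffs = [np.full(4, 0.5), np.full(4, 0.5)]
W_merged = merge_layer([W1, W2], coeffs)
```

During optimization only the coefficient vectors are updated; the source weights stay frozen, which is what keeps the coefficient space small relative to the model size.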

The granularity of merging coefficients is central: from a global scalar (Task Arithmetic), to per-task, per-layer, per-column (in DNNs), or even per-feature, depending on the domain and computational tradeoffs.
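
The partition-of-unity variant admits an equally compact sketch: local predictors are blended by non-negative weights that sum to one at every point. The Gaussian bump weights below are an illustrative choice for smooth, locally supported weights, not the specific construction of Han et al. (2023):

```python
import numpy as np

def pu_blend(x, centers, local_predictors, width=1.0):
    """Blend local predictors f_j into one globally smooth predictor.

    Gaussian bumps around each region center are normalized so the
    weights w_j(x) form a partition of unity: sum_j w_j(x) = 1.
    """
    bumps = np.array([np.exp(-((x - c) / width) ** 2) for c in centers])
    w = bumps / bumps.sum(axis=0)          # normalize -> partition of unity
    f_vals = np.array([f(x) for f in local_predictors])
    return (w * f_vals).sum(axis=0)        # f_hat(x) = sum_j w_j(x) f_j(x)

# Two local predictors, each accurate near its own region center.
preds = [lambda x: 2 * x, lambda x: x + 5]
x = np.linspace(0, 5, 11)
y = pu_blend(x, centers=[0.0, 5.0], local_predictors=preds)
```

Near each center the blend reduces to the corresponding local predictor, while the transition between regions stays differentiable because the weights are.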

2. Optimization Objectives and Surrogate Losses

DAM algorithms employ differentiable objectives to adaptively learn merging coefficients, either with or without labeled data:

  • Entropy Minimization: In unsupervised multi-task contexts, DAM minimizes the mean Shannon entropy over predictions on unlabeled, mixed-task batches, using it as a proxy for confidence and task-alignment: $L(\lambda) = \sum_{k=1}^K \sum_{x_i \in B_k} H\big(f_{\theta_{MTL}(\lambda)}(x_i)\big)$ (Yang et al., 2023).
  • Kullback-Leibler Divergence (KL): In LLM merging, DAM often minimizes the average KL divergence between the merged model’s output and those from the source models on corresponding representative datasets: $L_{\text{KL}} = \sum_{i=1}^N \mathrm{KL}\big(p_{\text{merged}}(D_i) \,\Vert\, p_i(D_i)\big)$ (Gauthier-Caron et al., 2024).
  • Auxiliary Penalties: Several implementations use cosine-similarity penalties across merge-coefficient vectors to encourage orthogonality, together with $L_1$/$L_2$ regularization for numerical stability and coefficient sparsity (Gauthier-Caron et al., 2024). Typically no hard constraints are imposed on the coefficients beyond this regularization.
  • Constrained Optimization: For model immunization and certain adaptive merging regimes, the merge step solves a differentiable constrained quadratic program, often regularizing the merged weights for benign tasks while enforcing concept-specific orthogonality (Zheng et al., 2024).
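
A minimal sketch of the KL surrogate on batches of logits (the two-list interface and shapes are assumptions for illustration; real implementations evaluate full language-model outputs over representative datasets):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_surrogate(merged_logits, source_logits):
    """Average KL(p_merged || p_i), one term per source model's dataset.

    merged_logits[i]: merged model's logits on dataset D_i, shape (B, V)
    source_logits[i]: source model i's logits on its own dataset D_i
    """
    total = 0.0
    for z_m, z_i in zip(merged_logits, source_logits):
        p, q = softmax(z_m), softmax(z_i)
        total += np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1))
    return total / len(merged_logits)
```

The loss is zero exactly when the merged model reproduces each source's distribution on that source's data, which is the behavior the merge is trying to preserve.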

Optimization proceeds via gradient descent (often Adam or SGD) over the vector of merging coefficients, with gradients efficiently propagated through both the merge operation and, when applicable, the source models.

3. Algorithmic and Implementation Procedures

The general DAM workflow can be summarized as follows (abstracting over application settings):

  1. Initialization: Set the merging coefficients (scalars, vectors, or matrices) to an initial value (commonly unity).
  2. Data Preparation: For data-driven variants, assemble a small representative dataset per source model or task. For data-free variants, use unlabeled samples or model predictions.
  3. Iterative Optimization:
    • Generate merged weights via the parameterized merge rule.
    • Compute the surrogate loss (entropy, KL, or other composite objectives) on appropriate samples.
    • Backpropagate gradients through both the merge map and, if needed, the network itself, updating the merging coefficients.
    • Optionally, apply coefficient regularization or explicit projection.
  4. Termination: Stop when loss convergence criteria (e.g., $|L_{t+1} - L_t| < \epsilon$) are met.
  5. Deployment: Use the merged model parameters for downstream inference or adaptation.
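
The workflow above can be sketched end to end for the task-vector variant. Finite-difference gradients stand in for backpropagation so the example stays dependency-free; the toy softmax classifier, data, and step counts are assumptions for illustration only:

```python
import numpy as np

def entropy_loss(lmbda, theta_pre, task_vectors, X):
    """Mean Shannon entropy of the merged model's softmax predictions."""
    theta = theta_pre + sum(l * T for l, T in zip(lmbda, task_vectors))
    logits = X @ theta
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.mean(np.sum(p * np.log(p + 1e-12), axis=1))

rng = np.random.default_rng(0)
theta_pre = rng.normal(size=(5, 3))                   # frozen backbone
task_vectors = [rng.normal(size=(5, 3)) for _ in range(2)]
X = rng.normal(size=(32, 5))                          # unlabeled batch

lmbda, lr, eps = np.ones(2), 0.1, 1e-5                # step 1: init to unity
for step in range(200):                               # step 3: iterate
    # central finite-difference gradient w.r.t. the coefficients only
    grad = np.array([
        (entropy_loss(lmbda + eps * np.eye(2)[k], theta_pre, task_vectors, X)
         - entropy_loss(lmbda - eps * np.eye(2)[k], theta_pre, task_vectors, X))
        / (2 * eps)
        for k in range(2)
    ])
    lmbda -= lr * grad                                # coefficient update
```

Only the two scalars in `lmbda` are optimized; the backbone and task vectors never change, which is why the procedure is so much cheaper than re-training.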

DAM’s computational cost is dominated by forward and backward passes for small mini-batches, and, due to freezing base model parameters, remains orders-of-magnitude lower than full evolutionary search or re-training (Gauthier-Caron et al., 2024).

4. Experimental Validation and Results

DAM frameworks have been validated across vision, NLP, regression, and diffusion models:

  • Vision Multi-task Merging (Yang et al., 2023, Wei et al., 2 Jan 2025): On CLIP-ViT backbones, layer-wise AdaMerging achieves 80–90% mean test accuracy (ViT-B/32 and ViT-L/14), yielding 7–11 percentage point gains over fixed-coefficient merging. Unsupervised entropy minimization enables strong generalization to unobserved tasks (+8.3% on MNIST/EuroSAT) and enhanced robustness under common corruptions (e.g., +11.2% under motion blur).
  • LLM Merging (Gauthier-Caron et al., 2024): DAM, with per-column scaling and KL-based loss, matches or surpasses evolutionary merging on benchmark suites (e.g., JP LM, Okapi, SQL-Eval) at a tenth of the compute cost. Model Soups remain competitive when source models are highly similar.
  • Local Regression with PU Stitching (Han et al., 2023): DAM significantly outperforms KRR, SVM, XGBoost, and neural networks on synthetic, PDE, environmental, and combustion benchmarks. Its theoretically accelerated convergence stems from polynomial reproduction in local models and global differentiability from partition-of-unity stitching.
  • Model Immunization (Zheng et al., 2024): In diffusion models, DAM-based MIMA yields larger similarity and adaptation gap ratios on multi-concept adversarial immunization (e.g., +3.75–6.36% mean similarity gap over joint training baselines), while preserving benign functionality.

In all evaluations, DAM closes much of the gap to joint multi-task training or independently fine-tuned experts, while requiring little or no access to the original training data.

5. Comparison with Prior Merging Approaches

DAM stands in contrast to, and interpolates between, multiple prior merging approaches:

Method               | Data Requirement         | Granularity          | Optimization
Model Soups          | None                     | Global/Layers        | Simple Average
TIES-Merging         | None                     | Layer/Feature        | Sign/Magnitude Pruning
AdaMerging           | Unlabeled Test Data      | Task/Layer           | Entropy Minimization
Evolutionary Merging | Target Datasets          | Weight/Layer/Feature | Black-Box Evolution
DAM                  | Logits or Small Datasets | Per-Column, Layer    | Gradient-Based
PU-KRR (Regression)  | Full Training Set        | Local Regions        | Block Solve, Smooth

DAM offers end-to-end differentiability and avoids black-box or manual hyperparameter tuning, affording fine-grained control and robustness, but does require access to model outputs or small labeled/unlabeled datasets. When model similarity is high, data-free methods (simple averaging) remain competitive (Gauthier-Caron et al., 2024).

6. Applications, Extensions, and Limitations

DAM has demonstrated impact in diverse domains:

  • Multi-task Learning: Merging expert models without original data, achieving state-of-the-art multi-task performance (Yang et al., 2023).
  • LLM Integration: Efficient, robust merging across domains/languages with orders-of-magnitude reduced compute (Gauthier-Caron et al., 2024).
  • Adaptive/Personalized Regression: Handling data with variable local complexity and density (Han et al., 2023).
  • Model Immunization: Preventing adaptation to harmful concepts via adversarially optimized, differentiable merge layers (Zheng et al., 2024).

Limitations include possible ill-conditioning in constrained merging for highly colinear concepts, necessity for small datasets or logits (except in entirely unsupervised variants), and computational or memory costs scaling with the merge coefficient space or target model size in some regimes. In model immunization, strong guarantees require careful regularization and choice of constraints (Zheng et al., 2024).

A plausible implication is that DAM frameworks will continue to provide a principled backbone for model composition, task transfer, and robustification in future multi-purpose and defense-oriented AI systems, particularly as diverse expert model repositories proliferate.
