AdaMerging: Adaptive Neural Model Merging
- AdaMerging is a framework of adaptive, automated, hyperparameter-free techniques that merge multiple pre-trained models into a single multi-task and cross-domain model without retraining.
- It employs optimized layerwise and blockwise strategies with gradient-based updates and multi-fidelity search to identify effective merging configurations.
- Empirical evaluations show substantial accuracy, generalization, and robustness gains over classical methods, with improvements up to 11% and enhanced performance under distribution shifts.
AdaMerging refers to a family of adaptive, automated, and hyperparameter-free methodologies for merging multiple pre-trained or fine-tuned neural models—typically for multi-task capability or cross-domain generalization—without requiring retraining on original task data. AdaMerging algorithms optimize or search for model “mixing” coefficients, layerwise strategies, or blockwise arrangements to overcome the rigidity and limitations of uniform or hand-tuned merging, providing substantial empirical advances in merged-model accuracy, generalization, and robustness over classical approaches such as task arithmetic or weight averaging (Su et al., 6 Feb 2025, Yang et al., 2023).
1. Foundations and Motivation
AdaMerging addresses the challenge where several specialized models—each fine-tuned for distinct tasks or domains—must be fused into a single model that inherits the parent capabilities without retraining. Classical merging approaches, such as linear averaging or naive task arithmetic (e.g., θ_merged = θ_pre + λ·Σ_k (θ_k − θ_pre) with a single, manually set coefficient λ), typically require hand-tuned hyperparameters and often degrade multi-task performance because they neglect task-to-task and layer-to-layer conflicts and intricate parameter correlations (Yang et al., 2023).
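The classical task-arithmetic baseline described above can be sketched as follows; the function and dictionary-of-floats weight representation are illustrative simplifications, not the original implementation.

```python
# Classical task arithmetic: a single, manually tuned coefficient lam
# scales the summed task vectors (theta_k - theta_pre). Weights are
# modeled as {param_name: value} dicts for illustration.
def task_arithmetic(theta_pre, task_models, lam=0.3):
    """Merge fine-tuned models into theta_pre via one global coefficient."""
    merged = dict(theta_pre)
    for name in theta_pre:
        for theta_k in task_models:
            merged[name] = merged[name] + lam * (theta_k[name] - theta_pre[name])
    return merged
```

The single global `lam` is exactly the rigidity AdaMerging removes: there is no way for this scheme to weight one task, or one layer, differently from another.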
Automated or adaptive merging converts the choice of merging coefficients and merging strategies into a learning or search problem, leveraging available unlabeled data or black-box objective proxies. This transition from hand-tuning to automated adaptive search aims to streamline merging workflows (eliminating extensive grid search) and to unlock Pareto-efficient operating points unattainable by uniform schemes.
2. Core AdaMerging Methodologies
The canonical AdaMerging approach (Yang et al., 2023) parameterizes the merged weights in terms of task vectors τ_k = θ_k − θ_pre:
- Task-wise: θ_merged = θ_pre + Σ_k λ_k·τ_k, with one learnable coefficient λ_k per task model
- Layer-wise: θ_merged^l = θ_pre^l + Σ_k λ_k^l·τ_k^l for each layer l, with one coefficient per task and per layer
The algorithm, in its core instantiation, solves min_λ Σ_x H(f(x; θ_merged(λ))), where H is the Shannon entropy of the model's predictions on batches of unlabeled, multi-task examples. Lower entropy acts as a surrogate for robust multi-task classification; empirically, it correlates strongly (high Spearman rank correlation across eight tasks) with lower supervised loss.
Optimization proceeds via gradient-based updates on the coefficients λ using Adam, leveraging backpropagation through the merged model, with λ initialized to a small constant (e.g., 0.3) and projected to remain non-negative at each step. The method supports both task-wise (per-model) and fine-grained layer-wise coefficients, allowing for significant improvements over uniform merging.
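A minimal NumPy sketch of the two ingredients above, assuming a toy dict-of-weights model; the function names are illustrative, and a real implementation would backpropagate the entropy through the network rather than treat it as a black box.

```python
import numpy as np

def merge_layerwise(theta_pre, task_vectors, lams):
    """theta_merged[l] = theta_pre[l] + sum_k lams[k][l] * tau_k[l]."""
    merged = {}
    for l, w in theta_pre.items():
        merged[l] = w + sum(lams[k][l] * tau[l]
                            for k, tau in enumerate(task_vectors))
    return merged

def prediction_entropy(logits):
    """Mean Shannon entropy of softmax predictions: the unsupervised
    surrogate objective minimized over the merging coefficients."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())
```

Maximally uncertain (uniform) predictions attain the entropy ceiling log K over K classes, so driving this quantity down pushes the merged model toward confident predictions on the unlabeled batches.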
3. Automated Model Merging Frameworks
More recent AdaMerging frameworks generalize adaptive model merging to highly structured and modular search spaces (Su et al., 6 Feb 2025). Two prominent search spaces are:
- Layer-wise Fusion Search (LFS): The model is partitioned into groups of layers, each with component types (e.g., MLP, attention, layer-norm), allowing choice among multiple merge operators—such as Task Arithmetic, TIES-Merging, SLERP, or Linear Merge—and their hyperparameters per group/component. The search variable encodes, for each group and component, both operator selection and the associated hyperparameters.
- Depth-wise Integration Search (DIS): The model is broken into blocks (blockwise sequence of layers). DIS searches over compositional arrangements—block selections across parents, permutations of block orderings, repetition/pruning indicators, and scaling factors for output calibration—allowing for nontrivial topology changes.
Both LFS and DIS dramatically expand the expressive power of merging beyond simple coefficient learning, supporting per-layer and per-block operator heterogeneity, reordering, and pruning.
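To make the two search spaces concrete, a candidate configuration for each might be encoded as below; the field names and group/block identifiers are hypothetical, while the operator names follow the text.

```python
# Hypothetical LFS configuration: per (group, component), an operator
# choice plus its hyperparameters.
lfs_config = {
    ("group0", "attention"): {"op": "ties_merging", "density": 0.4, "lam": 0.6},
    ("group0", "mlp"):       {"op": "slerp", "t": 0.35},
    ("group1", "mlp"):       {"op": "task_arithmetic", "lam": 0.3},
}

# Hypothetical DIS configuration: a compositional block arrangement.
dis_config = {
    # each entry: (parent model, block index in that parent, output scale)
    "block_sequence": [("parent_A", 0, 1.0),
                       ("parent_B", 1, 0.9),   # cross-parent selection
                       ("parent_A", 1, 1.0)],  # reordering/repetition allowed
    "pruned_blocks": [("parent_B", 2)],        # depth pruning
}
```

Note how LFS varies *how* parameters are combined within a fixed topology, whereas DIS varies the topology itself.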
4. Multi-Fidelity Optimization and Search Algorithms
Given the high cost of evaluating candidate merges (even a single forward pass over a large validation set is costly for LLMs), AdaMerging leverages multi-fidelity optimization (Su et al., 6 Feb 2025). Validation-set size serves as the "fidelity" budget parameter, and candidate configurations are evaluated in brackets of increasing budget (as in Hyperband). Most configurations are discarded at low fidelity, and only a small fraction advances to full evaluation.
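The bracketed evaluation schedule can be sketched as one successive-halving bracket; the budget values and the `evaluate(config, budget)` callback are illustrative assumptions.

```python
def successive_halving(configs, evaluate, min_budget=64, eta=3, max_budget=1728):
    """Evaluate candidate merges on growing validation subsets, keeping
    the top 1/eta at each rung (one Hyperband-style bracket, sketched).

    evaluate(config, budget) -> score on `budget` validation examples.
    """
    budget, survivors = min_budget, list(configs)
    while budget <= max_budget and len(survivors) > 1:
        scored = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = scored[:max(1, len(scored) // eta)]  # keep top 1/eta
        budget *= eta  # promoted configs get eta-times more data
    return survivors[0]
```

Cheap low-fidelity scores eliminate most candidates, so only a handful ever touch the full validation set.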
Optimization proceeds by coupling Bayesian surrogate models (e.g., Random Forest regressors over a configuration x, where x encodes the merge parameters and search choices) with sample-efficient acquisition (e.g., Expected Improvement). For multi-objective settings (e.g., maximizing multiple task performances), the objective is scalarized via the ParEGO method (augmented Tchebycheff), and the Pareto frontier is constructed via repeated runs. This process automates the previously manual tuning of merging hyperparameters and enables effective discovery of superior merged models within a constrained compute budget (typically several hundred search steps).
5. Empirical Outcomes and Benchmarks
Comprehensive evaluation on benchmarks such as image classification (e.g., SUN397, Stanford Cars, EuroSAT, MNIST) and LLM tasks (e.g., GSM8K, MATH, MMLU, MBPP) demonstrates (Yang et al., 2023, Su et al., 6 Feb 2025):
- Layer-wise AdaMerging exhibits substantial accuracy gains over classical schemes. For ViT-B/32 on 8 tasks, layer-wise AdaMerging yields 80.1% average accuracy (+11.0% over Task Arithmetic and +7.7% over TIES-Merging).
- In generalization settings (merging on 6 of 8 tasks, testing on held-out tasks), AdaMerging achieves a uniform 4–9% improvement over classical baselines.
- Robustness under distribution shift is substantially enhanced: AdaMerging provides 6–11% higher accuracy under CIFAR-style corruptions.
- In LLM settings, single-objective AdaMerging LFS produces +4.24% gains on GSM8K, and DIS produces up to +1.5% on MMLU; multi-objective search achieves aggregate improvement (+6.86% tri-task average over best parent).
Ablation studies confirm that layer-wise parameterizations and fine-grained search improve accuracy over uniform/tied merging and that modest batch sizes and step counts suffice for convergence. The method scales favorably with respect to compute, as it operates only over small search spaces or via surrogate-guided bracketed evaluations.
6. Practical Guidelines and Limitations
Layer-wise AdaMerging is recommended where the parameter dimension is manageable, as it dramatically outperforms coarse task-wise techniques. A few thousand unlabeled, test-like samples suffice for robust coefficient learning; no access to fine-tuning data or labels is required. Optimization is performed efficiently with Adam at learning rate 1e-3 and batch sizes of 16–32, with λ clipped to remain nonnegative.
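The projected update on the coefficients amounts to an ordinary gradient step followed by clipping to the nonnegative orthant; the sketch below uses plain SGD for clarity where the text prescribes Adam, and the function name is illustrative.

```python
def projected_step(lams, grads, lr=1e-3):
    """One gradient step on the merging coefficients, then projection
    onto lambda >= 0 (SGD stand-in for the Adam update in the text)."""
    return [max(0.0, l - lr * g) for l, g in zip(lams, grads)]
```

Coefficients driven below zero by a large gradient simply land at zero, effectively switching that task vector (or layer) off rather than subtracting it.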
Limitations include the need for model parameter alignment (i.e., all models must share the same base architecture) and sufficient similarity in parent models to enable successful information fusion; large divergences in parent training or task objectives can challenge unsupervised AdaMerging (Yang et al., 2023, Su et al., 6 Feb 2025).
Search-based AdaMerging (LFS/DIS) introduces new tuning axes (search granularity, block size, operator set), and over-parameterization can exhaust search budgets without further gains. The approach's modularity enables cross-modal or blockwise merging—potentially applicable to vision-language or hierarchical model fusion—though such directions remain largely unexplored.
7. Adaptivity and Potential Extensions
Adaptivity in AdaMerging manifests through multi-fidelity scheduling, automated per-layer/per-block operator and hyperparameter selection, and online Bayesian surrogate updating as new performance data is collected. Extension opportunities include online merging (adapting fusion strategies dynamically as new tasks arrive), dynamic fidelity scheduling, cross-modal model merging, and incorporating compute or latency into Pareto-optimal search fronts. Hierarchical expansion of DIS (search over block sizes and recursive submodule merges) also represents a promising avenue toward modular, scalable model merging (Su et al., 6 Feb 2025).
AdaMerging thus provides a flexible, efficient, and generalizable framework for combining capabilities of multiple models without costly retraining, establishing a new empirical and methodological foundation for automated model integration in multitask and multidomain settings.