Expert Merging: Principles & Methods

Updated 21 December 2025
  • An expert-merging procedure is a framework that consolidates multiple specialized neural models into a unified model while preserving each model's expertise.
  • Techniques range from simple linear averaging to curvature-aware, game-theoretic, and subspace-based approaches that resolve parameter conflicts.
  • Recent studies show that robust merging methods improve task generalization, reduce computational costs, and facilitate scalable, multi-specialist deployments.

An expert-merging procedure is any principled algorithmic framework for consolidating multiple “expert” models—typically neural networks fine-tuned or specialized for different tasks or domains—into a single model that preserves or integrates the expertise of the original constituents. In modern machine learning, expert-merging is critical for scalable deployment of large, sparse Mixture-of-Experts (MoE) architectures, for constructing generalist multitask models from domain-specific specializations, and for compressing expert pools to reduce memory and computational overhead. Expert-merging techniques range from simple linear parameter averaging to curvature-aware or game-theoretic updates, robust optimization in functional or subspace domains, and hybrid procedures combining symbolic rules with deep models. This entry synthesizes core algorithmic principles, theoretical formulations, leading families of algorithms, major empirical findings, and implementation best practices.

1. Mathematical Foundations and Problem Statement

Let $\{\theta_k\}_{k=1}^N$ denote expert models, each comprising a high-dimensional parameter vector (or, more generally, tensors per architectural block or layer). The expert-merging objective is to construct a merged model $\theta_{\mathrm{merge}}$ via

$$\theta_{\mathrm{merge}} = f(\{\theta_k\}, \theta_0, \mathcal{D}, \mathcal{H}),$$

where $\theta_0$ may be a base model (shared initialization), $\mathcal{D}$ represents optional calibration data, and $\mathcal{H}$ is the set of method-specific hyperparameters or heuristics.

“Task vectors” are widely used: define $\tau_k = \theta_k - \theta_0$ for each expert. The classical “Task Arithmetic” merge is then

$$\theta_{\mathrm{merge}} = \theta_0 + \alpha \sum_{k=1}^N \tau_k,$$

where the global scale $\alpha$ is selected by validation or fixed (often $\alpha = 1/N$). More general merges introduce per-task, per-layer, or even channel-specific coefficients, masks, or matrices acting as alignment or denoising operators (Yadav et al., 4 Oct 2024, Nguyen et al., 26 Feb 2025, Zhang et al., 30 Sep 2025).
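For concreteness, a minimal sketch of Task Arithmetic over state dictionaries of NumPy arrays is shown below; the dict-of-arrays representation and the default $\alpha = 1/N$ are illustrative assumptions rather than requirements of any particular method.

```python
import numpy as np

def task_arithmetic_merge(theta0, experts, alpha=None):
    """Task Arithmetic: theta_merge = theta_0 + alpha * sum_k (theta_k - theta_0).

    theta0 and each element of `experts` are dicts mapping parameter names to
    np.ndarray (an assumed serialization format for illustration).
    """
    if alpha is None:
        alpha = 1.0 / len(experts)  # common default scaling alpha = 1/N
    return {
        name: theta0[name] + alpha * sum(e[name] - theta0[name] for e in experts)
        for name in theta0
    }

# Toy usage:
# theta0 = {"w": np.zeros((4, 4))}
# experts = [{"w": np.random.randn(4, 4)} for _ in range(3)]
# merged = task_arithmetic_merge(theta0, experts)
```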

Formally, the merged model aims to maximize task-completeness, robustness, and generalization by integrating the specialized knowledge in $\{\theta_k\}$ under possibly conflicting, redundant, or nontrivially related parameterizations.

2. Principal Algorithmic Families

Expert-merging algorithms can be categorized by their underlying geometric, optimization, or structural assumptions.

A. Linear/Euclidean and Static Heuristics

  • Averaging: $\theta_{\mathrm{merge}} = \frac{1}{N} \sum_k \theta_k$ (Yadav et al., 4 Oct 2024).
  • Task Arithmetic: $\theta_{\mathrm{merge}} = \theta_0 + \alpha \sum_k \tau_k$.
  • TIES/DARE: Coordinate-wise sign alignment, sparsification (prune the bottom $1-p$ fraction of task-vector entries), and/or Bernoulli dropout to reduce destructive interference (Yadav et al., 4 Oct 2024, Ueda et al., 4 Nov 2025); a sketch of the sign-consensus pattern follows this list.
  • Masking: Retention only of entries with sign consensus; TALL mask merging applies subsequent pruning per task (Sharma et al., 16 Oct 2024).
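The following simplified sketch illustrates the trim/sign-elect/average pattern behind TIES-style merging; the quantile-based trimming and the sign-election rule used here are illustrative simplifications of the published algorithms.

```python
import numpy as np

def ties_style_merge(theta0, taus, alpha=1.0, keep=0.2):
    """Sign-consensus merge of task vectors (simplified TIES-style sketch)."""
    merged = {}
    for name in theta0:
        T = np.stack([t[name] for t in taus])  # (N, *param_shape) task vectors
        # 1. trim: keep only the top-`keep` fraction of entries by magnitude, per expert
        axes = tuple(range(1, T.ndim))
        thresh = np.quantile(np.abs(T), 1.0 - keep, axis=axes, keepdims=True)
        T = np.where(np.abs(T) >= thresh, T, 0.0)
        # 2. elect a per-coordinate sign (here: sign of the summed trimmed updates)
        elected = np.sign(T.sum(axis=0))
        # 3. average only the entries that agree with the elected sign
        agree = (np.sign(T) == elected) & (T != 0)
        summed = np.where(agree, T, 0.0).sum(axis=0)
        count = np.maximum(agree.sum(axis=0), 1)
        merged[name] = theta0[name] + alpha * summed / count
    return merged
```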

B. Curvature- and Structure-Aware Merging

  • Natural Gradient/Curvature Preconditioning (CAMEx): Incorporates manifold geometry via learned curvature matrices $M_i$ per expert:

$$\hat{E}_m = E_m + \alpha \sum_{i=1}^{N-1} M_i (s_i \tau_i),$$

where each $M_i$ is learned by rank-one outer-product updates, approximating the local Fisher metric (Nguyen et al., 26 Feb 2025).

  • Nash Bargaining (NAMEx): Treats merging as a game-theoretic bargaining problem, yielding updates of the form

$$\Delta^{(l)} = \sum_{i=1}^N \alpha_i \tau^{(l)}_i,$$

where the coefficients $\alpha$ maximize the Nash product over expert utilities $u_i(\delta) = \tau_i^\top \delta$ (Nguyen et al., 17 Oct 2025).
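As a toy illustration of how bargaining coefficients might be computed, the sketch below runs projected gradient ascent on the log Nash product $\sum_i \log u_i(\delta)$ with simplex-constrained coefficients; the constraint, clipping, and optimizer are assumptions made for this sketch and are not taken from the cited paper.

```python
import numpy as np

def nash_merge_weights(taus, steps=500, lr=1e-2, eps=1e-8):
    """Toy solver for coefficients alpha maximizing sum_i log(tau_i^T delta),
    with delta = sum_j alpha_j tau_j and alpha restricted to the simplex (an assumption)."""
    T = np.stack([t.ravel() for t in taus])  # (N, d) flattened task vectors
    N = T.shape[0]
    G = T @ T.T                              # Gram matrix of task vectors
    alpha = np.full(N, 1.0 / N)
    for _ in range(steps):
        u = G @ alpha                        # utilities u_i = tau_i^T delta
        u = np.clip(u, eps, None)            # keep the Nash product well-defined
        grad = G @ (1.0 / u)                 # gradient of sum_i log u_i w.r.t. alpha
        alpha = np.clip(alpha + lr * grad, 0.0, None)
        alpha /= alpha.sum() + eps           # crude projection back to the simplex
    return alpha

# Merged layer update: delta = sum_i alpha[i] * taus[i]
```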

C. Functional, Subspace, and Robust Optimization

  • Subspace Boosting: Decomposes the merged task vector in each layer by SVD, clamps smaller singular values above a threshold, and reconstructs to prevent rank collapse in high-dimensional merging (Skorobogat et al., 19 Jun 2025).
  • Sub-MoE: Joint SVD across expert weight matrices gives a shared left basis $U$, merges expert projections in the right singular space with frequency weighting, and reconstructs merged experts as

$$W_{\text{merged}} = U (V_{\text{merged}})^\top$$

(Li et al., 29 Jun 2025); a minimal sketch of this shared-basis reconstruction appears after this list.

  • OptMerge: Centers and denoises per-expert task vectors via low-rank SVD, then robustly optimizes a global interaction loss to prevent noise amplification, summing over all layers (Wei et al., 26 May 2025).
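The following is a minimal sketch of the shared-basis reconstruction used in Sub-MoE-style merging; the rank choice, the SVD over concatenated weights, and the exact frequency weighting are illustrative assumptions.

```python
import numpy as np

def shared_basis_merge(expert_weights, freqs, rank):
    """Merge expert matrices via a shared left singular basis: W_merged = U V_merged^T."""
    W_cat = np.concatenate(expert_weights, axis=1)   # (d_out, N * d_in)
    U, _, _ = np.linalg.svd(W_cat, full_matrices=False)
    U = U[:, :rank]                                  # shared left basis
    # Project each expert into the shared subspace: W_k ≈ U (V_k)^T with (V_k)^T = U^T W_k
    V_ts = [U.T @ W for W in expert_weights]         # each (rank, d_in)
    w = np.asarray(freqs, dtype=float)
    w = w / w.sum()                                  # routing-frequency weights
    V_merged_t = sum(wk * Vt for wk, Vt in zip(w, V_ts))
    return U @ V_merged_t                            # merged (d_out, d_in) expert
```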

D. Data-Driven/Metaheuristic Strategies

  • Expert Merging++: Solves for per-layer or per-chunk alignment coefficients by matching merged activations and logits to those of experts, regularized for stability. Importance-guided chunking allocates more coefficients to critical layers (Zhang et al., 30 Sep 2025).
  • PSO-Merging: Treats merging as a black-box search in parameter space using Particle Swarm Optimization, optimizing task-specific proxy rewards over a pool of experts and sparsified variants (Zhang et al., 27 Aug 2025); a generic sketch follows this list.
  • Online Neural Bandit: Task-aware merging weights selected dynamically via a neural contextual bandit router, balancing past observed task mixture with model performance to select optimal expert-weight vectors in online inference (Han et al., 24 Sep 2025).
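The generic sketch below illustrates the black-box coefficient search behind PSO-Merging with a caller-supplied proxy reward; the coefficient bounds, swarm hyperparameters, and reward interface are assumptions rather than the published configuration.

```python
import numpy as np

def pso_merge(theta0, taus, reward_fn, n_particles=16, iters=50,
              inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Particle swarm over per-expert merge coefficients; reward_fn scores a merged state dict."""
    rng = np.random.default_rng(seed)
    N = len(taus)

    def build(coefs):
        # merged parameters for a given coefficient vector
        return {k: theta0[k] + sum(c * t[k] for c, t in zip(coefs, taus)) for k in theta0}

    pos = rng.uniform(0.0, 1.0, size=(n_particles, N))   # candidate coefficient vectors
    vel = np.zeros_like(pos)
    scores = np.array([reward_fn(build(p)) for p in pos])
    pbest, pbest_score = pos.copy(), scores.copy()
    g_idx = int(scores.argmax())
    gbest, gbest_score = pos[g_idx].copy(), scores[g_idx]

    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, 0.0, 1.0)
        scores = np.array([reward_fn(build(p)) for p in pos])
        improved = scores > pbest_score
        pbest[improved], pbest_score[improved] = pos[improved], scores[improved]
        if scores.max() > gbest_score:
            g_idx = int(scores.argmax())
            gbest, gbest_score = pos[g_idx].copy(), scores[g_idx]

    return build(gbest), gbest
```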

E. Fine-Grained and Structure-Preserving Merging

  • Channel Merging: Clusters per-channel deltas across experts, merging only highly similar parameters within each group and maintaining per-channel index maps per expert. This preserves specialization and enables scalable reductions in memory use (Zhang et al., 18 Dec 2024); a rough sketch follows this list.
  • PuzzleMoE: Entrywise sparse merging via dual-mask (similarity + saliency) logic, using bit-packed encoding for activation-efficient inference. Merging is done at the granularity of individual weights, combining only compatible or important entries (Zhao et al., 6 Nov 2025).
  • MergeMoE: Output-level merging, representing the merged expert as an optimized linear combination of expert outputs. Compression matrices are solved for by minimizing output approximation errors under practical routing distributions (Miao et al., 16 Oct 2025).
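A rough sketch of the channel-clustering idea for a single linear layer follows; the cosine k-means routine and the index-map format are illustrative assumptions, and the published method's grouping and reconstruction differ in detail.

```python
import numpy as np

def channel_cluster_merge(W0, expert_Ws, n_clusters=64, iters=10, seed=0):
    """Cluster per-channel (row-wise) deltas across experts and merge within clusters.

    Returns merged delta rows plus a per-expert index map into them."""
    deltas = np.concatenate([W - W0 for W in expert_Ws], axis=0)   # (N*d_out, d_in)
    feats = deltas / (np.linalg.norm(deltas, axis=1, keepdims=True) + 1e-8)
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), n_clusters, replace=False)]
    for _ in range(iters):                                         # simple cosine k-means
        assign = np.argmax(feats @ centers.T, axis=1)
        for c in range(n_clusters):
            members = feats[assign == c]
            if len(members):
                centers[c] = members.mean(axis=0)
        centers /= np.linalg.norm(centers, axis=1, keepdims=True) + 1e-8
    merged_rows = np.stack([
        deltas[assign == c].mean(axis=0) if (assign == c).any() else np.zeros(deltas.shape[1])
        for c in range(n_clusters)
    ])
    index_maps = assign.reshape(len(expert_Ws), -1)                # (N, d_out) lookup per expert
    # Expert k's channel j is then approximated by W0[j] + merged_rows[index_maps[k, j]]
    return merged_rows, index_maps
```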

3. Implementation Procedures and Pseudocode Templates

Most algorithms proceed generically as follows:

  1. Task Vector Extraction: Compute $\tau_k = \theta_k - \theta_0$ for each expert; or, for layer $l$, extract per-layer or per-block tensors.
  2. Alignment/Preprocessing: (If needed) Align experts via permutation matching, subspace alignment, SVD/SAD, or clustering according to functional or parameter similarity (Sharma et al., 16 Oct 2024, Li et al., 29 Jun 2025).
  3. Merging Step: Apply the method-specific combiner (see Table for examples).
| Method | Merging Update | Key Formula / Step |
| --- | --- | --- |
| Task Arithmetic | $\theta_0 + \alpha \sum \tau_k$ | Linear sum, scaled by $\alpha$ |
| TIES | Mask/average (sign consensus) | Majority sign per coordinate, zeroing conflicting entries |
| Subspace Boosting | SVD + clamping | SVD per block, clamp singular values, reconstruct and sum |
| CAMEx | Curvature-metric update | $\hat{E}_m = E_m + \alpha \sum M_i(s_i \tau_i)$ |
| Expert Merging++ | Layer-wise chunked coefficients | $\theta_m^\ell = \theta_0^\ell + \sum_k \alpha_k^\ell \tau_k^\ell$ |
| PuzzleMoE | Dual-masked element merge | $W_{\mathrm{merged}}$ via similarity and saliency masks |
| Channel Merging | Clustered per-channel sum | Merge within k-means groups based on per-channel cosine similarity |
  4. Optional Postprocessing: Apply further fine-tuning, per-task normalization (Sharma et al., 16 Oct 2024), output-layer correction, or offline index lookup (for fine-grained merges (Zhao et al., 6 Nov 2025, Zhang et al., 18 Dec 2024)).
  5. Evaluation/Selection: Validate the merged model using task-specific held-out data, emergent-skill metrics, or proxy performance statistics (Ueda et al., 4 Nov 2025).
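The steps above can be summarized as a small skeleton with a pluggable combiner; the dict-of-arrays state format and the hook signatures below are illustrative assumptions.

```python
import numpy as np

def merge_experts(theta0, experts, combiner, postprocess=None, evaluate=None):
    """Generic expert-merging pipeline: extract task vectors, combine, postprocess, evaluate."""
    # 1. task-vector extraction
    taus = [{k: e[k] - theta0[k] for k in theta0} for e in experts]
    # 2-3. alignment/preprocessing and the method-specific merge live inside `combiner`,
    #      which maps (parameter name, list of task-vector tensors) -> merged update
    merged = {k: theta0[k] + combiner(k, [t[k] for t in taus]) for k in theta0}
    # 4. optional postprocessing (fine-tuning, per-task normalization, ...)
    if postprocess is not None:
        merged = postprocess(merged)
    # 5. evaluation/selection on held-out data or proxy metrics
    score = evaluate(merged) if evaluate is not None else None
    return merged, score

# e.g. Task Arithmetic as a combiner with a fixed global scale:
# merged, _ = merge_experts(theta0, experts, lambda name, ts: 0.5 * sum(ts))
```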

4. Addressing Parameter Conflicts and Interference

Parameter interference, in which independent domain-specific updates destructively overlap during merging, is central to expert-merging efficacy. Several methods directly address this:

  • Entrywise sparsification and sign conflict elimination (TIES, DARE): Pruning/correcting conflicting indices preserves only reliable, consensus-driven parameter updates (Yadav et al., 4 Oct 2024, Ueda et al., 4 Nov 2025).
  • Subspace alignment and boosting: By controlling the energy/rank of merged updates, subspace methods prevent dominance of a degenerate direction, alleviating accuracy collapse with many experts (Skorobogat et al., 19 Jun 2025, Li et al., 29 Jun 2025).
  • Curvature-aware/proxy Fisher-metric methods (CAMEx, NAMEx): Updates sensitive to the underlying Riemannian metric of parameter space adjust step size and directionality to the local geometry, yielding more stable and effective compromise points (Nguyen et al., 26 Feb 2025, Nguyen et al., 17 Oct 2025).
  • Fine-grained grouping (Channel/PuzzleMoE): Clustering by parameter or activation similarity ensures only truly similar parameters are merged, drastically reducing cross-expert contamination (Zhang et al., 18 Dec 2024, Zhao et al., 6 Nov 2025).

When merging more than two or three highly divergent experts, empirical evidence shows that naive merges often induce rank collapse, instability, and degraded specialization unless such mechanisms are in place (Skorobogat et al., 19 Jun 2025, Yadav et al., 4 Oct 2024).

5. Empirical Findings and Performance Characteristics

Comprehensive empirical studies across LLMs (Mistral, Llama-3, T5), multimodal transformers (InternVL, Qwen2-VL), and vision/backbone variants reveal:

  • Task Arithmetic, TIES, and similar static merges recover most expert-specific performance in large models ($\geq$ 10B parameters) but are sensitive to scale and task-vector magnitudes (Yadav et al., 4 Oct 2024).
  • Subspace, curvature, and output-level methods (CAMEx, Subspace-Boosted, OptMerge, MergeMoE) robustly improve generalization, convergence, and performance recovery, especially with larger or more heterogeneous expert sets (Nguyen et al., 26 Feb 2025, Wei et al., 26 May 2025, Miao et al., 16 Oct 2025).
  • Fine-grained structural merging (Channel Merging, PuzzleMoE, etc.) achieves substantial parameter reduction (e.g., roughly 50% less storage alongside inference speedups) at negligible or sub-1% performance loss, benefiting MoEs with many experts (Zhang et al., 18 Dec 2024, Zhao et al., 6 Nov 2025).
  • Emergent capabilities: Under certain task and domain pairings, merging gives rise to emergent skills not present in any constituent, though this is not reliably predicted by model similarity alone (Ueda et al., 4 Nov 2025).
  • Curvature- and game-theoretic merges provide consistent gains in robustness, out-of-domain accuracy, and label-efficiency (Nguyen et al., 26 Feb 2025, Nguyen et al., 17 Oct 2025).

Empirical ablations identify critical hyperparameters (prune ratio, clamping thresholds, merged-layer rank, merge coefficients), and establish that large models absorb more experts with less accuracy loss, while small models degrade rapidly past $N \approx 4$ experts (Yadav et al., 4 Oct 2024).

6. Practical Guidelines and Implementation Considerations

Several general guidelines emerge across the literature. Best practices recommend validation on the target tasks, hyperparameter sweeps over the global scale and pruning threshold, and routine memory/latency benchmarking for large-scale or deployment-adjacent use cases.
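A minimal sketch of such a validation sweep over the global scale and pruning threshold is given below; the grid values, the magnitude-pruning rule, and the `val_fn` interface are illustrative assumptions.

```python
import itertools
import numpy as np

def prune_smallest(t, p):
    """Zero out the bottom-p fraction of entries by magnitude."""
    if p <= 0:
        return t
    thresh = np.quantile(np.abs(t), p)
    return np.where(np.abs(t) >= thresh, t, 0.0)

def sweep_merge(theta0, taus, val_fn,
                alphas=(0.3, 0.5, 1.0), prune_ratios=(0.0, 0.5, 0.8)):
    """Grid-search merge hyperparameters against a held-out validation metric."""
    best = None
    for alpha, p in itertools.product(alphas, prune_ratios):
        merged = {k: theta0[k] + alpha * sum(prune_smallest(t[k], p) for t in taus)
                  for k in theta0}
        score = val_fn(merged)              # task-specific held-out metric
        if best is None or score > best[0]:
            best = (score, alpha, p, merged)
    return best   # (score, alpha, prune_ratio, merged_state)
```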

7. Techniques Beyond Parameter-Space Merging

Hybrid expert-merging also appears in symbolic/AI agent architectures. For example, symbolic rule-based expert modules are merged with neural LLM agents via deterministic pipeline composition: rule-based outputs take precedence, the LLM is used where no rule matches, and conflict resolution prioritizes rules, then longer or higher-confidence segments (Long et al., 13 Nov 2024). The merging logic is precedence-based, not probabilistic, with systematic prompt engineering and iterative rule refinement validated by precision and inter-coder reliability against human annotation.
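This control flow can be sketched as follows; the `RuleMatch` record and the callable interfaces are hypothetical and only illustrate the rules-first, LLM-fallback composition with the stated conflict-resolution order.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class RuleMatch:
    label: str
    span_length: int
    confidence: float

def hybrid_annotate(segment: str,
                    rules: List[Callable[[str], Optional[RuleMatch]]],
                    llm_fn: Callable[[str], str]) -> str:
    """Rule outputs take precedence; the LLM handles segments that no rule matches."""
    matches = [m for m in (rule(segment) for rule in rules) if m is not None]
    if matches:
        # conflict resolution: rules first, then longer, then higher-confidence matches
        best = max(matches, key=lambda m: (m.span_length, m.confidence))
        return best.label
    return llm_fn(segment)
```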

Similarly, in causal modeling, merging expert causal graphs requires establishing pairwise compatibility (structural/functional), recursively merging compatible variables or subgraphs, calibrating confidences via Bayesian weights, and decomposing where incompatibility arises. Here, merging is achieved not by parameter addition but by commutative, associative operators across acyclic graph structures, with complexity exponential in the number of experts but polynomial per merge when $n$ is small (Alrajeh et al., 2020).


In summary, expert-merging procedures constitute a diverse set of algorithmic paradigms enabling scalable, robust, and often compressive integration of multiple expert models into high-capacity multitask or generalist systems. Their continued development combines geometric, statistical, combinatorial, and learning-theoretic insights to advance the tractable deployment of specialized and generalist AI at scale (Nguyen et al., 26 Feb 2025, Ueda et al., 4 Nov 2025, Zhang et al., 30 Sep 2025, Zhang et al., 27 Aug 2025, Li et al., 29 Jun 2025, Zhang et al., 18 Dec 2024).
