Sparse Expert Merging Techniques
- Sparse expert merging is a set of techniques that combine specialized model components using targeted sparsity to maintain individual expertise and enhance efficiency.
- Key methodologies include mask-based localization, subspace alignment, output clustering, and combinatorial optimization for precise and modular model integration.
- These approaches reduce computational and memory overhead while minimizing destructive interference, thereby supporting scalable, continual updates in large models.
Sparse expert merging refers to a class of methodologies that combine multiple specialized models—typically, domain expert networks or adapters—into a single unified model while leveraging explicit sparsity at the level of parameters, routing, or structural selection. This paradigm is central to the efficient scaling, compression, and modular extensibility of large language models (LLMs) and sparse Mixture-of-Experts (MoE) architectures. Recent progress encompasses training-light alignment, element- or block-wise parameter selection, subspace-based alignment, hierarchical output clustering, combinatorial optimization, and bit-packed hardware-efficient design. Sparse expert merging is distinguished by constraints or mechanisms that preserve specialization and minimize destructive interference between merged experts, while achieving strong efficiency in both compute and memory.
1. Methodological Foundations of Sparse Expert Merging
Sparse expert merging methods operate at several granularity levels:
- Parameter selection: Only a structurally sparse subset of model parameters is considered for merging, often relying on data-driven saliency or mask learning.
- Chunk- or block-level merging: Certain algorithms refine merging precision by assigning coefficients to localized parameter regions or blocks, guided by a layer- or chunk-importance metric.
- Subspace decomposition: Methods such as joint SVD (as in Sub-MoE) enforce a shared latent basis across experts, merging only expert-specific components within an aligned subspace.
- Expert output aggregation: Some frameworks (e.g., MergeMoE) merge entire expert outputs using auxiliary aggregation matrices to approximate the behavior of the original ensemble, allowing compression without dense parameter mixing.
- Entrywise selection and hashing: Approaches such as PuzzleMoE perform entrywise similarity checks, employing dual-masks to capture redundancy and saliency, and exploit hardware-friendly bit-packing.
Across these methods, the primary goals are to reduce memory and/or computational overhead, preserve or even enhance downstream performance, and allow for modular updates to the expert set with minimal retraining.
2. Unsupervised Alignment and Importance-Guided Merging
A prominent category is "training-light" or unsupervised alignment-based merging, as exemplified by Expert Merging and Expert Merging++ (Zhang et al., 30 Sep 2025). Here, multiple task-finetuned experts are linearly combined at the layer level:

$$\theta^{\text{merged}}_{\ell} = \theta^{\text{pre}}_{\ell} + \sum_{k} \lambda_{k,\ell}\,\big(\theta^{(k)}_{\ell} - \theta^{\text{pre}}_{\ell}\big),$$

where the coefficients $\lambda_{k,\ell}$ are learned via minimization of hidden-state and logit alignment losses on a small, unlabeled calibration set, supplemented by regularization to maintain coefficient stability. This approach resolves the challenge of aligning internal representations and output distributions between heterogeneous experts, ensuring the merged model consistently approximates each domain expert without requiring access to ground-truth labels.
Expert Merging++ further refines this approach by allocating more merging coefficients to "important" layers, defined by the magnitude of task-vectors, coefficient norms, and parameter counts. Layer chunking thus concentrates capacity in regions sensitive to task specialization, reducing unnecessary redundancy elsewhere. Empirically, this strategy outperforms heuristic and purely training-free approaches by 1–2 points on a broad set of language and vision tasks, and can even surpass supervised mixture training under sufficient calibration (Zhang et al., 30 Sep 2025).
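To make the mechanism concrete, the following is a minimal sketch of layer-wise coefficient learning against unlabeled calibration data. The toy MLP experts, the output-alignment-only objective, and the hyperparameters are illustrative assumptions rather than the exact Expert Merging recipe.

```python
# Sketch: learn per-(expert, layer) merging coefficients by aligning the merged
# model's outputs with each expert's on an unlabeled calibration batch.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, h, num_experts, n_calib = 16, 32, 3, 256

def make_mlp():
    return nn.Sequential(nn.Linear(d, h), nn.ReLU(), nn.Linear(h, d))

base = make_mlp()
experts = []
for _ in range(num_experts):
    e = make_mlp()
    # toy "fine-tuned" experts: small perturbations of the base model
    e.load_state_dict({k: v + 0.05 * torch.randn_like(v) for k, v in base.state_dict().items()})
    experts.append(e)

# Per-expert task vectors, one per parameter tensor ("layer").
base_sd = base.state_dict()
task_vecs = [{k: e.state_dict()[k] - base_sd[k] for k in base_sd} for e in experts]
names = list(base_sd.keys())

# One learnable merging coefficient per (expert, layer).
lam = nn.Parameter(torch.zeros(num_experts, len(names)))
opt = torch.optim.Adam([lam], lr=1e-2)
x_calib = torch.randn(n_calib, d)          # small, unlabeled calibration set

def merged_forward(x):
    # theta_merged = theta_pre + sum_k lam[k, l] * tau[k, l], applied functionally
    params = {name: base_sd[name] + sum(lam[k, j] * task_vecs[k][name] for k in range(num_experts))
              for j, name in enumerate(names)}
    return torch.func.functional_call(base, params, (x,))

for step in range(200):
    opt.zero_grad()
    y = merged_forward(x_calib)
    # align the merged model's outputs with each expert on calibration inputs
    loss = sum(((y - e(x_calib).detach()) ** 2).mean() for e in experts)
    loss = loss + 1e-3 * lam.pow(2).sum()  # regularizer keeping coefficients stable
    loss.backward()
    opt.step()

print("learned per-layer coefficients:\n", lam.detach())
```

In the chunk-level refinement of Expert Merging++, the single coefficient per layer would be replaced by several coefficients assigned to localized parameter chunks within important layers.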
3. Localization-Based Sparse Merging and Modular Stitching
Localization- and mask-based sparse merging algorithms, notably Localize-and-Stitch (He et al., 24 Aug 2024), proceed via two explicit steps:
- Localization: For each expert task vector $\tau_k = \theta_k - \theta_{\text{pre}}$, learn a binary mask $m_k$ identifying a tiny, task-critical subset (on the order of 1% of parameters). Mask optimization uses sparse, regularized loss minimization, with fallback to magnitude-based heuristics in dataless scenarios.
- Stitching: Merged parameters are constructed as
$$\theta_{\text{merged}} = \theta_{\text{pre}} + \sum_{k} \frac{m_k \odot \tau_k}{\max\!\big(1,\ \textstyle\sum_{j} m_j\big)},$$
where the elementwise denominator resolves overlaps by equal averaging in mask intersections.
This localizes expert modifications to non-overlapping regions, drastically reducing interference and storage (to ~1% per expert per task) and enabling continual addition of new experts. Sparse mask composition avoids the catastrophic interference observed in naive full-parameter merging. Localize-and-Stitch surpasses previous global and sparse merging baselines in both efficiency and accuracy for both language and vision models (He et al., 24 Aug 2024).
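A compact sketch of this localize-and-stitch flow is given below, using the magnitude-based (dataless) localization heuristic mentioned above; the `localize`/`stitch` helpers and the 1% sparsity setting are illustrative choices, not the learned-mask variant of the method.

```python
# Sketch: magnitude-based localization of task vectors, then stitching with
# equal averaging wherever masks overlap.
import torch

def localize(task_vector: torch.Tensor, sparsity: float = 0.01) -> torch.Tensor:
    """Binary mask selecting the top `sparsity` fraction of entries by magnitude."""
    n = task_vector.numel()
    k = max(1, int(sparsity * n))
    threshold = task_vector.abs().flatten().kthvalue(n - k + 1).values
    return (task_vector.abs() >= threshold).float()

def stitch(theta_pre: torch.Tensor, task_vectors, masks) -> torch.Tensor:
    """theta_merged = theta_pre + sum_k m_k * tau_k / max(1, sum_j m_j)."""
    overlap = torch.stack(masks).sum(dim=0).clamp(min=1.0)  # equal averaging in intersections
    update = sum(m * tv for m, tv in zip(masks, task_vectors)) / overlap
    return theta_pre + update

# toy example with three "experts" fine-tuned from a shared initialization
torch.manual_seed(0)
theta_pre = torch.randn(1000)
task_vectors = [torch.randn(1000) * 0.1 for _ in range(3)]
masks = [localize(tv, sparsity=0.01) for tv in task_vectors]

theta_merged = stitch(theta_pre, task_vectors, masks)
print("params touched per expert:", [int(m.sum()) for m in masks])
print("total changed entries:", int((theta_merged != theta_pre).sum()))
```

Because each mask touches only a small, mostly disjoint set of coordinates, new experts can later be stitched in without revisiting the already-merged ones.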
4. Clustering, Subspace Alignment, and Output Merging
Several recent approaches employ clustering and subspace methods to address parameter conflicts and enhance functional coherence among merged experts:
- Hierarchical Clustering (HC-SMoE): Experts are embedded by their average output activations over a calibration set and grouped by cosine distance; each cluster is merged by weighted averaging or dominant center selection. Empirical results on large MoEs (Mixtral, Qwen) show that memory can be halved at a cost of only 3–7% loss in average accuracy, outperforming one-shot pruning or static merges (Chen et al., 11 Oct 2024).
- Subspace Expert Merging (Sub-MoE): K-means clustering is performed on expert output profiles, followed by joint SVD within each cluster to derive a shared left singular (U) basis. The expert-specific right singular factors (V) are merged via frequency-weighted averaging, reconstructing each "super-expert" as $W_{\text{merged}} = U\,\Sigma\,\bar{V}^{\top}$ (a combined clustering-and-subspace sketch appears at the end of this subsection). This approach significantly reduces parameter conflicts and maintains 85–97% of zero-shot accuracy under 25–50% expert reduction, outperforming frequency- or output-based pruning baselines (Li et al., 29 Jun 2025).
- Output Aggregation (MergeMoE): The combined output of a group of experts is approximated by merged experts composed through block-structured aggregation matrices B and A, which encode expert-to-cluster assignments and usage frequency. This optimization-based formulation yields closed-form solutions for merging and preserves the original router sparsity, resulting in strong empirical performance at high compression ratios (Miao et al., 16 Oct 2025).
These approaches highlight that output-level functional alignment and shared latent subspaces are crucial for high-fidelity sparse merging in the presence of expert specialization.
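The sketch below combines the first two ideas from the list above: experts are clustered by their average output activations, and each cluster is merged in a shared left-singular subspace with frequency-weighted right factors. The crude k-means loop, full-rank SVD, and weighting scheme are illustrative assumptions rather than the exact HC-SMoE or Sub-MoE procedures.

```python
# Sketch: cluster experts by output profiles, then merge each cluster via a
# joint SVD with a shared U basis and a frequency-weighted average of V factors.
import torch

torch.manual_seed(0)
d_in, d_out, n_experts = 64, 64, 8
experts = [torch.randn(d_out, d_in) for _ in range(n_experts)]   # expert weight matrices
usage_freq = torch.rand(n_experts)                               # router usage stats (assumed given)
x_calib = torch.randn(512, d_in)                                 # small calibration batch

# 1) Embed each expert by its average output activation; cluster by cosine similarity.
profiles = torch.stack([(x_calib @ W.T).mean(dim=0) for W in experts])
profiles = torch.nn.functional.normalize(profiles, dim=1)
n_clusters = 4
centers = profiles[torch.randperm(n_experts)[:n_clusters]].clone()
for _ in range(20):                                              # crude spherical k-means
    assign = (profiles @ centers.T).argmax(dim=1)
    for c in range(n_clusters):
        if (assign == c).any():
            centers[c] = torch.nn.functional.normalize(profiles[assign == c].mean(dim=0), dim=0)

# 2) Joint SVD of the horizontally stacked cluster weights -> shared U; merge V blocks.
merged = {}
for c in range(n_clusters):
    idx = (assign == c).nonzero(as_tuple=True)[0]
    if len(idx) == 0:
        continue
    stacked = torch.cat([experts[i] for i in idx], dim=1)        # (d_out, d_in * m)
    U, S, Vh = torch.linalg.svd(stacked, full_matrices=False)
    blocks = Vh.split(d_in, dim=1)                               # per-expert right factors
    w = usage_freq[idx] / usage_freq[idx].sum()
    Vh_bar = sum(wi * b for wi, b in zip(w, blocks))
    merged[c] = U @ (S[:, None] * Vh_bar)                        # one "super-expert" per cluster

print({c: tuple(W.shape) for c, W in merged.items()})
```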
5. Elementwise and Bit-Packed Sparse Merging for Efficient MoE Compression
Elementwise sparse merging is exemplified by PuzzleMoE (Zhao et al., 6 Nov 2025), which operates as follows:
- Elementwise similarity masking: For each pair of experts $(E_i, E_j)$ to be merged, a per-weight mask identifies entries sufficiently close in magnitude (redundant) and averages them; otherwise, a saliency metric selects the more important entry. Dual binary masks record the expert-shared and expert-specific indices.
- Hardware-optimized bit-packing: All masking and sign information is encoded into unused exponent bits of BFloat16, permitting zero-overhead storage and real-time decoding by custom CUDA kernels at inference.
Empirically, at 50% expert merging (e.g., Mixtral-8×7B), PuzzleMoE outperforms both expert dropping and coarse merge methods by up to 16.7 points on MMLU, achieving near-parity with the original accuracy and enabling up to 1.28× inference speedup (Zhao et al., 6 Nov 2025). This demonstrates the efficacy of fine-grained merging for reducing memory and compute costs without loss of capacity.
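The following sketch shows the elementwise dual-mask step for a single pair of expert weight matrices. The relative closeness threshold and the magnitude-based saliency proxy are assumptions for illustration; the BFloat16 bit-packing and custom CUDA decode kernels are omitted.

```python
# Sketch: merge two expert weight matrices entrywise, averaging redundant entries
# and keeping the more salient entry elsewhere, with dual masks recording the choice.
import torch

def pairwise_sparse_merge(W_a: torch.Tensor, W_b: torch.Tensor, rel_tol: float = 0.1):
    # shared mask: entries whose values nearly coincide are treated as redundant
    close = (W_a - W_b).abs() <= rel_tol * torch.maximum(W_a.abs(), W_b.abs())
    # saliency proxy: for disagreeing entries, keep the larger-magnitude weight
    keep_a = (~close) & (W_a.abs() >= W_b.abs())
    keep_b = (~close) & ~keep_a
    merged = torch.where(close, 0.5 * (W_a + W_b),
                         torch.where(keep_a, W_a, W_b))
    # dual binary masks let each original expert be approximately reconstructed at inference
    return merged, close, keep_a, keep_b

torch.manual_seed(0)
W_a = torch.randn(256, 256)
W_b = W_a + 0.02 * torch.randn(256, 256)          # two partially redundant experts
merged, shared, only_a, only_b = pairwise_sparse_merge(W_a, W_b)
print(f"shared: {shared.float().mean().item():.2%}, "
      f"A-specific: {only_a.float().mean().item():.2%}, "
      f"B-specific: {only_b.float().mean().item():.2%}")
```

In the full method, the masks and sign bits are packed into unused BFloat16 exponent bits so the merged expert occupies no extra memory beyond a single weight matrix.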
6. Optimization-Based, Adapter, and Online Sparse Expert Merging
- PSO-Merging: Model merging is recast as Particle Swarm Optimization over full parameter vectors, where the initial swarm contains both the dense experts and random sparse (dropout) projections of them. Parameter updates follow classical PSO rules, and the global best particle after a fixed number of steps becomes the merged model. PSO-Merging offers efficient, scalable, and memory-light merging, consistently outperforming gradient-based and heuristic baselines on multitask LLM evaluations (Zhang et al., 27 Aug 2025); a toy sketch of the procedure appears after this list.
- Sparse Adapter Merging: Merging task adapters with hard pruning masks and elementwise normalizers avoids destructive interference. Max Connection Sensitivity is employed for mask selection, and adapters are merged by summation normalized by overlap count. This approach yields superior in-distribution performance post-merging relative to LoRA or full-model merging and is robust to merging up to 20 experts (Arnob et al., 9 Jul 2025).
- Online Task-Aware Merging (Tanbr): For online inference in SMoE architectures, a tree-structured neural bandit guides merging weight selection in a high-dimensional simplex, with performance feedback derived from aggregate task distributions over time. Tanbr achieves up to 78% memory reduction and 45–65% inference speedup, while converging to within 1–2% of full MoE accuracy in dynamic task settings (Han et al., 24 Sep 2025).
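As referenced above, here is a toy sketch of PSO-style merging over flat parameter vectors. The linear "experts", the calibration fitness, the dropout rate, and the PSO hyperparameters are all illustrative assumptions.

```python
# Sketch: classical PSO over flattened parameters; the swarm is seeded with the
# dense experts plus random sparse (dropout) projections of each.
import torch

torch.manual_seed(0)
d, n_experts = 64, 4
experts = [torch.randn(d) for _ in range(n_experts)]            # flattened expert parameters
X = torch.randn(256, d)                                         # unlabeled calibration inputs

def fitness(theta: torch.Tensor) -> float:
    # lower is better: how far the candidate's outputs drift from each expert's
    return sum(((X @ theta - X @ w) ** 2).mean().item() for w in experts)

particles = [w.clone() for w in experts]
particles += [w * (torch.rand(d) > 0.5).float() for w in experts]   # sparse projections
velocities = [torch.zeros(d) for _ in particles]

p_best = [p.clone() for p in particles]
p_best_f = [fitness(p) for p in particles]
g_best = p_best[min(range(len(p_best)), key=lambda i: p_best_f[i])].clone()

inertia, c1, c2 = 0.7, 1.5, 1.5                                 # classical PSO coefficients
for step in range(100):
    for i, p in enumerate(particles):
        r1, r2 = torch.rand(d), torch.rand(d)
        velocities[i] = (inertia * velocities[i]
                         + c1 * r1 * (p_best[i] - p)
                         + c2 * r2 * (g_best - p))
        particles[i] = p + velocities[i]
        f = fitness(particles[i])
        if f < p_best_f[i]:
            p_best[i], p_best_f[i] = particles[i].clone(), f
            if f < fitness(g_best):
                g_best = particles[i].clone()

print("merged-model fitness:", fitness(g_best))
```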
7. Game-Theoretic and Principled Sparse Expert Merging
Game-theoretic formulations, such as Nash Bargaining (NAMEx), introduce a principled weighting mechanism for sparse expert merging by balancing cooperative and competitive dynamics among experts (Nguyen et al., 17 Oct 2025). The Nash solution maximizes the product of expert utilities, leading to optimal linear weights for domain-vectors and providing stability against dominated or adversarial merges. A complex momentum extension accelerates expert propagation and ensures fast, contractive convergence with theoretical guarantees. NAMEx consistently improves perplexity, accuracy, and robustness over heuristic and curvature-aware merging baselines across diverse language and vision tasks, and scales efficiently to large MoE systems such as Qwen1.5-MoE and DeepSeek-MoE.
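As a rough illustration of the bargaining view (not the NAMEx algorithm itself), the sketch below maximizes the product of expert utilities over simplex-constrained merging weights. The utility definition used here—the projection of the merged update onto each expert's task vector, with a zero disagreement point—and the projected-gradient loop are assumptions for the sake of a runnable example.

```python
# Sketch: Nash-bargaining-style weight selection for merging task vectors by
# maximizing the sum of log utilities over the probability simplex.
import torch

torch.manual_seed(0)
d, n_experts = 512, 4
tau = torch.randn(n_experts, d)                     # expert task vectors (theta_k - theta_pre)

lam = torch.full((n_experts,), 1.0 / n_experts, requires_grad=True)
opt = torch.optim.Adam([lam], lr=5e-2)

for step in range(300):
    opt.zero_grad()
    merged = lam @ tau                              # merged update direction
    utilities = (tau @ merged).clamp(min=1e-6)      # u_k = <tau_k, merged>, kept positive
    nash_objective = torch.log(utilities).sum()     # maximize the product of utilities
    (-nash_objective).backward()
    opt.step()
    with torch.no_grad():                           # project back onto the simplex
        lam.clamp_(min=1e-6)
        lam.div_(lam.sum())

print("Nash merging weights:", lam.detach())
```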
Sparse expert merging thus constitutes a diverse set of techniques—ranging from mask-based composition, subspace alignment, output clustering, and combinatorial optimization to game-theoretic combination—that enable scalable expert integration, modular upcycling, and practical deployment of large MoE architectures. These methods empirically address the trade-offs between memory, compute, and specialization, providing robust solutions that approach or surpass the performance of full, uncompressed expert ensembles (Zhang et al., 30 Sep 2025, He et al., 24 Aug 2024, Chen et al., 11 Oct 2024, Li et al., 29 Jun 2025, Zhao et al., 6 Nov 2025, Arnob et al., 9 Jul 2025, Zhang et al., 27 Aug 2025, Miao et al., 16 Oct 2025, Han et al., 24 Sep 2025, Nguyen et al., 17 Oct 2025).