
LLM Merging Techniques Overview

Updated 30 July 2025
  • LLM merging is a process for combining independently fine-tuned language models using parameter-space techniques to retain and integrate domain-specific skills.
  • It leverages methods such as linear averaging, geometric interpolation, and sparsity-driven masking to balance conflicting updates and preserve performance.
  • Advanced merging frameworks facilitate rapid capability expansion, modular customization, and enhanced safety alignment in multi-domain NLP systems.

LLM merging encompasses a diverse suite of parameter-space and function-space methods for combining pretrained or fine-tuned models to yield a unified neural network with aggregate capabilities. Unlike multi-task training, which requires joint access to all domains and data, merging leverages existing specialized models—each potentially trained on different objectives, orderings, or modalities—while often bypassing full retraining. This paradigm addresses demands for modularity, computational efficiency, rapid capability expansion, and customizable alignment (helpfulness, honesty, harmlessness, or safety) within large-scale NLP systems. Recent years have seen the rise of both training-free and hybrid strategies, spanning linear, nonlinear, sparsity-aware, geometric, data-driven, and search-based approaches. Evaluations reveal critical trade-offs regarding knowledge retention, interference, forgetting, generalization, and system cost.

1. Foundations and Motivations in LLM Merging

LLM merging formalizes the process of aggregating multiple (often independently fine-tuned) LLMs into a single multi-domain model via parameter manipulation, without access to all underlying training data or requiring full re-training (2505.10833). The canonical arithmetic view expresses the merged parameter tensor as:

$$\theta_{\text{merged}} = \theta_0 + \sum_{i=1}^{n} \lambda_i \,(\theta_i - \theta_0)$$

where $\theta_0$ is the base model, $\theta_i$ is each specialized model, and $\lambda_i$ controls the contribution of each "task vector" (difference vector). This task-arithmetic formulation and its variants underpin many merging pipelines (2505.10833, Liu et al., 28 Mar 2024); a minimal code sketch follows the list below. Contexts motivating model merging include:

  • Avoiding multi-task training cost and data consolidation
  • Retaining or fusing domain expertise (e.g., coding, mathematics, multilinguality, instruction following)
  • Extending generalization and robustness with negligible incremental computation
  • Rapidly updating or customizing models (e.g., for alignment or sequential knowledge editing (Fu et al., 14 Jun 2025))
  • Enabling modular, decentralized LLM development and deployment (Zhang et al., 17 Oct 2024)
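
Returning to the task-arithmetic formulation above, the following is a minimal sketch of the core operation over PyTorch state dicts. It assumes all models share the same architecture and parameter names; the function and variable names (`task_arithmetic_merge`, `base_sd`, `expert_sds`) are illustrative, not an implementation from the cited papers.

```python
# Minimal task-arithmetic merge: theta_merged = theta_0 + sum_i lambda_i * (theta_i - theta_0).
# Assumes all state dicts share identical keys and tensor shapes.
import torch

def task_arithmetic_merge(base_sd: dict, expert_sds: list, lambdas: list) -> dict:
    merged = {}
    for name, theta_0 in base_sd.items():
        delta = torch.zeros_like(theta_0, dtype=torch.float32)
        for sd, lam in zip(expert_sds, lambdas):
            # Task vector: the difference between a specialized model and the base model.
            delta += lam * (sd[name].float() - theta_0.float())
        merged[name] = (theta_0.float() + delta).to(theta_0.dtype)
    return merged
```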

Limitations of baseline parameter averaging—such as catastrophic interference, alignment drift, and domain forgetting—have spurred a proliferation of advanced merging techniques addressing sign conflicts, sparsity, scaling, and contribution estimation.

2. Core Parameter-Space and Algorithmic Merging Approaches

LLM merging methods are categorized by their treatment of task vectors, weighting coefficients, structural compatibility, and conflict resolution strategy:

| Method Type | Principle | Example Algorithms |
|---|---|---|
| Linear / Averaging | Element- or vector-wise mean of weights | Model Soup, Task Arithmetic |
| Geometric / Interpolation | Spherical interpolation in parameter space | SLERP, Model Stock |
| Sparsity-Driven / Masking | Drop low-magnitude or conflicting parameters | TIES, DARE, Localize-and-Stitch (2505.10833) |
| Information-Weighted | Weight parameters by Fisher information or regression statistics | Fisher Merging, RegMean |
| Data-Driven / Search-Based | Optimize merging weights or operations via held-out data | Bayesian optimization (Liu et al., 28 Mar 2024), EvoMM, LM-Cocktail |
| Randomized Linear Mixing | Random interpolation ratio (Beta-distributed) | Mixup Model Merge (M³) (Zhou et al., 21 Feb 2025) |
| Hierarchical / Selective | Per-parameter or per-channel selection | Selective Parameter Merging (Ju et al., 1 Oct 2024), Channel Merging (Zhang et al., 18 Dec 2024) |
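
To make two rows of the table concrete, below are minimal sketches of SLERP (geometric interpolation) and Fisher-weighted averaging (information-weighted merging). Applying SLERP tensor-by-tensor and the assumption that diagonal Fisher estimates (`fishers`) have been precomputed, e.g. from squared gradients on a calibration set, are choices of this illustration rather than prescriptions from the cited methods.

```python
import torch

def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    # Spherical linear interpolation between two weight tensors; falls back to
    # plain linear interpolation when the vectors are nearly colinear.
    v0, v1 = w0.flatten().float(), w1.flatten().float()
    cos_omega = (torch.dot(v0, v1) / (v0.norm() * v1.norm() + eps)).clamp(-1.0, 1.0)
    omega = torch.arccos(cos_omega)
    if omega.abs() < 1e-4:
        out = (1 - t) * v0 + t * v1
    else:
        sin_omega = torch.sin(omega)
        out = (torch.sin((1 - t) * omega) / sin_omega) * v0 \
            + (torch.sin(t * omega) / sin_omega) * v1
    return out.view_as(w0).to(w0.dtype)

def fisher_merge(state_dicts: list, fishers: list, eps: float = 1e-8) -> dict:
    # Fisher-weighted averaging: each parameter is averaged with weights
    # proportional to its (diagonal) Fisher information estimate.
    merged = {}
    for name in state_dicts[0]:
        num = sum(f[name].float() * sd[name].float() for sd, f in zip(state_dicts, fishers))
        den = sum(f[name].float() for f in fishers) + eps
        merged[name] = (num / den).to(state_dicts[0][name].dtype)
    return merged
```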

Key advances include:

  • Sparsification (TIES, DARE) to mask out parameters causing interference
  • Sign-aware and magnitude-aware pruning (e.g., TIES: retain parameter updates with sign consensus; DARE: random dropout and rescale; see the sketch after this list)
  • Data-driven procedures to set λi\lambda_i via validation-based search or multi-objective optimization (Bayesian optimization (Liu et al., 28 Mar 2024), SMAC (Su et al., 6 Feb 2025))
  • Novel contribution and saliency metrics (OBIM (Wang et al., 17 Feb 2025), Hi-Merging (Fu et al., 14 Jun 2025))
  • Channel or neuron-level matching to minimize parameter conflict and storage (Channel Merging (Zhang et al., 18 Dec 2024); DOTResize (Verma et al., 6 Jul 2025))
  • Specialized merging for fine-tuned (FT) and pre-trained (PT) model mixing using weight disentanglement, e.g., WIDEN (Yu et al., 6 Aug 2024)
  • Automated and multi-fidelity merging frameworks leveraging layerwise and depth-wise search (Su et al., 6 Feb 2025)

Several methods also account for practical realities including dynamic expert selection, uncertainty-aware routing, and memory efficiency (Mediator (Lai et al., 6 Feb 2025), MergeME (Zhou et al., 3 Feb 2025)).

3. Domain Expansion, Specialization, and Alignment via Merging

Model merging enables both capability expansion and tailored alignment across a spectrum of objectives:

  • Multitask and Multilingual Fusion: Merging models specialized for distinct languages or tasks enables task retention and cross-pollination, often surpassing naively fine-tuned or multi-task-trained models (Yu et al., 6 Aug 2024, Fu et al., 14 Jun 2025). Notably, WIDEN successfully combines pre-trained and fine-tuned models (covering instruction following and multilingual abilities) via weight disentanglement and adaptive fusion.
  • Mixture-of-Experts (MoE) Merging: Sophisticated merging and routing strategies mitigate parameter interference in MoE architectures, reducing the fine-tuning cost inherent in unweighted averaging (Zhou et al., 3 Feb 2025).
  • Safety and Alignment: Merging can propagate safety misalignment from one “bad” expert model to the unified model. Explicit inclusion of synthetic safety data and multi-objective optimization in merging loss functions yields merged LLMs with balanced safety and expertise (Hammoud et al., 20 Jun 2024).
  • 3H Optimization: For the helpfulness-honesty-harmlessness (3H) triad, advanced model merging (RESM), incorporating outlier-aware singular value thresholding and sparsity-rank adaptation, provides robust improvements over data-mixture or naive parameter-level methods (Yang et al., 8 Feb 2025).
  • Knowledge Editing and Continual Updates: Two-stage merging frameworks recover general capabilities lost during robust supervised knowledge editing, by merging the edited and base model with weighted/thresholded delta retention (Fu et al., 14 Jun 2025).
  • Instruction Tuning & Data Synthesis: LLM-based merging of instruction examples (e.g., MergeIT (Cai et al., 25 Feb 2025)) replaces expensive LLM-based selection, yielding compact yet diverse datasets for fine-tuning.

Merging thus acts as a flexible tool for domain expansion, task retention, and safety or value alignment, often outperforming ensemble or multi-task baselines in resource-constrained or data-isolated settings.

4. Conflict Mitigation, Sparsification, and Selective Retention

Parameter conflicts and knowledge forgetting are central obstacles in LLM merging. Strategies for their mitigation include:

  • Selective Parameter Merging: Rather than averaging, random or saliency-weighted per-dimension selection preserves task- or order-dependent information (parameter-selection merging (Ju et al., 1 Oct 2024), OBIM (Wang et al., 17 Feb 2025)).
  • Sparsification and Denoising: Dropping insignificant or noisy task vector contributions before aggregation reduces destructive interference, especially critical when integrating diverse fine-tuned delta vectors (TIES, DARE, Hi-Merging (Fu et al., 14 Jun 2025), Channel Merging).
  • Layerwise and Channel-Level Adaptation: Channel Merging clusters similar channels across experts for per-channel parameter group merging, maximizing storage efficiency and task specialization retention (Zhang et al., 18 Dec 2024). DOTResize performs neuron merging via entropic, optimal transport-based aggregation, outperforming l2-norm or PCA-based pruning (Verma et al., 6 Jul 2025).
  • Saliency and Contribution-Based Masking: OBIM uses Taylor-derived loss-increase metrics to retain only functionally important task-vector components (saliency $= \tfrac{1}{2} h_{ii}\,\delta_i^2$); Hi-Merging applies per-layer contribution analysis (the effect on task performance of adding or removing each layer's delta) to guide both pruning and rescaling decisions (see the sketch after this list).
  • Hybrid Approaches: Mediator combines per-layer conflict measurement (by sign disagreement rate) with dynamic averaging or task-level expert routing, storing only sparse modifier deltas to lower runtime cost and memory (Lai et al., 6 Feb 2025).
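
As referenced above, here is a hedged sketch of saliency-based masking in the spirit of OBIM. The diagonal curvature estimate `hessian_diag` (the $h_{ii}$ term) is treated as a given input, e.g. approximated by squared gradients on calibration data, and the keep ratio is an illustrative hyperparameter, not a value from the paper.

```python
import torch

def saliency_mask(delta: torch.Tensor, hessian_diag: torch.Tensor,
                  keep_ratio: float = 0.2) -> torch.Tensor:
    # Score each task-vector entry by the estimated loss increase if it were
    # removed (0.5 * h_ii * delta_i^2) and keep only the top fraction.
    saliency = 0.5 * hessian_diag * delta.pow(2)
    k = max(1, int(keep_ratio * saliency.numel()))
    threshold = saliency.flatten().topk(k).values.min()
    return delta * (saliency >= threshold)
```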

These approaches commonly reduce performance loss, catastrophic forgetting, and storage overhead compared to simple averaging or monolithic retraining.

5. Automated, Multi-Objective, and Large-Scale Model Merging

Automated frameworks generalize merging to accommodate a wide range of tasks, objectives, and architectural configurations:

  • Multi-Fidelity and Search-Based Optimization: Automated frameworks like that in (Su et al., 6 Feb 2025) model the search over merge configurations as a hyperparameter optimization problem, employing multi-fidelity evaluations, surrogate models (e.g., Random Forest), and schedulers such as Successive Halving, with single- and multi-objective scalarization (a simplified coefficient-search sketch follows this list).
  • Layerwise Fusion and Depth-wise Integration: Advanced search spaces include Layerwise Fusion (LFS), enabling per-layer method, source, and coefficient selection, as well as Depth-wise Integration (DIS), where the ordering and composition of layers across candidate models is optimized for downstream tasks.
  • Benchmarks and Evaluation Suites: MergeBench (2505.10833) provides a large-scale, multi-domain suite for standardized evaluation across instruction following, mathematics, multilinguality, coding, and safety, assessing methods on multi-task performance, forgetting (retention of general capability), and runtime cost.
  • Guidelines from Large-Scale Empirical Studies: Empirical evidence suggests that merging performs best with strong base models, careful coefficient tuning, and incorporation of sparsification/masking. While sparsification mitigates forgetting, methods such as Localize-and-Stitch or RegMean can achieve both task retention and runtime efficiency.
  • Limitations: Computational overhead for validation-tuned merging scales with model size, and merged models still demonstrate some in-domain performance gap relative to full multi-task-trained counterparts (2505.10833).
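
Below is a deliberately simplified sketch of validation-driven coefficient search as referenced in the first bullet. Real frameworks replace the random sampler with Bayesian optimization, evolutionary search, or multi-fidelity schedulers; `evaluate` and `load_merged_model` are hypothetical callables supplied by the user, not part of any cited framework.

```python
import random

def search_lambdas(base_sd, expert_sds, evaluate, load_merged_model, trials: int = 20):
    # Randomly sample merging coefficients, build each merged model, and keep
    # the configuration with the best held-out score.
    best_score, best_lambdas = float("-inf"), None
    for _ in range(trials):
        lambdas = [random.uniform(0.0, 1.0) for _ in expert_sds]
        # Reuses the task_arithmetic_merge sketch from Section 1.
        merged_sd = task_arithmetic_merge(base_sd, expert_sds, lambdas)
        score = evaluate(load_merged_model(merged_sd))   # e.g. held-out accuracy
        if score > best_score:
            best_score, best_lambdas = score, lambdas
    return best_lambdas, best_score
```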

The emergence of modular, search-based merging lays the groundwork for scalable, data-efficient, and purpose-adaptive LLM deployment.

6. Specialized and Multimodal Merging Paradigms

Emerging research expands LLM merging into multimodal and fine-grained domains:

  • Multimodal Expansion via Parameter Decoupling: MMER (Li et al., 21 May 2025) merges multiple multimodal LLMs (MLLMs) using task vector extraction from each, sparse selection, and binary masks for modality-specific parameter decoupling. This approach retains nearly 99% of original capability for each modality and mitigates catastrophic forgetting during sequential task addition, with all operations performed training-free (see the sketch after this list).
  • Neuron-Level Merging for Compression: DOTResize addresses neuron-level redundancy in LLMs by framing neuron merging as a discrete optimal transport (OT) problem. Soft transport maps project activation patterns across a reduced width, utilizing entropic regularization and QR factorization to ensure compatibility with layer normalization—in contrast to hard pruning or PCA, this preserves the full “signal” while structurally compressing the model (Verma et al., 6 Jul 2025).
  • Instruction Data Merging: MergeIT (Cai et al., 25 Feb 2025) leverages LLM-based instruction merging operators to synthesize compact, diverse instruction datasets, outperforming LLM-based filtering in instruction tuning efficiency and downstream performance.
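
As referenced in the first bullet, the sketch below illustrates the general idea of binary-mask parameter decoupling: keep a per-modality 0/1 mask over each task vector and add only the masked delta back onto shared base weights at load time. The magnitude-based mask construction and function names here are assumptions of this illustration, not the MMER procedure itself.

```python
import torch

def build_mask(delta: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
    # Keep the top `keep_ratio` fraction of task-vector entries by magnitude
    # (illustrative; the actual selection criterion may differ).
    k = max(1, int(keep_ratio * delta.numel()))
    threshold = delta.abs().flatten().topk(k).values.min()
    return delta.abs() >= threshold

def reconstruct_for_modality(theta_0: torch.Tensor, delta: torch.Tensor,
                             mask: torch.Tensor) -> torch.Tensor:
    # Training-free reconstruction: shared base weights plus the masked,
    # modality-specific delta.
    return theta_0 + delta * mask
```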

These techniques extend model merging far beyond mono-modal or parameter-averaging frameworks, enabling modular fusion even across modalities and architectures.

7. Future Directions, Challenges, and Research Opportunities

Recent studies highlight several open challenges and fertile areas for further development:

  • Extension to Heterogeneous Backbones: Many merging methods presuppose identical architectures and initializations; generalizing to fully heterogeneous backbones, differing layer counts, or multimodal models remains ongoing work (Yu et al., 6 Aug 2024, Zhou et al., 3 Feb 2025).
  • Validation Cost and Scalability: Hyperparameter or coefficient tuning introduces significant computational cost, especially for large models. More sample-efficient, data-free, or proxy-based techniques warrant further pursuit.
  • Continual and Test-Time Adaptivity: Extending frameworks (e.g., RESM, MMER) to support continual, in-situ, or context-aware merging (adapted at test time or during deployment) could further reduce retraining and improve responsiveness to dynamic requirements.
  • Conflict Resolution and Alignment: Hierarchical or dynamic conflict resolution to manage intra- and inter-objective interference (such as the 3H triad, domain/language balancing) is an active area of methodology innovation (Yang et al., 8 Feb 2025).
  • Real-World Deployment and Modularity: Applying merging in production systems, decentralized development, and on-device/mobile usage (with neuron-level compression or modular routing) remains underexplored, despite promising empirical results for memory efficiency and compositionality.
  • Theory and Interpretability: Foundational analysis (e.g., the loss landscape of merged models, tangent space formulations, effects of parameter geometry) and interpretability of merged parameter spaces are open theoretical fronts.

Continued research is expected to focus on adaptive, modular, and task-aware strategies for merging, the development of efficient large-scale validation and search, and the extension of model merging principles to multimodal, multi-domain, and decentralized AI architectures.
