
Large Language Model Merging

Updated 13 March 2026
  • Large language model merging is the process of combining fine-tuned LLM checkpoints using techniques like weight averaging and gradient-based optimization to enable multi-task functionality.
  • Empirical findings show that linear mode connectivity allows effective arithmetic blending of task-specific deltas, yielding reliable performance improvements and reduced retraining costs.
  • Advanced strategies, including sensitivity-guided and subspace methods, mitigate interference and scale multi-expert models across diverse applications.

LLM merging is the process by which two or more LLMs, typically fine-tuned for distinct tasks or domains but sharing a backbone architecture, are combined into a single unified parameter set capable of multi-domain or multi-task inference—without requiring full retraining. The recent proliferation of specialized LLM variants and increasing focus on modular AI systems have spurred extensive methodological innovation, scaling studies, and a diverse ecosystem of tools and benchmarks. The theoretical and practical landscape of LLM merging is characterized by a spectrum of approaches, from simple weight averaging to advanced sensitivity-guided methods and gradient-based coefficient optimization, all grounded in the empirical finding that many LLM fine-tuning trajectories lie within a connected, low-loss basin of the parameter space (Song et al., 10 Mar 2026, Hitit et al., 26 Nov 2025).

1. Theoretical and Empirical Foundations

The foundational insight for model merging is the linear mode connectivity of neural network loss landscapes: fine-tuned LLMs initialized from the same checkpoint often admit low-loss interpolation paths in parameter space, enabling the arithmetic blending of weights to yield viable merged solutions (Song et al., 10 Mar 2026, Hitit et al., 26 Nov 2025). Formally, if $\theta_0$ is a pretrained base and $\theta_1, \ldots, \theta_k$ are fine-tuned variants, task "delta" vectors are computed as

$\delta_i = \theta_i - \theta_0$

and merged weights by

$\theta_\text{merged} = \theta_0 + \sum_{i=1}^{k} \lambda_i \delta_i$

where $\lambda_i$ are mixing coefficients.
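The delta computation and merge above can be sketched in a few lines of NumPy. This is an illustrative toy, not any paper's implementation; parameter names and shapes are hypothetical:

```python
import numpy as np

def merge_task_arithmetic(base, experts, lambdas):
    """Merge fine-tuned checkpoints via task arithmetic.

    base    : dict mapping parameter name -> np.ndarray (pretrained theta_0)
    experts : list of dicts with the same keys (fine-tuned theta_i)
    lambdas : list of mixing coefficients lambda_i
    """
    merged = {}
    for name, theta0 in base.items():
        # delta_i = theta_i - theta_0
        deltas = [expert[name] - theta0 for expert in experts]
        # theta_merged = theta_0 + sum_i lambda_i * delta_i
        merged[name] = theta0 + sum(lam * d for lam, d in zip(lambdas, deltas))
    return merged

# Toy example: two "experts" perturbing a 2-parameter base.
base = {"w": np.array([1.0, 1.0])}
experts = [{"w": np.array([2.0, 1.0])},   # delta = [1, 0]
           {"w": np.array([1.0, 3.0])}]   # delta = [0, 2]
merged = merge_task_arithmetic(base, experts, lambdas=[0.5, 0.5])
# merged["w"] is [1.5, 2.0]: each task's contribution, half-weighted
```

Uniform task arithmetic corresponds to setting all $\lambda_i$ equal (often $\lambda_i = 1$ or $1/k$); the more elaborate methods surveyed below differ mainly in how they choose, sparsify, or learn these coefficients.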

Mode connectivity manifests as empirically observed flat or nearly monotonic loss along these linear or geodesic paths in weight space (Song et al., 10 Mar 2026). This underpins the reliability of “Task Arithmetic” (uniformly summing “task deltas”) for LLMs—a result validated at both the small-model and multi-billion-parameter scale (Hitit et al., 26 Nov 2025).

Scaling laws further quantify merging returns: merging $k$ expert models into a base of capacity $C$ yields loss

$L(k, C) = L_\infty(C) + \dfrac{A(C)}{k + b}$

with $L_\infty(C)$ the irreducible floor that decreases as a power law in $C$, and a $1/(k+b)$ tail expressing diminishing returns from additional experts. These laws are robust across merging methods, model families, and task heterogeneity (Wang et al., 29 Sep 2025).
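The floor-plus-tail form is easy to explore numerically. The coefficients below are hypothetical placeholders, not fitted values from the cited work:

```python
def merge_scaling_loss(k, L_inf, A, b):
    """Predicted loss after merging k experts: L(k) = L_inf + A / (k + b)."""
    return L_inf + A / (k + b)

# Hypothetical coefficients for illustration only.
L_inf, A, b = 2.0, 0.5, 1.0
gains = [merge_scaling_loss(k, L_inf, A, b) for k in (1, 2, 5, 10)]
# Loss decreases monotonically toward the floor L_inf, but the
# improvement from 1 -> 2 experts dwarfs that from 5 -> 10.
```

Fitting $(L_\infty, A, b)$ on a few small merges lets practitioners predict the value of adding further experts and decide when to stop.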

2. Merging Methodologies and Algorithms

LLM merging algorithms span a range of strategies, which can be organized as follows:

  • Task Arithmetic and Averaging: The simplest and most robust, task arithmetic (TA) applies equal weighting to each task’s parameter delta (Song et al., 10 Mar 2026, Hitit et al., 26 Nov 2025). Variants include Fisher-weighted averaging and trajectory (Stochastic Weight Averaging, SWA) approaches.
  • Sparsification and Interference-aware Schemes: To mitigate destructive interference among conflicting task deltas, methods like TIES-Merging (Trim-Elect-Sign) and DARE (Drop-And-REscale) sparsify or mask deltas by magnitude trimming, random drop, and sign consensus (Wang et al., 29 Sep 2025, Hitit et al., 26 Nov 2025). However, large-scale benchmarking finds these approaches often degrade performance compared to TA when many diverse, strongly fine-tuned checkpoints are merged (Hitit et al., 26 Nov 2025).
  • Subspace, Low-rank, and SVD-based Merging: Singular value decomposition–based methods (TSV-Merge, Iso-C, SB) restrict updates to low-rank or orthogonalized subspaces, motivated by the hypothesis of disjoint task subspaces. These techniques have produced reliable gains in smaller models and classifiers but consistently underperform TA in modern LLMs (Hitit et al., 26 Nov 2025).
  • Sensitivity-guided and Activation-informed Merging: Recent advancements use parameter gradient sensitivities (Liu et al., 18 Feb 2025) or activation-space statistics (Nobari et al., 4 Feb 2025) to assign layer/task-specific mixing coefficients, preserving critical weights and improving task retention, especially when combined with classical delta-based methods.
  • Gradient-based Merge Coefficient Optimization: SuperMerge introduces layer/task-specific, learnable merging weights, updating only $\mathcal{O}(k \cdot n_\text{layers})$ parameters using validation sets (Yang et al., 2024). This method achieves merged accuracy rivaling fully fine-tuned and multi-task models at minimal computational cost.
  • Distribution-based and Output-level Merging: The MoD (Mixture of Distributions) framework merges models at the output distribution level, forming a convex mixture of next-token probabilities and linearly combining the logits. This approach is architecture-agnostic and strong on specialist retention, especially in domains like mathematical reasoning (Dang et al., 2024).
  • Cross-Architecture Merging: When merging heterogeneous or multimodal models, mapping schemes (e.g., AdaMMS) or OT-based neuron alignment (Transport & Merge) infer cross-model correspondences and interpolate mapped weights (Du et al., 31 Mar 2025, Cui et al., 5 Feb 2026), enabling knowledge transfer without requiring shared layer structure.
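As a concrete example of the sparsification family, the following minimal NumPy sketch implements a DARE-style drop-and-rescale on a single task delta (illustrative only; real pipelines apply this per tensor before summing deltas):

```python
import numpy as np

def dare_sparsify(delta, drop_rate, rng):
    """DARE-style sparsification: randomly drop a fraction of a task
    delta's entries and rescale the survivors by 1 / (1 - drop_rate),
    which keeps the delta unchanged in expectation."""
    mask = rng.random(delta.shape) >= drop_rate   # keep with prob (1 - drop_rate)
    return delta * mask / (1.0 - drop_rate)

rng = np.random.default_rng(0)
delta = rng.normal(size=100_000)
sparse = dare_sparsify(delta, drop_rate=0.9, rng=rng)
# Roughly 90% of entries are now exactly zero; surviving entries
# are amplified 10x so the summed contribution is unbiased.
```

The rescaling step is what distinguishes DARE from plain magnitude pruning: it preserves the expected merged update even at very high drop rates, though (per the benchmarking cited above) this does not guarantee better downstream accuracy than plain TA.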

The majority of real-world merging is performed at the level of adapter weights (e.g., LoRA), allowing modularity, memory efficiency, and direct deployment in compositional scenarios (Dmonte et al., 22 Jan 2026).
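Adapter-level merging can be illustrated with a minimal NumPy sketch for one linear layer. Shapes and names are hypothetical, and LoRA scaling factors are assumed folded into the matrices; production tooling (e.g., MergeKit or PEFT) handles per-layer bookkeeping:

```python
import numpy as np

def merge_lora_adapters(adapters, lambdas):
    """Merge several LoRA adapters for one linear layer into a single
    weight delta. Each adapter contributes delta_i = B_i @ A_i, so the
    merged update is sum_i lambda_i * (B_i @ A_i), which can be added
    to the frozen base weight at deployment time."""
    d_out = adapters[0]["B"].shape[0]
    d_in = adapters[0]["A"].shape[1]
    delta = np.zeros((d_out, d_in))
    for lam, ad in zip(lambdas, adapters):
        delta += lam * (ad["B"] @ ad["A"])
    return delta

rng = np.random.default_rng(1)
r, d = 4, 16  # two rank-4 adapters on a 16x16 layer
adapters = [{"A": rng.normal(size=(r, d)), "B": rng.normal(size=(d, r))}
            for _ in range(2)]
delta = merge_lora_adapters(adapters, lambdas=[0.5, 0.5])
```

Because each adapter stores only $2rd$ parameters instead of $d^2$, deltas stay cheap to ship and compose, which is what makes the modular deployment scenarios above practical.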

3. Practical Applications and Empirical Outcomes

LLM merging delivers:

  • Multi-task and Multi-domain Generalization: By judiciously merging checkpoints fine-tuned for distinct capabilities (instruction, code, math, multilingual, safety), LLMs can approach or surpass multi-task fine-tuned baselines without additional gradient steps (Fu et al., 14 Jun 2025, Hitit et al., 26 Nov 2025).
  • Alignment and Safety: Parameter-level merging of alignment-tuned models (helpfulness, honesty, harmlessness) achieves superior trade-offs compared to data-level mixture, particularly under conflicting objectives. The RESM algorithm further enhances balanced alignment via outlier-aware SVD weighting and sparsity-adaptive truncation (Yang et al., 8 Feb 2025).
  • Efficiency and Maintenance: Merging substantially reduces retraining and maintenance costs in multilingual and multi-task setups. For example, adapter-based language-merge pipelines cut initial fine-tuning time by up to 50% and maintenance time/cost by 60–70% compared to full retraining (Dmonte et al., 22 Jan 2026).
  • Pretraining and Model Recovery: Checkpoint merging at the pretraining stage (PMA) can emulate the effects of cosine learning-rate decay, stabilize optimization, and recover from loss spikes, resulting in higher final test accuracy and faster tuning (Li et al., 17 May 2025).
  • Personality and Attribute Modulation: Weight-space arithmetic over “personality vectors” extracted by trait-specific fine-tuning grants continuous, compositional, and cross-domain control over LLM personality traits (Sun et al., 24 Sep 2025).
  • Compression and On-the-fly Routing: 1bit-Merging fuses dynamic routing with bitwise-compressed task vectors, balancing performance and storage efficiency for memory-limited deployments (Liu et al., 15 Feb 2025).
  • Federated and Continual Learning: Model merging underpins branch-train-merge strategies, federated aggregation (FedAvg and beyond), and streaming incremental expert integration (Song et al., 10 Mar 2026).

4. Limitations, Scaling Laws, and Failure Modes

Extensive empirical studies reveal that, for modern LLMs:

  • Task Arithmetic is Uniquely Reliable: Only the oldest and simplest delta arithmetic reliably yields merged models that consistently outperform both the base and individual experts as the number of merged checkpoints grows (Hitit et al., 26 Nov 2025). More complex methods often damage performance, especially when merging highly divergent or strongly fine-tuned models.
  • Law of Diminishing Returns: Merging gains rapidly saturate—most of the benefit accrues within the first 5–6 experts, with marginal returns decaying as $1/(k+b)$. Cross-domain and in-domain merges both obey this floor-plus-tail scaling law, which enables predictive planning, budget optimization, and principled stopping criteria (Wang et al., 29 Sep 2025).
  • Model Kinship and Merging Gains: The “kinship” (similarity) between task deltas predicts merge gains: low-kinship (more orthogonal) pairings yield larger gains and enable escape from local optima in iterative, top-$k$ greedy merges. High-kinship merges quickly saturate and further merges add negligible value (Hu et al., 2024).
  • Interference and Over-pruning: Aggressive sparsification, subspace, or orthogonalization methods—premised on strong independence assumptions—fail when actual task deltas overlap in substantial parameter subspaces. Monitoring the $\ell_2$ displacement of merged weights from the base is critical to ensure merges remain within the model’s low-loss region (Hitit et al., 26 Nov 2025).
  • Complex, Strongly Fine-tuned Deltas: When deltas have large, module-specific amplitudes (common with strong domain specialists), dynamic pruning and amplification (DPPA) can yield superior fusions at high sparsity (Zhu et al., 2024).

5. Benchmarks, Ecosystem, and Best Practices

A mature merging ecosystem includes:

  • Toolkits: MergeKit (Arcee), FusionBench, and similar libraries standardize delta arithmetic, sparsification, and evaluation pipelines, lowering barriers for practical merging (Song et al., 10 Mar 2026).
  • Benchmarks: Platforms such as the Open LLM Leaderboard and standardized suites (MMLU, GSM8K, HumanEval, TruthfulQA, FusionBench) allow rigorous assessment of merged model capability, retention rate, and interference matrices (Hitit et al., 26 Nov 2025).
  • Calibration Protocols: Even simple merging benefits from post-hoc validation, adaptive coefficient selection (Sens-Merging, activation-informed merging), and, in complex scenarios, supervised or unsupervised hyperparameter tuning (e.g., AdaMMS, SuperMerge) (Liu et al., 18 Feb 2025, Du et al., 31 Mar 2025, Yang et al., 2024).
  • Guidelines: Empirically, practitioners are advised to default to task arithmetic with modest delta norms, avoid merging highly similar or highly divergent models without calibration, and to exploit activation-space or sensitivity signals for challenging merges (Hitit et al., 26 Nov 2025, Nobari et al., 4 Feb 2025).

6. Open Challenges and Future Directions

While significant progress has been achieved, several challenges and research frontiers remain:

  • Theory: General theoretical explanations for the prevalence of linear mode connectivity and the generalization properties of merged billion-parameter LLMs are lacking (Song et al., 10 Mar 2026).
  • Scalability: As model size and the number of experts increase (>100B parameters, $k > 10$), both search and alignment costs rise super-linearly. Advanced approximation techniques, evolutionary search, and automated coefficient learning are required.
  • Heterogeneous and Modular Merging: Cross-architecture and multimodal merging—where models differ in structural details—are active areas, with OT-based correspondence and mapping techniques showing early promise (Cui et al., 5 Feb 2026, Du et al., 31 Mar 2025).
  • Safety and Security: Merging exposes new attack surfaces, such as “merge hijacking” via malicious backdoored models. Detection, certification, and adversarially robust merging objectives remain underexplored.
  • Standardization and Benchmarks: Protocols for interference/error reporting, emergent composite capability, and safety under merging are yet to be standardized.
  • Continual and Federated Merging: Automation of streaming expert integration and merging-aware fine-tuning regimes is an open engineering challenge for ongoing deployments (Song et al., 10 Mar 2026).

In sum, LLM merging recasts the development of advanced LLM systems from monolithic retraining to modular compositional assembly. Its continued evolution will depend on both methodological innovation and deeper understanding of neural loss geometry, interference dynamics, and task transfer in high-dimensional spaces.
