
Merging Knowledge into Models

Updated 18 September 2025
  • Merging knowledge into models is the process of integrating, transferring, and aggregating distributed neural information into a unified parameterized system.
  • It employs techniques such as continuous knowledge bases, parameter-space fusion, and task vector methods to mitigate interference and enhance performance.
  • This approach enables robust multitask learning and continual adaptation by addressing challenges like feature drift and heterogeneous model integration.

Merging knowledge into the model is the process of integrating, transferring, or aggregating knowledge that has been encoded in multiple neural networks or submodels into a unified, parameterized repository or model. This theme encompasses a spectrum of methodologies—from constructing universal continuous repositories that interface with neural networks, to parameter-space fusion of specialized fine-tuned models, to algorithmic procedures that minimize functional, representational, or feature-level drift as knowledge is aggregated. The central objective is to encapsulate the implicit, often continuous, knowledge embedded in several distributed AI systems into a format amenable to reuse, transfer, or unified inference.

1. Continuous Knowledge Bases and the Function Simulation Paradigm

The notion of a Continuous Knowledge Base (CKB) provides a foundation for merging knowledge across multiple neural networks by treating their parameterized function spaces as the carriers of knowledge rather than manipulating explicit symbolic entities. In CKBs, knowledge is represented as a set of learnable real-valued parameters, structured in a memory hierarchy comprising one high-level matrix ($M^h$) and several low-level matrices ($M^{l_1}, \ldots, M^{l_K}$):

$$M = \{ M^h, M^{l_1}, \ldots, M^{l_K} \}$$

For every source neural network $\text{FNN}_\theta(\cdot)$, an interface $\text{Interface}^{(\text{FNN})}_\phi(\cdot, M)$ is constructed, parameterized to mimic the source model's functional behavior on its input-output mapping. Merging knowledge thus becomes a function simulation problem: the CKB coupled with the interface minimizes the discrepancy between its output and that of the original network across the input domain, using an import loss:

$$L_{\text{import}}(D, M, \phi) = \sum_{n=1}^N \Delta\bigl(\text{Interface}^{(\text{FNN})}_\phi(x^{(n)}, M),\; \text{FNN}_\theta(x^{(n)})\bigr)$$

Here, $\Delta$ denotes a discrepancy measure (for example, one based on cosine similarity). Once the knowledge is stored, it can be exported to new networks by minimizing a matching export loss with respect to the target model's parameters.

Merging from multiple neural networks is naturally expressed via parallel import losses for each network-interface pair and a joint optimization over $M$ and all $\phi_i$, capturing and fusing knowledge from diverse tasks and modalities. Empirical results show that the fused knowledge can be exported back to a single model and sometimes exceeds the accuracy of the individual source models, demonstrating effective distillation and knowledge synergy (Chen et al., 2020).
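A minimal sketch of the import step is given below, assuming a single high-level memory matrix and a simple attention-based interface; the class names, shapes, and interface design are illustrative rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class ContinuousKB(nn.Module):
    """Learnable memory; the low-level matrices are omitted in this simplified sketch."""
    def __init__(self, n_slots=32, d_mem=64):
        super().__init__()
        self.M_h = nn.Parameter(0.02 * torch.randn(n_slots, d_mem))  # high-level matrix M^h

class Interface(nn.Module):
    """Interface_phi: maps an input plus the memory to an output imitating one source network."""
    def __init__(self, d_in, d_out, d_mem=64):
        super().__init__()
        self.query = nn.Linear(d_in, d_mem)
        self.readout = nn.Linear(d_mem, d_out)

    def forward(self, x, ckb):
        attn = torch.softmax(self.query(x) @ ckb.M_h.T, dim=-1)  # attend over memory slots
        return self.readout(attn @ ckb.M_h)

def import_loss(interface, ckb, source_net, x):
    """L_import: discrepancy between Interface(x, M) and the frozen source network FNN_theta."""
    with torch.no_grad():
        target = source_net(x)
    pred = interface(x, ckb)
    return (1.0 - torch.cosine_similarity(pred, target, dim=-1)).mean()

# Jointly optimize the memory M and the interface parameters phi on inputs drawn from the
# source network's domain; merging several networks sums one such loss per network.
source_net = nn.Sequential(nn.Linear(16, 10)).eval()  # stand-in for FNN_theta
ckb, iface = ContinuousKB(), Interface(d_in=16, d_out=10)
opt = torch.optim.Adam(list(ckb.parameters()) + list(iface.parameters()), lr=1e-3)
for _ in range(200):
    x = torch.randn(8, 16)
    loss = import_loss(iface, ckb, source_net, x)
    opt.zero_grad(); loss.backward(); opt.step()
```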

2. Parameter-Space Model Fusion

Many approaches to knowledge merging exploit the structure of parameter space, seeking to interpolate, combine, or aggregate models trained with different data or for different tasks. The regression mean (RegMean) method directly seeks the weight matrix $W_M$ that minimizes the summed squared difference between the merged layer's outputs and each source model's outputs on that model's own inputs $X_i$:

$$W_M = \left(\sum_{i \in \mathcal{K}} X_i^T X_i\right)^{-1} \left(\sum_{i \in \mathcal{K}} X_i^T X_i W_i\right)$$

This result extends layer-wise to deep architectures (e.g., Transformers) and, when applied to the main weight matrices, allows dataless knowledge merging: fusing capabilities without access to the original training data, since only the per-layer inner-product statistics are needed. Entrywise averaging is replaced by an input-matrix-weighted fusion that accounts for the data statistics underlying each source model. In practice, this strategy yields merged models that match or surpass the individual source models in domain generalization, outperforms Fisher-weighted and naive averaging baselines, and is computationally efficient (Jin et al., 2022).
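The closed form translates directly into a per-layer routine. The sketch below assumes each source model provides its layer inputs $X_i$ (or, equivalently, the Gram matrices $X_i^T X_i$); the function name and the small ridge term are illustrative.

```python
import numpy as np

def regmean_merge(Xs, Ws, eps=1e-6):
    """W_M = (sum_i X_i^T X_i)^{-1} (sum_i X_i^T X_i W_i) for one linear layer."""
    d_in = Ws[0].shape[0]
    gram_sum = np.zeros((d_in, d_in))
    rhs_sum = np.zeros_like(Ws[0])
    for X, W in zip(Xs, Ws):
        G = X.T @ X                     # Gram (inner-product) matrix of this model's layer inputs
        gram_sum += G
        rhs_sum += G @ W
    # A small ridge term keeps the summed Gram matrix invertible when inputs are rank deficient.
    return np.linalg.solve(gram_sum + eps * np.eye(d_in), rhs_sum)

# Toy usage: two "source models" (single linear layers, weights of shape d_in x d_out).
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(100, 8)), rng.normal(size=(120, 8))
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
W_merged = regmean_merge([X1, X2], [W1, W2])   # shape (8, 4)
```

The routine is applied independently to each linear layer; remaining parameters are typically handled by simple averaging.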

Extending to continual learning, merged parameter updates can be successively accumulated using the same principle: at each task $t$, the accumulated inner-product matrix $P_t$ and the current task's $C_t$ produce a merged parameter

$$\overline{W}_t = (P_t + C_t)^{-1}\bigl(P_t \overline{W}_{t-1} + C_t W_t\bigr)$$

and the accumulation $P_{t+1} = P_t + C_t$ ensures that the merge order is commutative, enabling efficient multi-domain integration in scenarios like vision segmentation (Shu et al., 16 Jul 2025).
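A sketch of this sequential variant under the same assumptions: starting from $P_0 = 0$, the first task's weights are recovered exactly, and later tasks are folded in with data-statistics weighting.

```python
import numpy as np

def continual_merge(P_prev, W_prev, C_t, W_t, eps=1e-6):
    """W_bar_t = (P_t + C_t)^{-1} (P_t W_bar_{t-1} + C_t W_t); returns the merge and the new P."""
    d = C_t.shape[0]
    W_bar = np.linalg.solve(P_prev + C_t + eps * np.eye(d), P_prev @ W_prev + C_t @ W_t)
    return W_bar, P_prev + C_t

# Toy task stream: (layer inputs, task-specific layer weights) per task.
rng = np.random.default_rng(1)
task_stream = [(rng.normal(size=(50, 8)), rng.normal(size=(8, 4))) for _ in range(3)]
P, W_bar = np.zeros((8, 8)), np.zeros((8, 4))
for X_t, W_t in task_stream:
    C_t = X_t.T @ X_t                  # current task's inner-product matrix
    W_bar, P = continual_merge(P, W_bar, C_t, W_t)
```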

3. Task Vector and Trust Region Approaches

An influential line of research represents task-specific knowledge as parameter-space offsets or "task vectors." For a pre-trained model $\theta_{\text{pre}}$ and a task-specific model $\theta_{\text{task}_k}$, the difference vector $T_k = \theta_{\text{task}_k} - \theta_{\text{pre}}$ encodes the knowledge required for the new task. Model merging is then the process:

$$\theta_{\text{MTL}} = \theta_{\text{pre}} + \lambda \sum_k T_k$$

A critical challenge is knowledge conflict: task vectors trained on different data may point in conflicting directions, leading to destructive interference and degraded multi-task performance. Task Arithmetic in Trust Region (TATR) introduces geometric constraints, merging only the components of task vectors that are orthogonal (i.e., non-conflicting) in parameter space. Removal bases or binary masks are constructed to restrict updates to directions that do not cause large loss increases for any task. This restriction to a trust region in parameter space demonstrably reduces negative transfer and enhances joint-task performance (Sun et al., 25 Jan 2025).
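A simplified, hypothetical sketch of masked task arithmetic follows; the sign-agreement mask below is only a stand-in for TATR's trust-region construction, which uses loss-sensitivity information rather than sign agreement.

```python
import numpy as np

def merge_task_vectors(theta_pre, task_thetas, lam=0.3):
    """theta_MTL = theta_pre + lambda * sum_k T_k, keeping only non-conflicting entries."""
    Ts = np.stack([t - theta_pre for t in task_thetas])   # task vectors T_k, shape (K, P)
    signs = np.sign(Ts)
    agree = np.all(signs == signs[0], axis=0)             # entries where every task agrees in sign
    merged_delta = np.where(agree, Ts.sum(axis=0), 0.0)   # mask out conflicting directions
    return theta_pre + lam * merged_delta

rng = np.random.default_rng(2)
theta_pre = rng.normal(size=1000)                         # flattened pre-trained weights
task_models = [theta_pre + 0.01 * rng.normal(size=1000) for _ in range(3)]
theta_mtl = merge_task_vectors(theta_pre, task_models)
```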

The Twin-Merging framework explicitly decomposes model weights into shared and exclusive components using task arithmetic, compresses the exclusive parts with SVD, and dynamically routes the merge according to the input context to minimize interference during inference (Lu et al., 17 Jun 2024).
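A sketch of the shared/exclusive decomposition with SVD compression is shown below, omitting the input-conditioned router; the rank, shapes, and the use of a plain average for the shared part are assumptions for illustration.

```python
import numpy as np

def split_and_compress(W_pre, task_weights, rank=4):
    """Split each task delta into a shared average plus an SVD-compressed exclusive residual."""
    deltas = [W - W_pre for W in task_weights]
    shared = sum(deltas) / len(deltas)                        # shared knowledge (simple average)
    exclusive = []
    for D in deltas:
        U, S, Vt = np.linalg.svd(D - shared, full_matrices=False)
        exclusive.append((U[:, :rank] * S[:rank], Vt[:rank])) # rank-r factors of the residual
    return shared, exclusive

def expert_weights(W_pre, shared, exclusive, k):
    """Reassemble task k's weights from the shared part plus its compressed exclusive part."""
    A, B = exclusive[k]
    return W_pre + shared + A @ B

rng = np.random.default_rng(3)
W_pre = rng.normal(size=(16, 16))
tasks = [W_pre + 0.05 * rng.normal(size=(16, 16)) for _ in range(3)]
shared, exclusive = split_and_compress(W_pre, tasks)
W_task0 = expert_weights(W_pre, shared, exclusive, 0)
```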

4. Joint Representation Alignment: SVD, Decomposition, and Feature Drift

Direct entrywise merging of model parameters assumes functional alignment among corresponding neurons, an assumption violated by neural feature polysemy after fine-tuning. Decom-Renorm-Merge (DRM) and related methods address this by stacking, per layer, the weight-delta matrices from each task and decomposing them via Singular Value Decomposition (SVD), thereby discovering a shared joint space for merging:

$$\Delta W_l^{\text{stack}} = \bigl[\Delta W_l^{(1)}, \ldots, \Delta W_l^{(N)}\bigr] = U \Sigma V^T$$

After splitting $V^T$ into task-specific blocks and renormalizing (critical for ensuring algorithmic stability), merged weights are reconstructed as

$$\Delta W_l^{\text{M}} = U \bigl((\Sigma \tilde{V}^T)^{\text{M}}\bigr)$$

compensating for misaligned feature bases and enabling robust multitask model construction (Chaichana et al., 29 May 2025). LOT Merging further refines this approach by minimizing "feature drift"—the difference in feature representations across task-specific models—on a per-layer basis. Optimization is formulated as an explicit closed-form minimization for each linear, scaling, or bias layer:

$$T_l^{*} = \left(\sum_k X_k^T X_k \right)^{\dagger} \left( \sum_k X_k^T X_k T_k \right)$$

which guarantees minimal mean-squared error in the merged feature representations, is computationally efficient, and achieves higher empirical accuracy than parameter-level or task-loss-based approaches (Sun et al., 29 May 2025).
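A rough sketch of the SVD-based joint-space merge described above: the renormalization and the final merge operator here are simplified readings of the DRM recipe, so the exact details may differ from the paper's procedure.

```python
import numpy as np

def drm_style_merge(deltas):
    """deltas: list of (d_out, d_in) weight-delta matrices for one layer, one per task."""
    stacked = np.concatenate(deltas, axis=1)            # (d_out, N * d_in) stacked deltas
    U, S, Vt = np.linalg.svd(stacked, full_matrices=False)
    d_in = deltas[0].shape[1]
    blocks = [Vt[:, i * d_in:(i + 1) * d_in] for i in range(len(deltas))]
    # Renormalize each task block row-wise so tasks contribute on a comparable scale.
    blocks = [b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8) for b in blocks]
    V_merged = np.mean(blocks, axis=0)                  # merge in the shared joint space
    return U @ (np.diag(S) @ V_merged)                  # reconstructed merged delta

rng = np.random.default_rng(4)
deltas = [0.05 * rng.normal(size=(16, 16)) for _ in range(3)]
delta_merged = drm_style_merge(deltas)                  # added back onto the pre-trained layer
```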

5. Heterogeneity and Modular Interfaces

Integrating knowledge across heterogeneous models, tasks, or modalities necessitates bridging nontrivial structural or domain gaps. MergeNet introduces a parameter adapter $A(\cdot)$ that queries low-rank representations from a source model and projects them into the target space, facilitated by attention-based aggregation modules. Merging is performed during training, adapting over cycles denoted $T_{\text{cycle}}$, and the merged model incurs no additional computation at inference:

$$\theta_{\text{merge}} = \theta_{\text{target}} + A\bigl(\theta_{\text{source}}^{\text{lowrank}}\bigr)$$

This design enables robust knowledge transfer even across deeply divergent networks, providing a blueprint for future heterogeneous AI system integration (Li et al., 20 Apr 2024).
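A hypothetical sketch of the merge step with a low-rank parameter adapter: the adapter architecture (two linear maps over rows and columns) and all shapes are invented for illustration, and in MergeNet the adapter would be trained over the $T_{\text{cycle}}$ cycles rather than applied zero-shot.

```python
import torch
import torch.nn as nn

class ParamAdapter(nn.Module):
    """A(.): projects a low-rank view of the source weights into the target weight space."""
    def __init__(self, rank, src_shape, tgt_shape):
        super().__init__()
        self.rank = rank
        self.row_map = nn.Linear(src_shape[0], tgt_shape[0], bias=False)
        self.col_map = nn.Linear(src_shape[1], tgt_shape[1], bias=False)

    def forward(self, W_source):
        U, S, Vt = torch.linalg.svd(W_source, full_matrices=False)
        low_rank = (U[:, :self.rank] * S[:self.rank]) @ Vt[:self.rank]  # rank-r source parameters
        # Map rows, then columns, of the low-rank source matrix onto the target's shape.
        return self.col_map(self.row_map(low_rank.T).T)

def merged_weights(W_target, W_source, adapter):
    """theta_merge = theta_target + A(theta_source^lowrank)."""
    return W_target + adapter(W_source)

adapter = ParamAdapter(rank=4, src_shape=(32, 64), tgt_shape=(16, 48))
W_src, W_tgt = torch.randn(32, 64), torch.randn(16, 48)
W_merged = merged_weights(W_tgt, W_src, adapter)   # shape (16, 48), used as-is at inference
```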

In the context of LLM ensembles, knowledge fusion explicitly externalizes the output distributions of a diverse set of LLMs, aligning tokens via dynamic programming and fusing output distributions using functions like MinCE or weighted averages. This "output-level" merging sidesteps architectural constraints and is particularly scalable where model-to-model weight matching is infeasible (Wan et al., 19 Jan 2024, Kong et al., 28 May 2025).
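A sketch of output-level fusion on already-aligned token distributions: minimum-cross-entropy selection and fixed-weight averaging stand in for the fusion functions, and the dynamic-programming token alignment is omitted.

```python
import numpy as np

def min_ce_fuse(dists, gold_ids):
    """dists: (K, T, V) per-model next-token distributions; keep the lowest-CE model per position."""
    T = dists.shape[1]
    ce = -np.log(dists[:, np.arange(T), gold_ids] + 1e-12)  # (K, T) cross-entropy per model
    best = ce.argmin(axis=0)                                # winning model index per position
    return dists[best, np.arange(T)]                        # (T, V) fused target distributions

def weighted_fuse(dists, weights):
    """Fixed-weight convex combination of the source distributions."""
    w = np.asarray(weights)[:, None, None]
    return (w * dists).sum(axis=0)

rng = np.random.default_rng(5)
logits = rng.normal(size=(3, 10, 50))                       # 3 models, 10 positions, 50-token vocab
dists = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
gold = rng.integers(0, 50, size=10)
fused = min_ce_fuse(dists, gold)
```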

6. Knowledge Editing, Continual Learning, and the Limits of Merging

Efficient model merging is pivotal for online knowledge editing and continual learning. Robust supervised fine-tuning (R-SFT), followed by model merging via weighted parameter deltas and pruning, enables LLMs to absorb new facts while preserving general-purpose abilities and supporting sequential edits with minimal interference (Fu et al., 14 Jun 2025). In continual learning scenarios, the choice between merging models incrementally along a training trajectory and merging parallel-trained models has a marked impact: incremental merging better consolidates shared general knowledge but both approaches struggle with unshared, exclusively task-specific knowledge, which is rapidly forgotten under naive linear averaging (Hess et al., 31 Jul 2025).
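The "fine-tune, then merge a pruned delta back" pattern can be sketched as follows; the magnitude-pruning criterion, keep ratio, and scaling here are illustrative rather than the exact R-SFT procedure.

```python
import numpy as np

def merge_edit(theta_base, theta_edited, keep_ratio=0.1, scale=1.0):
    """Add a magnitude-pruned, scaled fine-tuning delta back onto the base parameters."""
    delta = theta_edited - theta_base
    k = max(1, int(keep_ratio * delta.size))
    cutoff = np.sort(np.abs(delta).ravel())[-k]              # magnitude of the k-th largest entry
    pruned = np.where(np.abs(delta) >= cutoff, delta, 0.0)   # keep only the largest edits
    return theta_base + scale * pruned

rng = np.random.default_rng(6)
theta_base = rng.normal(size=10_000)
theta_edited = theta_base + 0.02 * rng.normal(size=10_000)   # stand-in for an edited checkpoint
theta_merged = merge_edit(theta_base, theta_edited)
```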

Recent work demonstrates that model selection and dynamic fusion can be enhanced by learning adaptive selectors and weighted fusion networks, thereby further reducing knowledge interference as the number of source models scales (Kong et al., 28 May 2025).

7. Implications, Challenges, and Future Directions

The surveyed methodologies show that merging knowledge into models is an evolving field that balances effective knowledge transfer, minimization of interference, computational efficiency, and practical constraints (such as datalessness or scalability). Robustification of the merging process (e.g., through SVD-based decomposition, representation alignment, or adaptive weighting) is essential for high-fidelity multitask deployment and continual adaptation.

Challenges remain in ensuring robust integration of technical vocabulary or specialist domain knowledge, particularly in cross-lingual or cross-modal scenarios where simple parameter averaging is insufficient (Rousset et al., 17 Feb 2025). Future work is suggested in the direction of explicit vocabulary alignment, advanced adapter designs, closed-form interventions for representation bias correction, and hierarchical knowledge repositories capable of continuous, interpretable aggregation.

The paradigms documented above underpin a growing toolkit for scalable, modular AI system construction and maintenance as model heterogeneity and the dynamism of world knowledge increase.
