Cross-Domain Model Merging
- Cross-domain model merging is a set of techniques that integrate neural networks trained on distinct domains, preserving each model's strengths while enabling multi-task deployment.
- Key methods include linear interpolation, optimization-based fusion, and sparsification, which address challenges like parameter conflict and catastrophic forgetting.
- This approach enhances performance and efficiency in settings like continual, federated, and privacy-sensitive learning environments, offering practical deployment benefits.
Cross-domain model merging refers to the set of methodologies for integrating neural networks that have been individually trained or fine-tuned on datasets from distinct domains, tasks, or environments into a single unified model. The objective is to produce a model that retains, synthesizes, or generalizes the capabilities of all constituent sources, without catastrophic interference or significant performance loss. Approaches range from explicit parameter manipulation (e.g., weight interpolation, pruning, optimization-based fusion) to the use of auxiliary modules or training-free statistical procedures; these methods target diverse contexts such as vision, language, multimodal, and continual learning systems. Cross-domain merging is crucial for multi-task deployment, continual adaptation, federated learning, and settings where direct access to source data is infeasible due to privacy, bandwidth, or annotation constraints.
1. Foundational Principles and Motivations
Cross-domain model merging is rooted in the observation that large foundation models and their derivatives (specialized or fine-tuned for different domains) often reside within a mode-connected parameter space, provided they share a common initialization (Li et al., 18 Jul 2024, Ye et al., 2023). This property enables parameter-space operations—such as linear interpolation, arithmetic, or convex combination—to produce merged models that perform adequately on tasks across constituent domains. The motivations for model merging encompass:
- Decoupling learning and integration: Enabling independent domain adaptation or specialization followed by efficient post hoc merging without joint retraining (Ruan et al., 12 Mar 2025).
- Parameter efficiency and deployment: Substantially reducing storage and inference resource requirements compared to ensembling (Ye et al., 2023, Lin et al., 1 Jul 2024).
- Data privacy and bandwidth: Avoiding shared data pools by operating solely on model weights/statistics (Li et al., 18 Jul 2024, Shin et al., 29 May 2025).
- Continual and federated learning: Aggregating domain-specific expertise in dynamic scenarios where data arrives sequentially or is distributed across sources (Shu et al., 16 Jul 2025).
The foundational challenge is to reconcile conflicting or domain-specific parameter updates, ensuring the merged model generalizes or, at minimum, preserves the source models’ strong capabilities.
2. Parameter-Space Merging Techniques
A core axis of cross-domain merging involves parameter-space manipulations, with several classes of methods:
2.1 Direct Interpolation and Arithmetic
- Linear/interpolation averaging: $\theta_{\text{merged}} = \lambda\theta_A + (1-\lambda)\theta_B$ for weights $\theta_A, \theta_B$ from domains A and B, extended to multiple models via convex combinations $\theta_{\text{merged}} = \sum_i \lambda_i \theta_i$ with $\sum_i \lambda_i = 1$ (Li et al., 18 Jul 2024, Ruan et al., 12 Mar 2025, Ye et al., 2023).
- Task arithmetic: Uses "task vectors" $\tau_i = \theta_i - \theta_0$, the difference between a model fine-tuned on domain $i$ and the shared base $\theta_0$, merging via $\theta_{\text{merged}} = \theta_0 + \sum_i \lambda_i \tau_i$ (Ruan et al., 12 Mar 2025, Thakkar et al., 11 Nov 2024). This framework is extended to "domain vectors" and "alignment vectors" to balance domain knowledge and safety (Thakkar et al., 11 Nov 2024). A code sketch of both operations follows this list.
- Mode connectivity requirement: Optimal results are obtained when source models are initialized from the same checkpoint and fine-tuned with small learning rates to retain linear connectivity in parameter space (Li et al., 18 Jul 2024, Ye et al., 2023).
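The following minimal sketch shows the interpolation and task-arithmetic operations referenced above, assuming PyTorch-style state dicts with matching keys and floating-point tensors; the function names are illustrative and not taken from the cited papers.

```python
# Minimal sketch of weight interpolation and task arithmetic over PyTorch-style
# state dicts (assumes matching keys and floating-point tensors).
# Function names are illustrative, not from the cited papers.
import torch


def merge_interpolate(state_a, state_b, lam=0.5):
    """Convex combination of two checkpoints: theta = lam * A + (1 - lam) * B."""
    return {k: lam * state_a[k] + (1.0 - lam) * state_b[k] for k in state_a}


def task_arithmetic(base, finetuned_list, coeffs):
    """theta_merged = theta_0 + sum_i lambda_i * (theta_i - theta_0)."""
    merged = {k: v.clone() for k, v in base.items()}
    for ft, lam in zip(finetuned_list, coeffs):
        for k in merged:
            merged[k] += lam * (ft[k] - base[k])
    return merged
```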
2.2 Optimization-Based Merging
- Closed-form regression-based fusion: For linear layers $y = xW$, weights are merged to minimize output discrepancy, yielding $W^{*} = \big(\sum_i X_i^{\top}X_i\big)^{-1}\sum_i X_i^{\top}X_i W_i$, where the $X_i^{\top}X_i$ are per-domain input covariance (Gram) matrices (Shu et al., 16 Jul 2025); a code sketch follows this list.
- Orthogonalization: For parameter-efficient fine-tuning (e.g., LoRA), direction and magnitude of weight updates are decoupled and orthogonally aligned before merging, reducing destructive interference (Zheng et al., 21 May 2025).
- Evolutionary/gradient-based recipe search: Evolutionary algorithms optimize the merging recipe in both parameter and data-flow spaces, allowing for cross-domain transfer and custom blending (Akiba et al., 19 Mar 2024).
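The closed-form regression view above can be sketched for a single linear layer as below; the per-domain activation matrices, the small ridge term, and the function name are illustrative assumptions rather than the exact formulation of the cited method.

```python
# Hedged sketch of closed-form regression-based fusion for one linear layer
# y = x @ W. Per-domain activations X_i, the ridge term eps, and the function
# name are illustrative assumptions, not the exact formulation of the cited work.
import torch


def regression_merge(weights, activations, eps=1e-6):
    """W* = (sum_i X_i^T X_i)^(-1) (sum_i X_i^T X_i W_i)."""
    d_in = weights[0].shape[0]
    gram_sum = torch.zeros(d_in, d_in)
    rhs = torch.zeros_like(weights[0])
    for W, X in zip(weights, activations):
        gram = X.T @ X          # per-domain input covariance (Gram) matrix
        gram_sum += gram
        rhs += gram @ W
    # small ridge term keeps the summed Gram matrix invertible
    return torch.linalg.solve(gram_sum + eps * torch.eye(d_in), rhs)
```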
2.3 Pruning and Sparsification
- Magnitude-based pruning: Eliminates less significant delta parameters before merging to reduce parameter conflict; methods such as DPPA (Dynamic Pruning Partition Amplification) partition and dynamically amplify significant parameter subsets, preserving just 20% of domain-specific parameters while matching the performance of much denser models (Zhu et al., 5 Mar 2024). A minimal sketch of the pruning step follows this list.
- Redundancy-aware merging: Trims redundant or low-magnitude parameter changes either within the training trajectory (historical averaging) or across domains for efficient fusion (Ding et al., 11 Jun 2025).
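Below is a minimal sketch of magnitude-based delta pruning before merging, assuming PyTorch-style state dicts with floating-point tensors; the density, scaling, and function names are illustrative, and DPPA's partition-wise amplification is omitted.

```python
# Minimal sketch of magnitude-based delta pruning before merging: keep only the
# top-density fraction of each task vector's entries by magnitude, then add the
# sparsified deltas back onto the base. Density 0.2 mirrors the "keep ~20% of
# domain-specific parameters" setting; names are illustrative.
import torch


def prune_delta(delta, density=0.2):
    """Zero out all but the largest-magnitude entries of a delta tensor."""
    flat = delta.flatten()
    k = max(1, int(density * flat.numel()))
    threshold = flat.abs().topk(k).values.min()
    return torch.where(delta.abs() >= threshold, delta, torch.zeros_like(delta))


def merge_with_pruned_deltas(base, finetuned_list, density=0.2, scale=1.0):
    """Merge sparsified task vectors from several fine-tuned models."""
    merged = {k: v.clone() for k, v in base.items()}
    for ft in finetuned_list:
        for k in merged:
            merged[k] += scale * prune_delta(ft[k] - base[k], density)
    return merged
```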
3. Conflict Mitigation and Information Integration
3.1 Parameter Competition and Importance Balancing
- PCB-Merging (Parameter Competition Balancing): Assesses intra-task significance and inter-task similarity for each parameter, applying dropout and adaptive rescaling to prevent low-importance or conflicting parameters from degrading merged performance (Du et al., 3 Oct 2024).
- Drop-and-rescale strategies: Techniques such as DARE, TIES-Merging, and DPA discard or dampen non-significant deltas and then rescale remaining components (Zhu et al., 5 Mar 2024, Ruan et al., 12 Mar 2025).
- Activation/statistics-based fusion: For models with batch normalization, merge running means and variances using weighted statistical formulas to harmonize distributional assumptions across domains (Li et al., 18 Jul 2024).
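As a concrete illustration of the statistics-based fusion above, the following sketch merges batch-normalization running means and variances from two domains using sample-count weights; the exact weighting in the cited work may differ.

```python
# Hedged sketch of merging batch-norm running statistics from two domains with
# sample-count weights; the exact weighting in the cited work may differ.
import torch


def merge_bn_stats(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Pool running means/variances of the same BN layer from two domains."""
    w_a = n_a / (n_a + n_b)
    w_b = 1.0 - w_a
    mean = w_a * mean_a + w_b * mean_b
    # pooled second moment minus the squared pooled mean
    var = w_a * (var_a + mean_a**2) + w_b * (var_b + mean_b**2) - mean**2
    return mean, var
```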
3.2 Frequency Domain Filtering
- FREE-Merging: Recognizes that detrimental task interference stems from conflicting low-frequency components in fine-tuned parameters. It applies Fourier transform filtering to remove low frequencies from task vectors before merging, and compensates for performance loss using lightweight expert subnetworks (Zheng et al., 25 Nov 2024).
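A toy sketch of the frequency-domain filtering idea follows: flatten a task vector, attenuate its lowest-frequency FFT bins, and transform back. The cutoff and function names are illustrative, and the full FREE-Merging pipeline (including its expert compensation) is more involved.

```python
# Toy sketch of frequency-domain filtering of a flattened task vector: attenuate
# the lowest-frequency FFT bins, then transform back. The cutoff and names are
# illustrative; the full FREE-Merging pipeline is more involved.
import torch


def highpass_task_vector(delta, cutoff_ratio=0.1):
    """Remove the lowest-frequency components of a task vector."""
    flat = delta.flatten().to(torch.float32)
    spectrum = torch.fft.rfft(flat)
    cutoff = int(cutoff_ratio * spectrum.numel())
    spectrum[:cutoff] = 0          # drop low-frequency (incl. DC) components
    filtered = torch.fft.irfft(spectrum, n=flat.numel())
    return filtered.reshape(delta.shape)
```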
3.3 Gating and Mixing Control
- Gating networks: An auxiliary classifier network predicts domain association for incoming inputs and weights the contribution of each source model’s parameters per layer, with the possibility to select the most appropriate classifier head (Ye et al., 2023).
- Layer-level control: Merge layers using either the gating approach or standard averaging according to a similarity metric—typically cosine similarity of parameter vectors. Layers with greater divergence are handled adaptively (Ye et al., 2023).
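The layer-level control can be sketched as a simple merge plan: layers whose parameters agree across the two source models (high cosine similarity) are averaged, while divergent layers are routed to the gating path. The threshold and names below are illustrative.

```python
# Minimal sketch of layer-level control: average layers whose parameters agree
# across the two source models, and route divergent layers to the gating path.
# The threshold and names are illustrative.
import torch
import torch.nn.functional as F


def layerwise_merge_plan(state_a, state_b, threshold=0.9):
    """Return a per-layer decision: 'average' vs. 'gate'."""
    plan = {}
    for name in state_a:
        sim = F.cosine_similarity(
            state_a[name].flatten().float(), state_b[name].flatten().float(), dim=0
        )
        plan[name] = "average" if sim >= threshold else "gate"
    return plan
```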
4. Specialized Contexts and Methodological Variants
4.1 Multimodal and Multilingual Fusion
- Vision-language and cross-modal merging: Interpolation and task arithmetic are extended to architectures with separate vision, language, and multimodal branches, combining weights per modality and cross-modal block (Sung et al., 2023, Wei et al., 26 May 2025).
- Latent factor sharing for recommender systems: Enforces equality constraints on shared user factors across implicit matrix factorization models for different domains using ADMM (Alternating Direction Method of Multipliers) consensus updates (Samra et al., 23 Sep 2024).
- Multilingual knowledge transfer: Weighted or slerp-style merging of language-specific and domain-specific models in cross-lingual settings achieves limited success for technical vocabulary acquisition, highlighting persistent cross-lingual alignment bottlenecks (Rousset et al., 17 Feb 2025).
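For reference, slerp-style merging interpolates along the arc between two weight tensors rather than the chord; the sketch below shows only the interpolation rule (with a linear-interpolation fallback for nearly parallel weights), not any specific cross-lingual recipe.

```python
# Hedged sketch of slerp-style interpolation between two weight tensors, with a
# linear-interpolation fallback when the weights are nearly parallel.
import torch


def slerp(a, b, t=0.5, eps=1e-7):
    """Spherical linear interpolation between flattened weight tensors."""
    a_flat, b_flat = a.flatten(), b.flatten()
    cos_omega = torch.clamp(
        torch.dot(a_flat, b_flat) / (a_flat.norm() * b_flat.norm() + eps),
        -1.0, 1.0,
    )
    omega = torch.arccos(cos_omega)
    if omega.abs() < 1e-4:          # nearly collinear: plain lerp is stable
        return (1 - t) * a + t * b
    sin_omega = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / sin_omega) * a \
        + (torch.sin(t * omega) / sin_omega) * b
```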
4.2 Quantization and Edge Deployment
- Quantization-aware merging: HDRQ (Hessian and Distance Regularizing Quantization) ensures post-training quantization supports mergeability by regularizing the Hessian (to flatten the loss surface) and minimizing distance from the source, concurrently addressing smoothing and alignment for quantized weights (Shin et al., 29 May 2025).
4.3 Continual and Federated Learning
- Continual adapter merging: RegCL enables replay-free continual learning by sequentially merging LoRA modules for new domains, using closed-form and averaging-based update rules parameterized by domain inner-product statistics; model size stays constant and catastrophic forgetting remains minimal (Shu et al., 16 Jul 2025). A simplified sketch follows this list.
- Federated/decentralized scenarios: Merge parameters or statistical summaries from local models without exchanging original training data, crucial for privacy and regulatory compliance (Li et al., 18 Jul 2024, Shin et al., 29 May 2025).
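A heavily simplified sketch of size-constant sequential adapter merging: each incoming LoRA update is folded into a running average of the merged delta. This is a stand-in averaging rule, not RegCL's closed-form update; the names and uniform weighting are assumptions.

```python
# Heavily simplified sketch of size-constant sequential adapter merging: each
# incoming LoRA update (B @ A) is folded into a running average of the merged
# delta. This is a stand-in averaging rule, not RegCL's closed-form update.
import torch


def fold_in_lora(merged_delta, num_seen, lora_A, lora_B):
    """Running average over domains of the full-rank LoRA update B @ A."""
    new_delta = lora_B @ lora_A                     # [d_out, r] @ [r, d_in]
    merged = (num_seen * merged_delta + new_delta) / (num_seen + 1)
    return merged, num_seen + 1
```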
5. Performance, Empirical Results, and Limitations
Empirical studies across computer vision, NLP, and multimodal tasks consistently demonstrate that carefully designed merging schemes can:
- Match or marginally outperform task-specific models in their respective domains, while enabling cross-domain generalization (Ye et al., 2023, Wei et al., 26 May 2025, Zhu et al., 5 Mar 2024).
- Scale to high numbers of constituent models (up to 12 in ViT merging), with negligible performance loss when using gating or adaptive mixing (Ye et al., 2023).
- Achieve state-of-the-art results in benchmarks for reward modeling, code-switching speech recognition, domain generalization, and technical vocabulary transfer under controlled settings (Lin et al., 1 Jul 2024, Peng et al., 2022, Samra et al., 23 Sep 2024, Rousset et al., 17 Feb 2025).
Limitations persist in handling highly divergent architectures, models initialized from non-shared checkpoints (loss of linear mode connectivity), persistent domain interference, and cross-lingual transfer of technical or semantic content (Rousset et al., 17 Feb 2025). Quantization introduces discretization barriers, counteracted only with careful regularization (Shin et al., 29 May 2025).
6. Diagnostic Metrics, Taxonomy, and Theoretical Insights
6.1 Similarity and Predictive Metrics
- Soft Sign Dissimilarity (SSD) and TSSD: Predict, before merging, how well two weight tensors or modules will combine, using sign agreement and truncated magnitude analysis; these scores correlate strongly with empirical post-merge performance drops (Sung et al., 2023).
- Error barriers along interpolation paths: Used to assess the flatness of the loss surface between models; higher barriers predict poor merging performance unless regularized (Shin et al., 29 May 2025). A measurement sketch follows this list.
- L2 distances and parameter statistics: Employed to gauge equidistance between merged and source models, particularly in alignment trade-off studies (Thakkar et al., 11 Nov 2024).
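The error-barrier diagnostic referenced above can be sketched as follows, assuming a user-supplied `eval_loss` callable that maps a state dict to a scalar loss; the step count and names are illustrative.

```python
# Minimal sketch of the error-barrier diagnostic: evaluate a loss at weights
# interpolated between two checkpoints and report the worst gap above the
# endpoints' linear baseline. `eval_loss` is an assumed user-supplied callable.
import torch


def error_barrier(state_a, state_b, eval_loss, steps=11):
    """Max loss excess along the linear path between two checkpoints."""
    alphas = torch.linspace(0, 1, steps)
    losses = []
    for alpha in alphas:
        interp = {k: (1 - alpha) * state_a[k] + alpha * state_b[k] for k in state_a}
        losses.append(float(eval_loss(interp)))
    losses = torch.tensor(losses)
    baseline = (1 - alphas) * losses[0] + alphas * losses[-1]
    return (losses - baseline).max().item()
```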
6.2 Taxonomic Organization
A unified taxonomy for merging methods encompasses:
| Category | Main Principle | Key Methods/Examples |
|---|---|---|
| Permutation/Alignment | Weight/activation matching | Git Re-Basin, MuDSC |
| Direct Merging | Linear/arithmetic in parameter space | Model Soup, Task Arithmetic |
| Magnitude Pruning | Pre-merge sparsification | DARE, TIES, DPPA |
| Activation Pruning | Neuron/activation-based selection | ZipIt!, SurgeryV2, MACL |
| Optimization-based | Sensitivity/regression objectives | Fisher Avg, RegMean, WUDI |
| LoRA/MoE Merging | Modular or PEFT-based composition | LoraHub, Twin Merging |
7. Applications and Future Directions
Cross-domain model merging enables:
- Robust multi-domain deployment of visual, language, and multimodal systems under bandwidth and privacy constraints (Ding et al., 11 Jun 2025).
- Rapid adaptation and aggregation in federated, continual, and decentralized learning contexts (Shu et al., 16 Jul 2025, Li et al., 18 Jul 2024).
- Enhanced resource efficiency, especially when leveraging quantization and parameter-efficient fine-tuning (Shin et al., 29 May 2025, Zheng et al., 21 May 2025).
- Improved technical alignment of models for specialized domains (reward modeling, domain-aware alignment, safety) without collecting or labeling new data (Lin et al., 1 Jul 2024, Thakkar et al., 11 Nov 2024).
Ongoing research focuses on integrating model compression with merging, structure- and task-aware merging schemes, stronger theoretical guarantees on mergeability, cross-lingual and cross-modal extension, and user-friendly, automated merging pipelines (Ruan et al., 12 Mar 2025, Zheng et al., 21 May 2025).
Cross-domain model merging is advancing rapidly with new theoretical, algorithmic, and practical innovations. The field’s development is shaped by increasingly stringent deployment, privacy, and efficiency requirements, and by the realization that cross-domain synthesis of capabilities is essential to modern AI’s continued progress.