Cross-Modal Model Merging
- Cross-Modal Model Merging is the principled integration of modality-specific models into a unified system to jointly process heterogeneous data.
- Architectural strategies like multiplicative fusion, contrastive alignment, and neuron-level fusion address challenges such as noise, imbalance, and catastrophic forgetting.
- Practical applications span vision-language models, autonomous driving, and biomedical analysis, demonstrating performance gains over traditional fusion techniques.
Cross-modal model merging refers to the principled integration of models, representations, or parameters associated with different data modalities—such as vision, language, audio, or other sensor data—into a unified system or architecture that can jointly process, align, or leverage information across these domains. The term encompasses a range of strategies from architectural fusion and latent alignment to parameter-level combination of modality-specific expert models. The central challenge is to maximize synergy and task performance by harnessing the complementary strengths of each modality while mitigating issues such as noise, modality imbalance, or catastrophic forgetting.
1. Foundations and Motivations
Cross-modal model merging is driven by the need to fully exploit multimodal data in tasks where information is distributed across heterogeneous sources. Classical approaches to multimodal learning often rely on simple feature concatenation or additive fusion, which risk overfitting or fail to handle the variable reliability and informativeness of individual modalities (Liu et al., 2018). More recent advances emphasize:
- Selective attention or weighting per modality or modality-mixture
- Architectural unification accommodating heterogeneous encoders
- Contrastive or information-theoretic objectives to align diverse representation spaces
- Parameter-level or neuron-level merging for specialist models without retraining
Successful cross-modal merging supports a broad class of applications, from document understanding, biomedical analysis, and autonomous driving to foundation models in vision-language and multimodal LLMs (MLLMs).
2. Fusion and Alignment Architectures
Architectural strategies for cross-modal merging vary in terms of integration granularity and flexibility. Representative classes include:
- Multiplicative Fusion: Instead of averaging predictions, each modality produces an independent output, and the outputs are merged multiplicatively, with selective down-weighting of less reliable streams. A per-modality scaling factor ensures that strong modalities dominate model updates on a per-sample basis (Liu et al., 2018); a minimal sketch follows this list.
- Contrastive Alignment: Unified architectures, such as UNIMO, project inputs from each modality (text, image) into a shared semantic space via a multimodal Transformer. Cross-modal contrastive learning (CMCL) is employed to align paired visual and textual features by pulling positives together and pushing negatives apart in the embedding space, using objectives of the form
$$\mathcal{L}_{\mathrm{CMCL}} = -\,\mathbb{E}\left[\log \frac{\sum \exp(\mathrm{pos}/\tau)}{\sum \exp(\mathrm{pos}/\tau) + \sum \exp(\mathrm{neg}/\tau)}\right],$$
where pos and neg denote similarity-weighted scores over augmented positive and negative pairs and $\tau$ is a temperature (Li et al., 2020). A generic sketch of such an objective appears after this list.
- Cross-Attention and Mixture-of-Experts: Modules such as inter-modality cross-attention (e.g., in VLCDoC) or mixture-of-experts adapters (CMoE-Adapter) build rich associations between visual and textual streams by dynamically fusing and routing information based on semantic context (Bakkali et al., 2022, Xia et al., 1 Apr 2025).
- State Space Modeling and Lightweight Fusion: For efficiency, methods like CM-SSM employ cross-modal state space models that establish dependencies between RGB and thermal sequences, avoiding the quadratic complexity of Transformer-based attention by using selective 2D scans and residual convolutions (Guo et al., 22 Jun 2025).
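To make the multiplicative-fusion idea concrete, the following minimal PyTorch sketch combines per-modality class probabilities as a weighted geometric mean, so that a flat (low-confidence) stream contributes little to the fused prediction. The function name `multiplicative_fuse` and the fixed reliability weights are illustrative assumptions, not the exact boosting-style formulation of (Liu et al., 2018).

```python
import torch
import torch.nn.functional as F

def multiplicative_fuse(logits_per_modality, weights=None, eps=1e-8):
    """Fuse per-modality class logits multiplicatively (weighted geometric mean).

    logits_per_modality: list of tensors, each of shape (batch, num_classes).
    weights: optional per-modality reliability weights of shape (num_modalities,).
    Returns fused class probabilities of shape (batch, num_classes).
    """
    probs = [F.softmax(l, dim=-1) for l in logits_per_modality]
    if weights is None:
        weights = torch.ones(len(probs))
    weights = weights / weights.sum()
    # Weighted geometric mean: exp(sum_m w_m * log p_m). A near-uniform (unsure)
    # modality barely moves the fused distribution, so reliable streams dominate.
    log_probs = torch.stack([w * torch.log(p + eps) for w, p in zip(weights, probs)])
    fused = torch.exp(log_probs.sum(dim=0))
    return fused / fused.sum(dim=-1, keepdim=True)

# Example: a confident visual stream and a nearly uniform (noisy) audio stream.
vision_logits = torch.tensor([[4.0, 0.0, 0.0]])
audio_logits = torch.tensor([[0.1, 0.0, 0.1]])
print(multiplicative_fuse([vision_logits, audio_logits]))
```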
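Similarly, the contrastive-alignment objective can be illustrated with a standard symmetric InfoNCE loss over a batch of paired image and text embeddings; this is a generic sketch rather than UNIMO's exact CMCL loss, which operates over augmented positive and negative pairs as noted above.

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over paired image/text embeddings.

    img_emb, txt_emb: tensors of shape (batch, dim); row i of each forms a positive pair.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature               # (batch, batch) cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Matched pairs sit on the diagonal; off-diagonal entries act as in-batch negatives.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Example usage with random embeddings standing in for encoder outputs.
img_emb = torch.randn(8, 256)
txt_emb = torch.randn(8, 256)
print(cross_modal_infonce(img_emb, txt_emb).item())
```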
3. Parameter-level and Model-level Merging Techniques
Parameter-level merging addresses scenarios where multiple task- or modality-specific models are to be combined into a single multi-task or multimodal predictor, without access to primary data:
- Seed Pretraining and Interpolation: Methods such as those in (Sung et al., 2023) require that modality-specific models be initialized from a shared seed checkpoint (to ensure location in a common loss basin). Merging is performed via element-wise interpolation, modality task arithmetic, or closed-form regularized means (RegMean); a minimal sketch of the first two appears after this list. The outcome depends strongly on initialization quality and proper tuning of mixing ratios.
- Elect, Mask, and Rescale (EMR-Merging): EMR-Merging introduces an “election” of a common task vector via maximal sign alignment, followed by extremely lightweight task-specific modulators (bit masks and rescalers) that preserve direction and magnitude for each task. The merged weight for task $i$ is computed as $\hat{W}_i = W_{\mathrm{pre}} + \lambda_i\, M_i \odot \tau_{\mathrm{uni}}$, where the rescaler $\lambda_i$ matches the average magnitude of the original task vector and the binary mask $M_i$ selects directions aligned with the elected unified vector $\tau_{\mathrm{uni}}$ (Huang et al., 23 May 2024). A simplified sketch follows this list.
- Parameter Competition Balancing (PCB-Merging): PCB-Merging addresses parameter conflicts by explicitly estimating intra-task and inter-task parameter importance using softmax-normalized squared task-vector magnitudes and their cross-task similarity (via element-wise products). Binary masks drop low-importance parameters, and the remaining components are rescaled and averaged (Du et al., 3 Oct 2024).
- Neuron-Level Fusion: Locate-then-Merge and Neuron-Fusion frameworks identify the neurons with the largest parameter shifts (likely encoding new capabilities) and suppress small, possibly interfering, changes that harm language skills in MLLMs. Selective restoration or rescaling at the neuron level balances adaptation to the new modality against retention of prior knowledge (Yu et al., 22 May 2025); an illustrative sketch follows this list.
- Cross-modal Cohort Mutual Learning: Meta Fusion constructs a cohort of student models, each using different latent representations and modality pairings. The models share outputs in a soft, adaptive mutual-learning step in which only high-performing students teach weaker ones, and aggregation via ensemble selection yields robust cross-modal merging (Liang et al., 27 Jul 2025).
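As a concrete reference point for the seed-pretraining-and-interpolation bullet, the snippet below shows element-wise interpolation and simple task arithmetic over PyTorch state dicts; `alpha` and `lam` stand in for the mixing ratios whose tuning the text emphasizes, and the helper names are ours rather than from (Sung et al., 2023).

```python
import torch

def interpolate(state_a, state_b, alpha=0.5):
    """Element-wise interpolation of two fine-tuned checkpoints with identical keys."""
    return {k: alpha * state_a[k] + (1 - alpha) * state_b[k] for k in state_a}

def task_arithmetic(seed_state, finetuned_states, lam=0.3):
    """Add scaled task vectors (fine-tuned minus seed) back onto the shared seed weights."""
    merged = {k: v.clone() for k, v in seed_state.items()}
    for ft in finetuned_states:
        for k in merged:
            merged[k] = merged[k] + lam * (ft[k] - seed_state[k])  # this expert's task vector
    return merged

# Usage (hypothetical): merge a vision-tuned and an audio-tuned expert sharing one seed.
# merged = task_arithmetic(seed.state_dict(), [vision.state_dict(), audio.state_dict()])
```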
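The elect-mask-rescale pipeline of EMR-Merging can be sketched on flattened task vectors as follows. This is a simplified reading of the description above (sign election, per-task binary masks, magnitude-matching rescalers); the exact procedure is given in (Huang et al., 23 May 2024).

```python
import torch

def emr_style_merge(task_vectors):
    """Simplified elect/mask/rescale over a list of flattened task vectors (1-D tensors)."""
    stacked = torch.stack(task_vectors)                 # (num_tasks, num_params)
    elected_sign = torch.sign(stacked.sum(dim=0))       # elected per-parameter sign
    agree = (stacked * elected_sign) > 0                # entries aligned with the elected sign
    # Unified magnitude: the largest aligned magnitude per parameter.
    tau_uni = elected_sign * (stacked.abs() * agree).max(dim=0).values

    modulators = []
    for t, tau in enumerate(task_vectors):
        mask = agree[t]                                  # task-specific binary mask
        kept = mask * tau_uni
        # Rescaler matches the average magnitude of the original task vector.
        lam = tau.abs().sum() / kept.abs().sum().clamp_min(1e-12)
        modulators.append((mask, lam))
    return tau_uni, modulators

def reconstruct_task_weights(w_pretrained_flat, tau_uni, mask, lam):
    """Per-task weights: W_t = W_pre + lam * (mask * tau_uni)."""
    return w_pretrained_flat + lam * mask * tau_uni
```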
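Finally, the neuron-level idea behind Locate-then-Merge can be approximated by ranking output neurons (rows of a weight delta) by how far they moved during multimodal fine-tuning and reverting the rest to their pretrained values; the `keep_ratio` threshold and the row-wise criterion are illustrative choices, not the published selection rule of (Yu et al., 22 May 2025).

```python
import torch

def keep_largest_neuron_shifts(w_pretrained, w_finetuned, keep_ratio=0.1):
    """Keep only the output neurons (rows) whose parameters shifted the most.

    Small shifts, which may mostly interfere with prior (e.g. language) skills,
    are reverted to their pretrained values.
    """
    delta = w_finetuned - w_pretrained                  # (out_features, in_features)
    shift = delta.norm(dim=1)                           # per-neuron shift magnitude
    k = max(1, int(keep_ratio * shift.numel()))
    top = torch.topk(shift, k).indices
    keep = torch.zeros_like(shift, dtype=torch.bool)
    keep[top] = True
    return torch.where(keep.unsqueeze(1), w_finetuned, w_pretrained)
```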
4. Objectives and Theoretical Guarantees
The unification of cross-modal information benefits from objectives designed to promote robust semantic correspondence, task alignment, and reduced interference:
- Contrastive and Predictive Losses: CMCL, InfoNCE, and margin-based boosting (in multiplicative fusion methods) are key to aligning representations and reducing modality conflict while focusing training on the hardest examples (Liu et al., 2018, Li et al., 2020, Parker et al., 2023).
- Generalization Error Analyses: Adaptive mutual learning in Meta Fusion is supported by theoretical results indicating that soft information sharing strictly reduces aleatoric variance in prediction error for small coupling hyperparameters, without increasing bias. Theoretical bounds on the post-merge performance gap scale as $\mathcal{O}\!\left(\sum_t \lVert \theta_t - \theta_0 \rVert\right)$, i.e., they are controlled by the total parameter change accumulated during fine-tuning (Wei et al., 26 May 2025, Liang et al., 27 Jul 2025).
- Catastrophic Forgetting Mitigation: Neuron-level and parameter importance-based approaches (like PCB-Merging and Locate-then-Merge) quantitatively reduce loss of previously learned skills, as evidenced by improved language and visual ability scores, as well as reduced hallucination in generative tasks (Yu et al., 22 May 2025, Du et al., 3 Oct 2024).
5. Empirical Results and Practical Applications
Across a spectrum of benchmarks and modalities, cross-modal model merging achieves improvements over additive or concatenative baselines:
- Vision-Language and Multimodal LLMs: Unified models obtained via advanced merging algorithms recover or even outperform expert models specializing in individual tasks, with documented average gains (e.g., +3% VQA, +7% COCO retrieval) and substantial improvements in complex reasoning/grounding tasks (Sung et al., 2023, Wei et al., 26 May 2025).
- Document AI and Scientific Domains: Architectures like VLCDoC with explicit intra- and inter-modality contrasting demonstrate robust generalization, maintaining high accuracy even with reduced pretraining data (Bakkali et al., 2022). Domain-specific cross-modal foundation models (AstroCLIP) transfer retrieval and property estimation capabilities from images to spectra and vice versa (Parker et al., 2023).
- Efficient Real-Time Systems: State space modeling and selective-scan fusion in RGB-thermal wild-scene segmentation yield high segmentation accuracy (74.6% mIoU on CART) with high computational efficiency (114 FPS at 12.59M parameters) (Guo et al., 22 Jun 2025).
- Continual and Federated Cross-Modal Generalization: Continual merging of modalities via adapter-based dynamic codebooks, guided by mediator modalities and EWC-style regularizations (CMoE-Adapter and PMR), supports strong zero-shot generalization and expansion to new unseen modality pairs (Xia et al., 1 Apr 2025).
- Privacy-Preserving and Distributed Scenarios: Training-free merging of models and normalization statistics, justified by mode connectivity and Gaussian prior assumptions, enables robust multi-domain model assembly without access to source data (Li et al., 18 Jul 2024).
6. Limitations, Open Problems, and Future Outlook
Despite empirical and practical advances, several open issues persist:
- Dependency on Common Initialization: Most parameter merging techniques require models to be initialized from identical pretrained checkpoints to ensure their weights are mode-connected and merging is effective (Sung et al., 2023, Li et al., 18 Jul 2024).
- Modality Interaction Granularity: Fine-tuned merging of parameter subsets, layers, or neurons requires reliable task vector decomposition and robust importance estimation.
- Capacity Expansion and Continual Growth: Dynamic codebook expansion (to accommodate new modalities) and protection against catastrophic forgetting demand scalable replay or regularization strategies (Xia et al., 1 Apr 2025).
- Noise and Redundancy: Efficient noise removal, e.g., via low-rank approximation of the task-vector SVD spectrum, remains an active research area (Wei et al., 26 May 2025); a minimal sketch follows this list.
- Empirical Evaluation and Benchmarks: Benchmarks like those for MLLM merging (tasks: VQA, OCR, Grounding) are emerging, but broader, standardized evaluations across data regimes and application domains are required for robust comparison (Wei et al., 26 May 2025).
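As an illustration of the low-rank noise-removal direction mentioned above, the sketch below truncates the SVD spectrum of a single layer's task vector before re-applying it. The rank cutoff `rank` is a hypothetical hyperparameter, and the procedure is a generic denoising heuristic under this assumption, not a specific published method.

```python
import torch

def low_rank_task_vector(w_pretrained, w_finetuned, rank=8):
    """Keep only the top singular directions of a layer's task vector.

    The task vector (fine-tuned minus pretrained weight) is approximated by a
    rank-`rank` SVD truncation, discarding low-energy components that are
    presumed to carry noise or redundancy.
    """
    delta = w_finetuned - w_pretrained                  # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    r = min(rank, S.numel())
    delta_lr = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]
    return w_pretrained + delta_lr
```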
A likely direction is the development of unified, scalable architectures and algorithms that merge arbitrary numbers of modality-specific and task-specific models, with theoretical guarantees on performance and stability, and deployment protocols for real-world, federated, and privacy-sensitive settings.
7. Summary Table of Core Methodologies
Merging Approach | Core Mechanism | Key Reference |
---|---|---|
Multiplicative Fusion | Per-sample confident modality selection | (Liu et al., 2018) |
Transformer-based Alignment | Shared semantic space, CMCL loss | (Li et al., 2020) |
Parameter Interpolation | Averaging or task arithmetic | (Sung et al., 2023) |
EMR/PCB Merging | Election, masks, rebalancing | (Huang et al., 23 May 2024, Du et al., 3 Oct 2024) |
Neuron-Fusion | Selective neuron preservation | (Yu et al., 22 May 2025) |
Ensemble/Mutual Learning | Latent pairing, adaptive soft sharing | (Liang et al., 27 Jul 2025) |
SSM-based Real-Time Fusion | Cross-modal state space with linear cost | (Guo et al., 22 Jun 2025) |
This synthesis demonstrates that cross-modal model merging is a core problem at the intersection of deep learning, multimodal fusion, and model compression, with a rapidly expanding methodology suite designed to robustly exploit multimodal signals, transfer specialist expertise, and underpin next-generation unified AI systems.