Multimodal Model Merging
- Multimodal Model Merging is the algorithmic fusion of deep neural models from diverse modalities to create a unified parameter set that preserves individual expertise.
- It integrates methods like weight-space interpolation, sparsification, and attention alignment to balance modality-specific capabilities and minimize performance interference.
- Emerging strategies address challenges such as catastrophic forgetting and heterogeneous architecture merging, paving the way for scalable, lifelong multimodal systems.
Multimodal Model Merging is the process of algorithmically fusing multiple deep neural models—each trained on different modalities or multimodal tasks—into a single parameter set that expresses the complementarity of the original networks. This approach enables the construction of unified, zero-shot capable large models that leverage the diversity of independently optimized experts for language, vision, audio, and other modalities, without resource-intensive joint fine-tuning or access to the original pretraining data. The field integrates advances in weight-space interpolation, sparsification, attention alignment, and importance-driven fusion, with recent work extending to parameter-efficient, heterogeneous-architecture, and lifelong (temporal) multimodal settings. Key challenges include mitigating catastrophic forgetting, balancing modality-specific and cross-modal capabilities, and reducing performance interference among constituent experts.
1. Foundations and Motivations
Model merging emerged to address practical limitations in large-scale multimodal systems: the cost and inflexibility of joint retraining, the need for rapid integration of modality/domain specialists, and the desire to preserve or superimpose fine-grained expert capabilities within a single deployable backbone. In the multimodal setting, this means combining models such as vision–language, audio–language, or point cloud–language LLMs, or deeper fusions of visual, audio, and coding LLMs into a single MLLM.
The foundational principle is the weight-space connectivity of deep networks: independently fine-tuned models, originating from a shared initialization, typically reside in a connected (flat) loss basin, permitting direct parameter-space interpolation or arithmetic to fuse expertise (Yang et al., 2024). Empirical studies confirm that simple forms of layerwise or vectorial merging suffice to approximately preserve performance on both constituent modalities and joint tasks, even in the absence of further data (Sung et al., 2023, Takmaz et al., 2 Oct 2025, Wang et al., 12 Jan 2026).
2. Core Methodologies
Multimodal model merging now encompasses a rich taxonomy of parameter fusion strategies, most of which can be organized into the following categories:
A. Linear and Arithmetic Merging
The simplest approach linearly interpolates two (or more) parameter sets sharing the same architecture and initialization, $\theta_{\text{merged}} = \lambda\,\theta_A + (1-\lambda)\,\theta_B$, or, more generally, for $N$ experts, $\theta_{\text{merged}} = \sum_{i=1}^{N} \alpha_i \theta_i$ with $\sum_i \alpha_i = 1$. Task arithmetic operates in task-vector space relative to a shared base: $\theta_{\text{merged}} = \theta_{\text{base}} + \sum_{i=1}^{N} \lambda_i \tau_i$, where $\tau_i = \theta_i - \theta_{\text{base}}$. This family underlies classic model soups, task vector addition, and layerwise averaging (Yang et al., 2024, Takmaz et al., 2 Oct 2025, Wei et al., 26 May 2025).
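As a concrete illustration, the following is a minimal sketch of both operations on PyTorch state dicts; it assumes all experts share the base model's architecture and parameter names, and the coefficient values are placeholders rather than recommendations.

```python
def linear_merge(state_dicts, weights):
    """Weighted average of N expert state dicts (model-soup style)."""
    assert abs(sum(weights) - 1.0) < 1e-6, "interpolation weights should sum to 1"
    return {
        key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

def task_arithmetic_merge(base_sd, expert_sds, lambdas):
    """Add scaled task vectors (expert minus base) back onto the shared base."""
    merged = {}
    for key in base_sd:
        task_vectors = [sd[key].float() - base_sd[key].float() for sd in expert_sds]
        merged[key] = base_sd[key].float() + sum(
            lam * tv for lam, tv in zip(lambdas, task_vectors)
        )
    return merged
```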
B. Parameter-Efficient and PEFT Merging
When specialist models are trained via parameter-efficient methods (e.g., LoRA or adapters), merging is confined to the low-rank update matrices, e.g., $W_{\text{merged}} = W_0 + \sum_i \lambda_i B_i A_i$ for LoRA experts with updates $\Delta W_i = B_i A_i$. CoPA-Merging introduces structured pruning of singular vectors, scaling, and cross-task normalization to maintain directionality and balance the contributions of each expert under LoRA adaptation (Zeng et al., 24 Feb 2025).
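A minimal sketch of folding LoRA updates into a shared base linear layer under this formulation; it omits CoPA-Merging's singular-vector pruning and cross-task normalization, and the shapes and coefficients are illustrative assumptions.

```python
def merge_lora_updates(base_weight, lora_pairs, lambdas):
    """Fold several low-rank LoRA updates (B @ A) into one base weight matrix.

    base_weight: (d_out, d_in) frozen base layer weight.
    lora_pairs:  list of (B, A) tuples with shapes (d_out, r) and (r, d_in).
    lambdas:     per-expert scaling coefficients.
    """
    merged = base_weight.clone().float()
    for lam, (B, A) in zip(lambdas, lora_pairs):
        merged = merged + lam * (B.float() @ A.float())
    return merged
```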
C. Alignment and Modular Decoupling
Methods such as DAMC (Chen et al., 2024) and MMER (Li et al., 21 May 2025) decouple subnetworks or generate masks to isolate modality-specific parameters. MMER proceeds in three steps (a sketch follows the list):
- Merge fine-tuned models in the task-vector domain using sign-consistent summation (e.g., TIES merge).
- Construct binary masks for each modality to isolate only coordinates whose directional contribution matches the merged vector and exceeds a significance threshold.
- At inference, an input from a given modality is routed through that modality's mask, avoiding destructive interference between experts.
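The following is a schematic sketch of the sign-consistent merge and per-modality mask construction on flattened task vectors; the significance threshold and the routing comment at the end are illustrative assumptions rather than MMER's exact procedure.

```python
import torch

def sign_consistent_merge(task_vectors):
    """TIES-style merge: average, per coordinate, only the entries whose sign
    agrees with the elected (majority-mass) sign."""
    stacked = torch.stack(task_vectors)          # (N, P) flattened task vectors
    elected = torch.sign(stacked.sum(dim=0))     # sign with the larger total mass
    agree = torch.sign(stacked) == elected       # (N, P) agreement mask
    counts = agree.sum(dim=0).clamp(min=1)
    return (stacked * agree).sum(dim=0) / counts

def modality_mask(task_vector, merged, threshold=1e-4):
    """Binary mask keeping coordinates whose direction matches the merged vector
    and whose magnitude exceeds a significance threshold (an assumed criterion)."""
    same_dir = torch.sign(task_vector) == torch.sign(merged)
    significant = task_vector.abs() > threshold
    return (same_dir & significant).float()

# At inference, an input of modality m would use only its masked slice:
#   theta_m = theta_base + modality_mask(tau_m, merged) * merged
```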
D. Neuron and Layerwise Selective Fusion
Locate-then-Merge and Neuron-Fusion (Yu et al., 22 May 2025) apply neuron-level selection, preserving large-shift (“vision-specialist”) neurons and suppressing diffused shifts that induce forgetting. Layer-level or blockwise merging—such as plateau-guided merging in PlaM (Wang et al., 12 Jan 2026)—restores base LLM weights in late stages to recover language reasoning while preserving cross-modal alignment earlier.
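A minimal sketch of such layer-level restoration, assuming parameter names of the form `model.layers.<idx>.` common in decoder-only LLMs; in practice the plateau index would be chosen from a representation-similarity or performance-plateau curve rather than hard-coded.

```python
import re

def restore_late_layers(merged_sd, base_sd, plateau_start):
    """Keep merged weights for early layers; restore base-LLM weights from
    `plateau_start` onward. Non-layer parameters stay merged."""
    layer_idx = re.compile(r"model\.layers\.(\d+)\.")
    restored = {}
    for key, value in merged_sd.items():
        m = layer_idx.search(key)
        if m is not None and int(m.group(1)) >= plateau_start and key in base_sd:
            restored[key] = base_sd[key].clone()   # late layer: revert to base LLM
        else:
            restored[key] = value                  # early layer / other params: keep merged
    return restored
```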
E. Heterogeneous and Architecture-Aware Merging
AdaMMS merges MLLMs with different architectures (e.g., layer duplication, additional QKV heads) by mapping, linearly interpolating corresponding tensors, and searching for optimal interpolation coefficients via unsupervised consistency on unlabeled data (Du et al., 31 Mar 2025).
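One way to picture the unsupervised coefficient search is the toy routine below, which scores each candidate coefficient by how consistently its merged model answers unlabeled prompts relative to the other candidates. This agreement criterion is an illustrative stand-in, not AdaMMS's exact objective, and `build_merged`/`generate` are assumed callables supplied by the caller.

```python
def search_alpha(build_merged, generate, prompts, alphas=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Select the interpolation coefficient whose merged model answers unlabeled
    prompts most consistently with the other candidate coefficients.

    build_merged(alpha) -> model whose weights are interpolated with coefficient alpha
    generate(model, prompt) -> string response
    """
    answers = {}
    for a in alphas:
        model = build_merged(a)
        answers[a] = [generate(model, p) for p in prompts]

    def consistency(a):
        return sum(
            answers[a][i] == answers[b][i]
            for b in alphas if b != a
            for i in range(len(prompts))
        )

    return max(alphas, key=consistency)
```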
F. Optimization-Based and Chunked Merging
Expert Merging (Zhang et al., 30 Sep 2025) and its extension, Expert Merging++, learn a small set of per-layer (or chunk) coefficients via loss alignment of hidden states/logits with the experts, regularized for stability, often using only a modest calibration set.
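The toy routine below illustrates the idea on a single linear layer: per-expert coefficients are learned by gradient descent so that the merged layer matches an alignment target on a small calibration set. The target used here (the mean of the experts' outputs) and the shapes are simplifying assumptions; Expert Merging's actual objective aligns hidden states/logits of the full model and adds stability regularization.

```python
import torch

def learn_layer_coeffs(base_w, expert_ws, calib_x, steps=200, lr=1e-2):
    """Toy single-layer version of coefficient learning.

    base_w:    (d_out, d_in) base weight matrix.
    expert_ws: list of expert weights with the same shape.
    calib_x:   (n, d_in) calibration inputs.
    """
    task_vectors = torch.stack([w - base_w for w in expert_ws])       # (N, d_out, d_in)
    coeffs = torch.zeros(len(expert_ws), requires_grad=True)
    target = torch.stack([calib_x @ w.T for w in expert_ws]).mean(0)  # (n, d_out)
    opt = torch.optim.Adam([coeffs], lr=lr)
    for _ in range(steps):
        merged_w = base_w + (coeffs.view(-1, 1, 1) * task_vectors).sum(dim=0)
        loss = torch.nn.functional.mse_loss(calib_x @ merged_w.T, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return coeffs.detach()
```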
| Method Name | Key Idea/Formula | Notable Applications |
|---|---|---|
| Linear Interp./TA | Weighted averaging of weights or addition of task vectors | Language/vision fusion (Takmaz et al., 2 Oct 2025, Sung et al., 2023) |
| MMER | Masked, sign-consistent task vector merging & decoupling | Multimodal expansion/retention, catastrophic forgetting mitigation (Li et al., 21 May 2025) |
| Neuron-Fusion | Select large-shift neurons; suppress/restore | Mitigate language drop in MLLMs (Yu et al., 22 May 2025) |
| AdaMMS | Mapping-based blockwise merge + unsupervised coeff. search | Heterogeneous backbone merges (Du et al., 31 Mar 2025) |
| Plateau-guided | Restore LLM weights in plateau (late) layers | Language retention, visual grounding (Wang et al., 12 Jan 2026) |
| Expert Merging(++) | Layerwise/chunkwise coefficients via alignment | Multi-expert LLM/MLLM (Zhang et al., 30 Sep 2025) |
3. Advanced Strategies: Interference, Sparsification, and Regularization
Central challenges in multimodal merging are performance interference, gradient conflicts, and catastrophic forgetting, which are especially pronounced when fusing highly heterogeneous tasks or modalities. Several classes of solutions have been proposed:
- Sparsity and Masking: EMR-Merging (Huang et al., 2024) elects a unified task vector by majority sign, masks out incompatible directions per task, and rescales magnitudes so that each expert's core features are recovered without detrimental interference.
- Optimization-Based Losses: OptMerge (Wei et al., 26 May 2025) denoises task vectors via low-rank projection (see the SVD sketch after this list), then minimizes an interference loss penalizing overlap in parameter-space "energy."
- Adaptive/Regularized Coefficients: AMM (Yin et al., 10 Oct 2025) combines static task importance weights with dynamic compatibility and projection penalties, ensuring per-layer merges remain in directions aligned with specialist tasks.
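As a sketch of the low-rank denoising step referenced above, the function below keeps only the top singular directions of a matrix-shaped task vector; the choice of rank and the omission of the subsequent interference loss are simplifications.

```python
import torch

def low_rank_denoise(task_vector, rank):
    """Keep only the top-`rank` singular directions of a matrix-shaped task vector."""
    U, S, Vh = torch.linalg.svd(task_vector.float(), full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vh[:rank]
```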
A typical empirical ablation demonstrates additive improvement from sparsity, adaptive scaling, and projection penalties, especially for LoRA-based (PEFT) merges (Zeng et al., 24 Feb 2025, Huang et al., 2024).
4. Merging for Heterogeneous and Lifelong Multimodal Systems
Recent merging approaches address emerging scenarios in multimodal systems: heterogeneity in model architectures and continual or temporal integration of new modalities.
Heterogeneous Models: AdaMMS (Du et al., 31 Mar 2025) generalizes weight-space merging to mismatched layers and modular structure by mapping corresponding modules and restricting merging to aligned components. Unmapped (extra) blocks are carried forward unaltered.
Temporal Merging: The TIME framework (Dziadzio et al., 2024) analyzes continual integration, proposing “init_EMA + deploy_EMA”: experts are recursively merged via exponential moving averages, with simple weight/arithmetic merges sufficing for strong retention and adaptation across many tasks through time.
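A minimal sketch of the recursive EMA merge over a stream of experts; the decay value is an illustrative assumption, and the sketch does not distinguish the initialization and deployment roles named in the framework.

```python
def ema_merge(running_sd, new_expert_sd, decay=0.9):
    """Fold a newly trained expert into the running EMA of model weights."""
    return {k: decay * running_sd[k].float() + (1 - decay) * new_expert_sd[k].float()
            for k in running_sd}

# Usage: start from the first expert, then fold in each new one as it arrives.
# running = experts[0]
# for expert in experts[1:]:
#     running = ema_merge(running, expert)
```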
Extensibility and Data-Free Fusion: Composition approaches such as NaiveMC and DAMC (Chen et al., 2024) allow arbitrary extensibility: to add a new modality, merge only the LLM backbone and plug in the new encoder, with parameter decoupling to minimize interference. These strategies underpin data-free, ongoing expansion of MLLM capabilities.
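A schematic sketch of this composition pattern: shared LLM-backbone parameters are merged (here by simple averaging) while each modality's encoder and projector are kept verbatim and attached at inference. The `is_backbone_key` predicate and the averaging rule are assumptions for illustration, not the exact NaiveMC/DAMC recipe.

```python
def compose_mllm(expert_sds, is_backbone_key):
    """Composition sketch: average shared backbone parameters across modality
    experts; keep each expert's encoder/projector untouched."""
    backbone = {}
    for key in expert_sds[0]:
        if is_backbone_key(key):
            backbone[key] = sum(sd[key].float() for sd in expert_sds) / len(expert_sds)
    encoders = [
        {k: v for k, v in sd.items() if not is_backbone_key(k)} for sd in expert_sds
    ]
    return backbone, encoders

# Example predicate (hypothetical naming convention):
# is_backbone_key = lambda k: k.startswith("language_model.")
```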
5. Empirical Evaluations and Benchmarks
A suite of benchmarks and domain-specific metrics is standard for evaluating merged multimodal models:
- Task Suites: MCUB (commonality reasoning) (Chen et al., 2024), multimodal reasoning (VQA, Geometry, ChartQA, Grounding, OCR) (Wei et al., 26 May 2025), coding+vision (MMCode, InfiBench-V) (Jiang et al., 13 Aug 2025).
- Zero-Shot and Transfer: Merging approaches such as MAM (Sundar et al., 2023) can transfer zero-shot gains to speech/audio from models pretrained on text/image, highlighting the benefits of attention transfer even when labeled data is scarce.
- Catastrophic Forgetting: Locate-then-Merge and MMER report near-perfect (≥99%) retention of previous unimodal abilities after merging, outperforming naive and training-based baselines (Yu et al., 22 May 2025, Li et al., 21 May 2025).
- Performance Trade-offs: Merged models can surpass individual experts on joint or unseen modalities ("complementarity") and often outperform supervised mixture-trained models in benchmark averages (Wei et al., 26 May 2025, Zhang et al., 30 Sep 2025).
| Model/Approach | Benchmark | SOTA/Notable Results |
|---|---|---|
| VisCodex (Jiang et al., 13 Aug 2025) | InfiBench-V, ChartMimic | +1–5 points over no-merge, close to GPT-4o |
| OptMerge (Wei et al., 26 May 2025) | InternVL2.5, Qwen2-VL benchmarks | +1–4 pts over prior merges, matches mixture SFT |
| MMER (Li et al., 21 May 2025) | MCUB, MUSIC-AVQA | +1.9 points MCUB, ≥99% retention |
| PlaM (Wang et al., 12 Jan 2026) | MMStar, MMMU, GQA, MME | +0.8–1.2 points over base |
6. Theoretical Principles and Open Problems
The theoretical basis of multimodal model merging predominantly rests on:
- Linear Mode Connectivity: Most pretrained/fine-tuned models remain in a flat basin, justifying interpolation and arithmetic in weight/task-vector space (Yang et al., 2024).
- NTK/Disentanglement Theory: Modality/task updates that inhabit orthogonal subspaces can be jointly merged without interference (Yang et al., 2024).
- Empirical Indicators: Weight-distance and sign-alignment metrics can predict merge success (Sung et al., 2023); a small sketch of these diagnostics follows this list.
- Limitations & Open Questions: Merging highly heterogeneous models (dissimilar architectures, pretraining, or data) remains challenging; robustly establishing theoretical conditions for when merged models preserve all expert skills (without retraining or data calibration) is unresolved (Yang et al., 2024).
Among outstanding challenges are:
- Data-free fusion across more than four modalities and arbitrarily large parameter spaces.
- Improved automatic selection and tuning of layerwise or chunkwise merge coefficients (Du et al., 31 Mar 2025, Zhang et al., 30 Sep 2025).
- Robustness to adversarial interference and dynamic, input-conditional merges (Yang et al., 2024).
- Extensible approaches for architectures without aligned component mapping (Yang et al., 2024, Du et al., 31 Mar 2025).
7. Applications and Future Directions
Multimodal model merging is now a primary technique for:
- Training-free composition of “omnimodal” LLMs spanning vision, audio, video, code, and geometric inputs (Wei et al., 26 May 2025, Chen et al., 2024).
- Catastrophic forgetting mitigation under sequential task additions (Yu et al., 22 May 2025, Li et al., 21 May 2025, Dziadzio et al., 2024).
- Lifelong/continual learning within decentralized and federated model repositories (Dziadzio et al., 2024).
- Building lightweight, high-accuracy, unified reasoning models (e.g., Tiny-R1V with AMM (Yin et al., 10 Oct 2025)) for edge or real-time multimodal inference.
The combination of merging (for “fast path” fusion of experts), masking (for retention/decoupling), and targeted regularization (for conflict mitigation) is likely to remain foundational as the scale and scope of multimodal models expand. Benchmarks such as MCUB, InfiBench-V, and the FoMo-in-Flux suite will enable sustained, comparable empirical progress. Advances in unsupervised hyperparameter search, layer/chunk-aware merging, and dynamic/soft routers represent active research directions.
Ongoing work aims for robust, scalable, and theory-grounded model merging frameworks that can efficiently support continual multimodal expansion with minimal catastrophic forgetting and maximal synergy (Yang et al., 2024, Wang et al., 12 Jan 2026, Wei et al., 26 May 2025).