Multimodal Model Merging
- Multimodal Model Merging is the algorithmic fusion of deep neural models from diverse modalities to create a unified parameter set that preserves individual expertise.
- It integrates methods like weight-space interpolation, sparsification, and attention alignment to balance modality-specific capabilities and minimize performance interference.
- Emerging strategies address challenges such as catastrophic forgetting and heterogeneous architecture merging, paving the way for scalable, lifelong multimodal systems.
Multimodal Model Merging is the process of algorithmically fusing multiple deep neural models—each trained on different modalities or multimodal tasks—into a single parameter set that expresses the complementarity of the original networks. This approach enables the construction of unified, zero-shot capable large models that leverage the diversity of independently optimized experts for language, vision, audio, and other modalities, without resource-intensive joint fine-tuning or access to the original pretraining data. The field integrates advances in weight-space interpolation, sparsification, attention alignment, and importance-driven fusion, with recent work extending to parameter-efficient, heterogeneous-architecture, and lifelong (temporal) multimodal settings. Key challenges include mitigating catastrophic forgetting, balancing modality-specific and cross-modal capabilities, and reducing performance interference among constituent experts.
1. Foundations and Motivations
Model merging emerged to address practical limitations in large-scale multimodal systems: the cost and inflexibility of joint retraining, the need for rapid integration of modality/domain specialists, and the desire to preserve or superimpose fine-grained expert capabilities within a single deployable backbone. In the multimodal setting, this means combining models such as vision–language, audio–language, or point cloud–language LLMs, or deeper fusions of visual, audio, and coding LLMs into a single MLLM.
The foundational principle is the weight-space connectivity of deep networks: independently fine-tuned models, originating from a shared initialization, typically reside in a connected (flat) loss basin, permitting direct parameter-space interpolation or arithmetic to fuse expertise (Yang et al., 2024). Empirical studies confirm that simple forms of layerwise or vectorial merging suffice to approximately preserve performance on both constituent modalities and joint tasks, even in the absence of further data (Sung et al., 2023, Takmaz et al., 2 Oct 2025, Wang et al., 12 Jan 2026).
2. Core Methodologies
Multimodal model merging now encompasses a rich taxonomy of parameter fusion strategies, most of which can be organized into the following categories:
A. Linear and Arithmetic Merging
The simplest approach linearly interpolates two (or more) parameter sets sharing the same architecture and initialization, $\theta_{\text{merged}} = \lambda\,\theta_A + (1-\lambda)\,\theta_B$, or, more generally, for $N$ experts, $\theta_{\text{merged}} = \sum_{i=1}^{N} \alpha_i \theta_i$ with $\sum_i \alpha_i = 1$. Task arithmetic operates in task-vector space relative to a shared base: $\theta_{\text{merged}} = \theta_{\text{base}} + \sum_{i=1}^{N} \lambda_i \tau_i$, where $\tau_i = \theta_i - \theta_{\text{base}}$. This family underlies classic model soups, task vector addition, and layerwise averaging (Yang et al., 2024, Takmaz et al., 2 Oct 2025, Wei et al., 26 May 2025).
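As a concrete illustration, the following is a minimal sketch of both operations on PyTorch state dicts; it assumes all experts share the base model's architecture and parameter names, and the coefficient values are placeholders rather than recommendations.

```python
def linear_merge(state_dicts, weights):
    """Weighted average of N expert state dicts (model-soup style)."""
    assert abs(sum(weights) - 1.0) < 1e-6, "interpolation weights should sum to 1"
    return {
        key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

def task_arithmetic_merge(base_sd, expert_sds, lambdas):
    """Add scaled task vectors (expert minus base) back onto the shared base."""
    merged = {}
    for key in base_sd:
        task_vectors = [sd[key].float() - base_sd[key].float() for sd in expert_sds]
        merged[key] = base_sd[key].float() + sum(
            lam * tv for lam, tv in zip(lambdas, task_vectors)
        )
    return merged
```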
B. Parameter-Efficient and PEFT Merging
When specialist models are trained via parameter-efficient methods (e.g., LoRA or adapters), merging is confined to the low-rank update matrices, e.g., $W_{\text{merged}} = W_0 + \sum_i \lambda_i B_i A_i$ for LoRA experts with updates $\Delta W_i = B_i A_i$. CoPA-Merging introduces structured pruning of singular vectors, scaling, and cross-task normalization to maintain directionality and balance the contributions of each expert under LoRA adaptation (Zeng et al., 24 Feb 2025).
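A minimal sketch of folding LoRA updates into a shared base linear layer under this formulation; it omits CoPA-Merging's singular-vector pruning and cross-task normalization, and the shapes and coefficients are illustrative assumptions.

```python
def merge_lora_updates(base_weight, lora_pairs, lambdas):
    """Fold several low-rank LoRA updates (B @ A) into one base weight matrix.

    base_weight: (d_out, d_in) frozen base layer weight.
    lora_pairs:  list of (B, A) tuples with shapes (d_out, r) and (r, d_in).
    lambdas:     per-expert scaling coefficients.
    """
    merged = base_weight.clone().float()
    for lam, (B, A) in zip(lambdas, lora_pairs):
        merged = merged + lam * (B.float() @ A.float())
    return merged
```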
C. Alignment and Modular Decoupling
Methods such as DAMC (Chen et al., 2024) and MMER (Li et al., 21 May 2025) decouple subnetworks or generate masks to isolate modality-specific parameters. MMER proceeds in three steps (a sketch follows the list):
- Merge fine-tuned models in the task-vector domain using sign-consistent summation (e.g., TIES merge).
- Construct binary masks for each modality to isolate only coordinates whose directional contribution matches the merged vector and exceeds a significance threshold.
- At inference, an input from a given modality is routed through that modality's mask, avoiding destructive interference between experts.
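The following is a schematic sketch of the sign-consistent merge and per-modality mask construction on flattened task vectors; the significance threshold and the routing comment at the end are illustrative assumptions rather than MMER's exact procedure.

```python
import torch

def sign_consistent_merge(task_vectors):
    """TIES-style merge: average, per coordinate, only the entries whose sign
    agrees with the elected (majority-mass) sign."""
    stacked = torch.stack(task_vectors)          # (N, P) flattened task vectors
    elected = torch.sign(stacked.sum(dim=0))     # sign with the larger total mass
    agree = torch.sign(stacked) == elected       # (N, P) agreement mask
    counts = agree.sum(dim=0).clamp(min=1)
    return (stacked * agree).sum(dim=0) / counts

def modality_mask(task_vector, merged, threshold=1e-4):
    """Binary mask keeping coordinates whose direction matches the merged vector
    and whose magnitude exceeds a significance threshold (an assumed criterion)."""
    same_dir = torch.sign(task_vector) == torch.sign(merged)
    significant = task_vector.abs() > threshold
    return (same_dir & significant).float()

# At inference, an input of modality m would use only its masked slice:
#   theta_m = theta_base + modality_mask(tau_m, merged) * merged
```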
D. Neuron and Layerwise Selective Fusion
Locate-then-Merge and Neuron-Fusion (Yu et al., 22 May 2025) apply neuron-level selection, preserving large-shift (“vision-specialist”) neurons and suppressing diffused shifts that induce forgetting. Layer-level or blockwise merging—such as plateau-guided merging in PlaM (Wang et al., 12 Jan 2026)—restores base LLM weights in late stages to recover language reasoning while preserving cross-modal alignment earlier.
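A minimal sketch of such layer-level restoration, assuming parameter names of the form `model.layers.<idx>.` common in decoder-only LLMs; in practice the plateau index would be chosen from a representation-similarity or performance-plateau curve rather than hard-coded.

```python
import re

def restore_late_layers(merged_sd, base_sd, plateau_start):
    """Keep merged weights for early layers; restore base-LLM weights from
    `plateau_start` onward. Non-layer parameters stay merged."""
    layer_idx = re.compile(r"model\.layers\.(\d+)\.")
    restored = {}
    for key, value in merged_sd.items():
        m = layer_idx.search(key)
        if m is not None and int(m.group(1)) >= plateau_start and key in base_sd:
            restored[key] = base_sd[key].clone()   # late layer: revert to base LLM
        else:
            restored[key] = value                  # early layer / other params: keep merged
    return restored
```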
E. Heterogeneous and Architecture-Aware Merging
AdaMMS merges MLLMs with different architectures (e.g., layer duplication, additional QKV heads) by mapping, linearly interpolating corresponding tensors, and searching for optimal interpolation coefficients via unsupervised consistency on unlabeled data (Du et al., 31 Mar 2025).
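One way to picture the unsupervised coefficient search is the toy routine below, which scores each candidate coefficient by how consistently its merged model answers unlabeled prompts relative to the other candidates. This agreement criterion is an illustrative stand-in, not AdaMMS's exact objective, and `build_merged`/`generate` are assumed callables supplied by the caller.

```python
def search_alpha(build_merged, generate, prompts, alphas=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Select the interpolation coefficient whose merged model answers unlabeled
    prompts most consistently with the other candidate coefficients.

    build_merged(alpha) -> model whose weights are interpolated with coefficient alpha
    generate(model, prompt) -> string response
    """
    answers = {}
    for a in alphas:
        model = build_merged(a)
        answers[a] = [generate(model, p) for p in prompts]

    def consistency(a):
        return sum(
            answers[a][i] == answers[b][i]
            for b in alphas if b != a
            for i in range(len(prompts))
        )

    return max(alphas, key=consistency)
```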
F. Optimization-Based and Chunked Merging
Expert Merging (Zhang et al., 30 Sep 2025) and its extension, Expert Merging++, learn a small set of per-layer (or chunk) coefficients via loss alignment of hidden states/logits with the experts, regularized for stability, often using only a modest calibration set.
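The toy routine below illustrates the idea on a single linear layer: per-expert coefficients are learned by gradient descent so that the merged layer matches an alignment target on a small calibration set. The target used here (the mean of the experts' outputs) and the shapes are simplifying assumptions; Expert Merging's actual objective aligns hidden states/logits of the full model and adds stability regularization.

```python
import torch

def learn_layer_coeffs(base_w, expert_ws, calib_x, steps=200, lr=1e-2):
    """Toy single-layer version of coefficient learning.

    base_w:    (d_out, d_in) base weight matrix.
    expert_ws: list of expert weights with the same shape.
    calib_x:   (n, d_in) calibration inputs.
    """
    task_vectors = torch.stack([w - base_w for w in expert_ws])       # (N, d_out, d_in)
    coeffs = torch.zeros(len(expert_ws), requires_grad=True)
    target = torch.stack([calib_x @ w.T for w in expert_ws]).mean(0)  # (n, d_out)
    opt = torch.optim.Adam([coeffs], lr=lr)
    for _ in range(steps):
        merged_w = base_w + (coeffs.view(-1, 1, 1) * task_vectors).sum(dim=0)
        loss = torch.nn.functional.mse_loss(calib_x @ merged_w.T, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return coeffs.detach()
```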
| Method Name | Key Idea/Formula | Notable Applications |
|---|---|---|
| Linear Interp./TA | Weighted averaging of weights or addition of task vectors | Language/vision fusion (Takmaz et al., 2 Oct 2025, Sung et al., 2023) |
| MMER | Masked, sign-consistent task vector merging & decoupling | Multimodal expansion/retention, catastrophic forgetting mitigation (Li et al., 21 May 2025) |
| Neuron-Fusion | Select large-shift neurons; suppress/restore | Mitigate language drop in MLLMs (Yu et al., 22 May 2025) |
| AdaMMS | Mapping-based blockwise merge + unsupervised coeff. search | Heterogeneous backbone merges (Du et al., 31 Mar 2025) |
| Plateau-guided | Restore LLM weights in plateau (late) layers | Language retention, visual grounding (Wang et al., 12 Jan 2026) |
| Expert Merging(++) | Layerwise/chunkwise coefficients via alignment | Multi-expert LLM/MLLM (Zhang et al., 30 Sep 2025) |
3. Advanced Strategies: Interference, Sparsification, and Regularization
Central challenges in multimodal merging are performance interference, gradient conflicts, and catastrophic forgetting, which are especially pronounced when fusing highly heterogeneous tasks or modalities. Several classes of solutions have been proposed:
- Sparsity and Masking: EMR-Merging (Huang et al., 2024) elects a unified task vector by majority sign, masks out incompatible directions per task, and rescales magnitudes so that each expert's core features are recovered without detrimental interference.
- Optimization-Based Losses: OptMerge (Wei et al., 26 May 2025) denoises task vectors via low-rank projection (see the SVD sketch after this list), then minimizes an interference loss penalizing overlap in parameter-space "energy."
- Adaptive/Regularized Coefficients: AMM (Yin et al., 10 Oct 2025) combines static task importance weights with dynamic compatibility and projection penalties, ensuring per-layer merges remain in directions aligned with specialist tasks.
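As a sketch of the low-rank denoising step referenced above, the function below keeps only the top singular directions of a matrix-shaped task vector; the choice of rank and the omission of the subsequent interference loss are simplifications.

```python
import torch

def low_rank_denoise(task_vector, rank):
    """Keep only the top-`rank` singular directions of a matrix-shaped task vector."""
    U, S, Vh = torch.linalg.svd(task_vector.float(), full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vh[:rank]
```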
A typical empirical ablation demonstrates additive improvement from sparsity, adaptive scaling, and projection penalties, especially for LoRA-based (PEFT) merges (Zeng et al., 24 Feb 2025, Huang et al., 2024).
4. Merging for Heterogeneous and Lifelong Multimodal Systems
Recent merging approaches address emerging scenarios in multimodal systems: heterogeneity in model architectures and continual or temporal integration of new modalities.
Heterogeneous Models: AdaMMS (Du et al., 31 Mar 2025) generalizes weight-space merging to mismatched layers and modular structure by mapping corresponding modules and restricting merging to aligned components. Unmapped (extra) blocks are carried forward unaltered.
Temporal Merging: The TIME framework (Dziadzio et al., 2024) analyzes continual integration, proposing “init_EMA + deploy_EMA”: experts are recursively merged via exponential moving averages, with simple weight/arithmetic merges sufficing for strong retention and adaptation across many tasks through time.
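A minimal sketch of the recursive EMA merge over a stream of experts; the decay value is an illustrative assumption, and the sketch does not distinguish the initialization and deployment roles named in the framework.

```python
def ema_merge(running_sd, new_expert_sd, decay=0.9):
    """Fold a newly trained expert into the running EMA of model weights."""
    return {k: decay * running_sd[k].float() + (1 - decay) * new_expert_sd[k].float()
            for k in running_sd}

# Usage: start from the first expert, then fold in each new one as it arrives.
# running = experts[0]
# for expert in experts[1:]:
#     running = ema_merge(running, expert)
```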
Extensibility and Data-Free Fusion: Composition approaches such as NaiveMC and DAMC (Chen et al., 2024) allow arbitrary extensibility: to add a new modality, merge only the LLM backbone and plug in the new encoder, with parameter decoupling to minimize interference. These strategies underpin data-free, ongoing expansion of MLLM capabilities.
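A schematic sketch of this composition pattern: shared LLM-backbone parameters are merged (here by simple averaging) while each modality's encoder and projector are kept verbatim and attached at inference. The `is_backbone_key` predicate and the averaging rule are assumptions for illustration, not the exact NaiveMC/DAMC recipe.

```python
def compose_mllm(expert_sds, is_backbone_key):
    """Composition sketch: average shared backbone parameters across modality
    experts; keep each expert's encoder/projector untouched."""
    backbone = {}
    for key in expert_sds[0]:
        if is_backbone_key(key):
            backbone[key] = sum(sd[key].float() for sd in expert_sds) / len(expert_sds)
    encoders = [
        {k: v for k, v in sd.items() if not is_backbone_key(k)} for sd in expert_sds
    ]
    return backbone, encoders

# Example predicate (hypothetical naming convention):
# is_backbone_key = lambda k: k.startswith("language_model.")
```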
5. Empirical Evaluations and Benchmarks
A suite of benchmarks and domain-specific metrics is standard for evaluating merged multimodal models:
- Task Suites: MCUB (commonality reasoning) (Chen et al., 2024), multimodal reasoning (VQA, Geometry, ChartQA, Grounding, OCR) (Wei et al., 26 May 2025), coding+vision (MMCode, InfiBench-V) (Jiang et al., 13 Aug 2025).
- Zero-Shot and Transfer: Merging approaches such as MAM (Sundar et al., 2023) can transfer zero-shot gains to speech/audio from models pretrained on text/image, highlighting the benefits of attention transfer even when labeled data is scarce.
- Catastrophic Forgetting: Locate-then-Merge and MMER report near-perfect (≥99%) retention of previous unimodal abilities after merging, outperforming naive and training-based baselines (Yu et al., 22 May 2025, Li et al., 21 May 2025).
- Performance Trade-offs: Merged models can surpass individual experts on joint or unseen modalities ("complementarity") and often outperform supervised mixture-trained models in benchmark averages (Wei et al., 26 May 2025, Zhang et al., 30 Sep 2025).
| Model/Approach | Benchmark | SOTA/Notable Results |
|---|---|---|
| VisCodex (Jiang et al., 13 Aug 2025) | InfiBench-V, ChartMimic | +1–5 points over no-merge, close to GPT-4o |
| OptMerge (Wei et al., 26 May 2025) | InternVL2.5, Qwen2-VL benchmarks | +1–4 pts over prior merges, matches mixture SFT |
| MMER (Li et al., 21 May 2025) | MCUB, MUSIC-AVQA | +1.9 points MCUB, ≥99% retention |
| PlaM (Wang et al., 12 Jan 2026) | MMStar, MMMU, GQA, MME | +0.8–1.2 points over base |
6. Theoretical Principles and Open Problems
The theoretical basis of multimodal model merging predominantly rests on:
- Linear Mode Connectivity: Most pretrained/fine-tuned models remain in a flat basin, justifying interpolation and arithmetic in weight/task-vector space (Yang et al., 2024).
- NTK/Disentanglement Theory: Modality/task updates that inhabit orthogonal subspaces can be jointly merged without interference (Yang et al., 2024).
- Empirical Indicators: Weight-distance and sign-alignment metrics can predict merge success (Sung et al., 2023); a small sketch of these diagnostics follows this list.
- Limitations & Open Questions: Merging highly heterogeneous models (dissimilar architectures, pretraining, or data) remains challenging; robustly establishing theoretical conditions for when merged models preserve all expert skills (without retraining or data calibration) is unresolved (Yang et al., 2024).
Among outstanding challenges are:
- Data-free fusion across more than four modalities and arbitrarily large parameter spaces.
- Improved automatic selection and tuning of layerwise or chunkwise merge coefficients (Du et al., 31 Mar 2025, Zhang et al., 30 Sep 2025).
- Robustness to adversarial interference and dynamic, input-conditional merges (Yang et al., 2024).
- Extensible approaches for architectures without aligned component mapping (Yang et al., 2024, Du et al., 31 Mar 2025).
7. Applications and Future Directions
Multimodal model merging is now a primary technique for:
- Training-free composition of “omnimodal” LLMs spanning vision, audio, video, code, and geometric inputs (Wei et al., 26 May 2025, Chen et al., 2024).
- Catastrophic forgetting mitigation under sequential task additions (Yu et al., 22 May 2025, Li et al., 21 May 2025, Dziadzio et al., 2024).
- Lifelong/continual learning within decentralized and federated model repositories (Dziadzio et al., 2024).
- Building lightweight, high-accuracy, unified reasoning models (e.g., Tiny-R1V with AMM (Yin et al., 10 Oct 2025)) for edge or real-time multimodal inference.
The combination of merging (for “fast path” fusion of experts), masking (for retention/decoupling), and targeted regularization (for conflict mitigation) is likely to remain foundational as the scale and scope of multimodal models expand. Benchmarks such as MCUB, InfiBench-V, and the FoMo-in-Flux suite will enable sustained, comparable empirical progress. Advances in unsupervised hyperparameter search, layer/chunk-aware merging, and dynamic/soft routers represent active research directions.
Ongoing work aims for robust, scalable, and theory-grounded model merging frameworks that can efficiently support continual multimodal expansion with minimal catastrophic forgetting and maximal synergy (Yang et al., 2024, Wang et al., 12 Jan 2026, Wei et al., 26 May 2025).