
Multimodal Model Merging

Updated 2 February 2026
  • Multimodal Model Merging is the algorithmic fusion of deep neural models from diverse modalities to create a unified parameter set that preserves individual expertise.
  • It integrates methods like weight-space interpolation, sparsification, and attention alignment to balance modality-specific capabilities and minimize performance interference.
  • Emerging strategies address challenges such as catastrophic forgetting and heterogeneous architecture merging, paving the way for scalable, lifelong multimodal systems.

Multimodal Model Merging is the process of algorithmically fusing multiple deep neural models—each trained on different modalities or multimodal tasks—into a single parameter set that expresses the complementarity of the original networks. This approach enables the construction of unified, zero-shot capable large models that leverage the diversity of independently optimized experts for language, vision, audio, and other modalities, without resource-intensive joint fine-tuning or access to the original pretraining data. The field integrates advances in weight-space interpolation, sparsification, attention alignment, and importance-driven fusion, with recent work extending to parameter-efficient, heterogeneous-architecture, and lifelong (temporal) multimodal settings. Key challenges include mitigating catastrophic forgetting, balancing modality-specific and cross-modal capabilities, and reducing performance interference among constituent experts.

1. Foundations and Motivations

Model merging emerged to address practical limitations in large-scale multimodal systems: the cost and inflexibility of joint retraining, the need for rapid integration of modality/domain specialists, and the desire to preserve or superimpose fine-grained expert capabilities within a single deployable backbone. In the multimodal setting, this means combining models such as vision–language, audio–language, or point cloud–language LLMs, or deeper fusions of visual, audio, and coding LLMs into a single MLLM.

The foundational principle is the weight-space connectivity of deep networks: independently fine-tuned models, originating from a shared initialization, typically reside in a connected (flat) loss basin, permitting direct parameter-space interpolation or arithmetic to fuse expertise (Yang et al., 2024). Empirical studies confirm that simple forms of layerwise or vectorial merging suffice to approximately preserve performance on both constituent modalities and joint tasks, even in the absence of further data (Sung et al., 2023, Takmaz et al., 2 Oct 2025, Wang et al., 12 Jan 2026).

2. Core Methodologies

Multimodal model merging now encompasses a rich taxonomy of parameter fusion strategies, most of which can be organized into the following categories:

A. Linear and Arithmetic Merging

The simplest approach linearly interpolates two (or more) parameter sets sharing the same architecture and initialization:

$$\theta_{\text{merge}} = \alpha \theta_1 + (1-\alpha)\theta_2, \qquad \alpha \in [0,1]$$

or, more generally, for N experts:

$$\theta_{\text{merge}} = \sum_{i=1}^{N} \lambda_i \theta_i \qquad \text{with } \sum_{i=1}^{N} \lambda_i = 1$$

Task arithmetic operates in task-vector space relative to a shared base:

$$\theta_{\text{merge}} = \theta_{\text{base}} + \sum_{i=1}^{N} \lambda_i (\theta_i - \theta_{\text{base}})$$

This family underlies classic model soups, task vector addition, and layerwise averaging (Yang et al., 2024, Takmaz et al., 2 Oct 2025, Wei et al., 26 May 2025).
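
As a concrete illustration, the sketch below applies task arithmetic to PyTorch state dicts that share the same keys; the function name, coefficient values, and checkpoint names in the usage comment are illustrative rather than taken from any cited method.

```python
import torch

def task_arithmetic_merge(base_state, expert_states, coeffs):
    """Merge expert checkpoints into the base via weighted task vectors:
    theta_merge = theta_base + sum_i lambda_i * (theta_i - theta_base)."""
    merged = {}
    for name, base_param in base_state.items():
        delta = torch.zeros_like(base_param, dtype=torch.float32)
        for expert_state, lam in zip(expert_states, coeffs):
            # Task vector of expert i for this tensor, scaled by its coefficient.
            delta += lam * (expert_state[name].float() - base_param.float())
        merged[name] = (base_param.float() + delta).to(base_param.dtype)
    return merged

# Illustrative usage with two modality experts fine-tuned from the same base:
# base = torch.load("base.pt"); vision = torch.load("vision_expert.pt"); audio = torch.load("audio_expert.pt")
# merged = task_arithmetic_merge(base, [vision, audio], coeffs=[0.5, 0.5])
```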

B. Parameter-Efficient and PEFT Merging

When specialist models are trained via parameter-efficient methods (e.g., LoRA, adapters), merging is confined to the low-rank update matrices:

$$\Delta\theta_{\text{merge}} = \sum_{i=1}^{N} \lambda_i \Delta\theta_i$$

CoPA-Merging introduces structured pruning of singular vectors, scaling, and cross-task normalization to maintain directionality and balance the contributions of each expert under LoRA adaptation (Zeng et al., 24 Feb 2025).
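
A minimal sketch of weighted LoRA-delta merging for a single layer, assuming each expert provides a rank-r pair (A, B); CoPA-Merging's singular-vector pruning and cross-task normalization steps are not reproduced here.

```python
import torch

def merge_lora_deltas(lora_experts, coeffs):
    """Weighted sum of LoRA updates Delta_theta_i = B_i @ A_i for one layer.
    lora_experts: list of (A, B) pairs with A of shape (r, d_in), B of shape (d_out, r)."""
    merged_delta = None
    for (A, B), lam in zip(lora_experts, coeffs):
        delta = B @ A  # reconstruct the full-rank update from the low-rank factors
        merged_delta = lam * delta if merged_delta is None else merged_delta + lam * delta
    return merged_delta

# Hypothetical usage for one projection layer with two rank-8 adapters:
# experts = [(torch.randn(8, 4096), torch.randn(4096, 8)) for _ in range(2)]
# delta = merge_lora_deltas(experts, coeffs=[0.6, 0.4])
# merged_weight = base_weight + delta   # applied to the frozen base weight
```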

C. Alignment and Modular Decoupling

Methods such as DAMC (Chen et al., 2024) and MMER (Li et al., 21 May 2025) decouple subnetworks or generate masks to isolate modality-specific parameters. MMER’s process:

  • Merge fine-tuned models in the task-vector domain using sign-consistent summation (e.g., TIES merge).
  • Construct binary masks $M_i$ for each modality to isolate only coordinates whose directional contribution matches the merged vector and exceeds a significance threshold.
  • At inference, input from modality $i$ is routed through the corresponding mask to avoid destructive interference (see the sketch after this list).
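
The sketch below illustrates a sign-consistent merge plus per-modality mask construction in a simplified form; the threshold value and the exact election rule are assumptions, not MMER's published procedure.

```python
import torch

def build_modality_masks(task_vectors, threshold=1e-4):
    """Sign-consistent merge plus per-modality binary masks (a loose MMER-style sketch).
    task_vectors: list of flattened tensors (theta_i - theta_base), one per modality."""
    stacked = torch.stack(task_vectors)                       # shape (N, P)
    # Elect a dominant sign per coordinate, then sum only the agreeing entries.
    elected_sign = torch.sign(stacked.sum(dim=0))
    agree = torch.sign(stacked) == elected_sign
    merged = torch.where(agree, stacked, torch.zeros_like(stacked)).sum(dim=0)
    # Mask i keeps coordinates where expert i agrees with the merged direction
    # and its contribution is non-negligible.
    masks = [(agree[i] & (stacked[i].abs() > threshold)).float()
             for i in range(stacked.shape[0])]
    return merged, masks

# At inference, input from modality i would use theta_base + masks[i] * merged,
# leaving parameters dominated by other modalities at their base values.
```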

D. Neuron and Layerwise Selective Fusion

Locate-then-Merge and Neuron-Fusion (Yu et al., 22 May 2025) apply neuron-level selection, preserving large-shift (“vision-specialist”) neurons and suppressing diffused shifts that induce forgetting. Layer-level or blockwise merging—such as plateau-guided merging in PlaM (Wang et al., 12 Jan 2026)—restores base LLM weights in late stages to recover language reasoning while preserving cross-modal alignment earlier.
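
A rough illustration of the neuron-level selection idea: revert neurons with small aggregate shift to the base weights and keep large-shift neurons from the multimodally tuned model. The per-row shift statistic and the keep_ratio knob are assumptions for illustration, not the exact Neuron-Fusion criterion.

```python
import torch

def restore_small_shift_neurons(base_weight, tuned_weight, keep_ratio=0.1):
    """Keep the top-shift output neurons (rows) from the tuned model and revert
    the rest to the base weights; keep_ratio is an illustrative knob."""
    shift = (tuned_weight - base_weight).abs().mean(dim=1)    # per-neuron mean shift
    k = max(1, int(keep_ratio * shift.numel()))
    keep = torch.zeros_like(shift, dtype=torch.bool)
    keep[shift.topk(k).indices] = True                        # large-shift "specialist" neurons
    return torch.where(keep.unsqueeze(1), tuned_weight, base_weight)
```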

E. Heterogeneous and Architecture-Aware Merging

AdaMMS merges MLLMs with different architectures (e.g., layer duplication, additional QKV heads) by mapping, linearly interpolating corresponding tensors, and searching for optimal interpolation coefficients via unsupervised consistency on unlabeled data (Du et al., 31 Mar 2025).
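
A schematic version of the unsupervised coefficient search: interpolate aligned tensors, carry unmapped tensors forward unchanged, and score candidate coefficients on unlabeled prompts. Here build_model, generate, and score_fn are hypothetical callables, and the scoring criterion stands in for AdaMMS's actual objective.

```python
def search_alpha(base_state, other_state, unlabeled_prompts,
                 build_model, generate, score_fn,
                 alphas=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Grid-search the interpolation coefficient; score_fn rates how consistent
    the merged model's generations are on unlabeled prompts (schematic)."""
    best_alpha, best_score = None, float("-inf")
    for alpha in alphas:
        merged = {
            # Interpolate only tensors that exist in both models;
            # unmapped (extra) tensors are carried over from the base unchanged.
            k: (alpha * base_state[k] + (1 - alpha) * other_state[k]) if k in other_state else base_state[k]
            for k in base_state
        }
        model = build_model(merged)
        generations = [generate(model, p) for p in unlabeled_prompts]
        score = score_fn(generations)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha
```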

F. Optimization-Based and Chunked Merging

Expert Merging (Zhang et al., 30 Sep 2025) and its extension, Expert Merging++, learn a small set of per-layer (or chunk) coefficients via loss alignment of hidden states/logits with the experts, regularized for stability, often using only a modest calibration set.
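
A rough sketch of learning per-layer merging coefficients against expert outputs on a small calibration set; forward_with is a placeholder for a functional, differentiable forward pass, and the plain MSE-to-expert-logits loss is a simplification of the alignment and regularization terms used by Expert Merging.

```python
import torch

def learn_layer_coefficients(base_layers, expert_layers, calib_batches, forward_with, num_steps=100):
    """base_layers: list of per-layer base tensors; expert_layers: one such list per expert.
    forward_with(merged_layers, batch) -> logits is a user-supplied placeholder."""
    n_experts, n_layers = len(expert_layers), len(base_layers)
    coeffs = torch.zeros(n_experts, n_layers, requires_grad=True)  # one coefficient per expert per layer
    opt = torch.optim.Adam([coeffs], lr=1e-2)
    for _ in range(num_steps):
        for batch, target_logits in calib_batches:  # target logits produced by the relevant expert
            merged = [base_layers[l] + sum(coeffs[e, l] * (expert_layers[e][l] - base_layers[l])
                                           for e in range(n_experts))
                      for l in range(n_layers)]
            loss = torch.nn.functional.mse_loss(forward_with(merged, batch), target_logits)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return coeffs.detach()
```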

| Method Name | Key Idea/Formula | Notable Applications |
| --- | --- | --- |
| Linear Interp./TA | $\theta_{\text{merge}} = \sum_i \lambda_i \theta_i$ | Language/vision fusion (Takmaz et al., 2 Oct 2025, Sung et al., 2023) |
| MMER | Masked, sign-consistent task-vector merging & decoupling | Multimodal expansion/retention, catastrophic forgetting mitigation (Li et al., 21 May 2025) |
| Neuron-Fusion | Select large-shift neurons; suppress/restore the rest | Mitigating language degradation in MLLMs (Yu et al., 22 May 2025) |
| AdaMMS | Mapping-based blockwise merge + unsupervised coefficient search | Heterogeneous backbone merges (Du et al., 31 Mar 2025) |
| Plateau-guided (PlaM) | Restore LLM weights in plateau (late) layers | Language retention, visual grounding (Wang et al., 12 Jan 2026) |
| Expert Merging(++) | Layerwise/chunkwise coefficients learned via alignment | Multi-expert LLM/MLLM merging (Zhang et al., 30 Sep 2025) |

3. Advanced Strategies: Interference, Sparsification, and Regularization

Central challenges in multimodal merging are performance interference, gradient conflicts, and catastrophic forgetting, which are especially pronounced when fusing highly heterogeneous tasks or modalities. Several classes of solutions have been proposed:

  • Sparsity and Masking: EMR-Merging (Huang et al., 2024) elects a unified task vector by majority sign, applies per-task masks to filter out incompatible directions, and rescales magnitudes so that each expert’s core features are recovered without detrimental interference.
  • Optimization-Based Losses: OptMerge (Wei et al., 26 May 2025) denoises task vectors via low-rank projection (sketched after this list), then minimizes an interference loss penalizing overlap in parameter-space “energy.”
  • Adaptive/Regularized Coefficients: AMM (Yin et al., 10 Oct 2025) combines static task importance weights with dynamic compatibility and projection penalties, ensuring per-layer merges remain in directions aligned with specialist tasks.
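
A minimal sketch of low-rank denoising of a task-vector matrix via truncated SVD; the rank is an illustrative hyperparameter, and OptMerge's interference loss is not shown.

```python
import torch

def denoise_task_vector(task_vector_matrix, rank=16):
    """Project a (d_out, d_in) task-vector matrix onto its top singular directions,
    discarding low-energy components assumed to carry noise and conflicts."""
    U, S, Vh = torch.linalg.svd(task_vector_matrix.float(), full_matrices=False)
    r = min(rank, S.numel())
    return (U[:, :r] * S[:r]) @ Vh[:r, :]
```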

A typical empirical ablation demonstrates additive improvement from sparsity, adaptive scaling, and projection penalties, especially for LoRA-based (PEFT) merges (Zeng et al., 24 Feb 2025, Huang et al., 2024).

4. Merging for Heterogeneous and Lifelong Multimodal Systems

Recent merging approaches address emerging scenarios in multimodal systems: heterogeneity in model architectures and continual or temporal integration of new modalities.

Heterogeneous Models: AdaMMS (Du et al., 31 Mar 2025) generalizes weight-space merging to mismatched layers and modular structure by mapping corresponding modules and restricting merging to aligned components. Unmapped (extra) blocks are carried forward unaltered.

Temporal Merging: The TIME framework (Dziadzio et al., 2024) analyzes continual integration, proposing “init_EMA + deploy_EMA”: experts are recursively merged via exponential moving averages, with simple weight/arithmetic merges sufficing for strong retention and adaptation across many tasks through time.
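
A schematic recursion for temporal merging via exponential moving averages; the beta value and update order are illustrative, and TIME's distinction between initialization-time and deployment-time EMA is not modeled here.

```python
def ema_update(merged_state, new_expert_state, beta=0.9):
    """Recursive EMA merge applied as tasks/modalities arrive over time:
    merged <- beta * merged + (1 - beta) * new_expert (beta is illustrative)."""
    return {k: beta * merged_state[k] + (1.0 - beta) * new_expert_state[k]
            for k in merged_state}

# Over a stream of experts fine-tuned from the current merged weights:
# merged = dict(base_state)
# for expert in expert_stream:
#     merged = ema_update(merged, expert, beta=0.9)
```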

Extensibility and Data-Free Fusion: Composition approaches such as NaiveMC and DAMC (Chen et al., 2024) allow arbitrary extensibility: to add a new modality, merge only the LLM backbone and plug in the new encoder, with parameter decoupling to minimize interference. These strategies underpin data-free, ongoing expansion of MLLM capabilities.
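
A rough sketch of data-free composition in the NaiveMC spirit: average only the shared LLM backbone across modality models and keep each modality's encoder/projector untouched. The key-prefix convention for locating backbone parameters is an assumption about the checkpoint layout.

```python
def compose_mllm(modality_models, encoders, backbone_prefix="language_model."):
    """modality_models: list of full state dicts whose LLM backbones were fine-tuned
    from the same base; encoders: per-modality encoder/projector state dicts."""
    backbone_keys = [k for k in modality_models[0] if k.startswith(backbone_prefix)]
    merged_backbone = {}
    for k in backbone_keys:
        # Simple average over the shared backbone parameters.
        merged_backbone[k] = sum(m[k] for m in modality_models) / len(modality_models)
    # Each modality keeps its own encoder and projection; only the backbone is shared.
    return merged_backbone, encoders
```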

5. Empirical Evaluations and Benchmarks

A suite of benchmarks and domain-specific metrics is standard for evaluating merged multimodal models:

| Model/Approach | Benchmark(s) | SOTA/Notable Results |
| --- | --- | --- |
| VisCodex (Jiang et al., 13 Aug 2025) | InfiBench-V, ChartMimic | +1–5 points over no-merge, close to GPT-4o |
| OptMerge (Wei et al., 26 May 2025) | InternVL2.5, Qwen2-VL benchmarks | +1–4 points over prior merges, matches mixture SFT |
| MMER (Li et al., 21 May 2025) | MCUB, MUSIC-AVQA | +1.9 points on MCUB, ≥99% retention |
| PlaM (Wang et al., 12 Jan 2026) | MMStar, MMMU, GQA, MME | +0.8–1.2 points over base |

6. Theoretical Principles and Open Problems

The theoretical basis of multimodal model merging predominantly rests on:

  • Linear Mode Connectivity: Most pretrained/fine-tuned models remain in a flat basin, justifying interpolation and arithmetic in weight/task-vector space (Yang et al., 2024); see the formulation after this list.
  • NTK/Disentanglement Theory: Modality/task updates that inhabit orthogonal subspaces can be jointly merged without interference (Yang et al., 2024).
  • Empirical Indicators: Weight distance and sign-alignment metrics can predict merge success (Sung et al., 2023).
  • Limitations & Open Questions: Merging highly heterogeneous models (dissimilar architectures, pretraining, or data) remains challenging; robustly establishing theoretical conditions for when merged models preserve all expert skills (without retraining or data calibration) is unresolved (Yang et al., 2024).
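
Stated informally, the connectivity and orthogonality conditions underlying these arguments can be written as follows (a standard, generic formulation, not a result specific to any cited paper):

```latex
% Linear mode connectivity between two models \theta_1, \theta_2 fine-tuned
% from a shared initialization: the loss barrier along the path is small.
\max_{\alpha \in [0,1]} \mathcal{L}\big(\alpha \theta_1 + (1-\alpha)\theta_2\big)
  \;\le\; \max\big(\mathcal{L}(\theta_1), \mathcal{L}(\theta_2)\big) + \epsilon,
  \qquad \epsilon \approx 0.

% Approximate orthogonality of task vectors \tau_i = \theta_i - \theta_{\text{base}},
% under which their addition causes little mutual interference.
\langle \tau_i, \tau_j \rangle \approx 0 \quad \text{for } i \neq j.
```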

Among outstanding challenges are:

  • Merging models with dissimilar architectures, pretraining regimes, or data.
  • Establishing theoretical conditions under which a merged model preserves all expert skills without retraining or calibration data.
  • Mitigating catastrophic forgetting and cross-modal interference as the number of experts grows.
  • Supporting continual (lifelong) integration of new modalities over time.

7. Applications and Future Directions

Multimodal model merging is now a primary technique for:

  • Constructing unified MLLMs from independently trained language, vision, audio, and other experts without joint retraining or access to the original data.
  • Data-free expansion of existing MLLMs to new modalities or domains.
  • Continual and temporal integration of specialists in lifelong-learning settings.
  • Retaining base-language capabilities while adding cross-modal skills.

The combination of merging (for “fast path” fusion of experts), masking (for retention/decoupling), and targeted regularization (for conflict mitigation) is likely to remain foundational as the scale and scope of multimodal models expand. Benchmarks such as MCUB, InfiBench-V, and the FoMo-in-Flux suite will enable sustained, comparable empirical progress. Advances in unsupervised hyperparameter search, layer/chunk-aware merging, and dynamic/soft routers represent active research directions.

Ongoing work aims for robust, scalable, and theory-grounded model merging frameworks that can efficiently support continual multimodal expansion with minimal catastrophic forgetting and maximal synergy (Yang et al., 2024, Wang et al., 12 Jan 2026, Wei et al., 26 May 2025).
