OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging

Published 26 May 2025 in cs.AI | (2505.19892v2)

Abstract: Foundation models update slowly due to resource-intensive training, whereas domain-specific models evolve rapidly between releases. Model merging seeks to combine multiple expert models into a single, more capable model, reducing storage and serving costs while supporting decentralized development. Despite its potential, previous studies have primarily focused on merging visual classification models or LLMs for code and math tasks. Recently, Multimodal LLMs (MLLMs) that extend LLMs through large-scale multimodal training have gained traction. However, there lacks a benchmark for model merging research that clearly divides the tasks for MLLM training and evaluation. In this paper, (i) we introduce a model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, studying both LoRA and full fine-tuning models. Moreover, we explore how model merging can combine different modalities (e.g., vision-language, audio-language, and video-LLMs), moving toward the Omni-LLM. (ii) We implement 10 model merging algorithms on the benchmark. Furthermore, we propose a novel method that removes noise from task vectors and robustly optimizes the merged vector based on a loss defined over task vector interactions, achieving an average performance gain of 2.48%. (iii) We find that model merging offers a promising way for building improved MLLMs without requiring training data. Our results also demonstrate that the complementarity among multiple modalities outperforms individual modalities.

Authors (10)

Summary

The paper presents OptMerge, a data-free model merging framework that employs SVD-based truncation and optimized initialization to integrate multimodal LLMs without retraining.
It reports an average performance gain of 2.48% over prior merging methods (the figure given in the abstract) on tasks such as VQA, geometry, and chart analysis, while lowering computational costs.
Empirical findings underscore the importance of selecting experts with minimal parameter drift to ensure stable, noise-resistant task vector integration across modalities.

OptMerge: Unifying Multimodal LLM Capabilities via Model Merging

Introduction and Motivation

The paper presents OptMerge, a framework for data-free model merging of Multimodal LLMs (MLLMs). The motivation is a mismatch in update cycles: general-purpose foundation models update slowly because training is resource-intensive, while domain-specific expert models advance rapidly between releases. Model merging is posited as a scalable way to combine complementary, fine-tuned expert models, unifying multiple capabilities or modalities in a single deployment. It requires neither retraining nor access to the original training data, which reduces both the computational burden and the cost of serving diverse applications.

Model Merging Algorithms and Task Vector Theory

The authors provide a systematic categorization of static, data-free model merging algorithms. Four principal types are enumerated:

Linear Interpolation Methods: Direct averaging or arithmetic combination of task vectors, which are parameter differences between fine-tuned models and their shared base.
Sparsification-based Methods: Reduce redundancy or interference in task vectors via sparsity constraints and selective parameter inclusion.
SVD-based Methods: Exploit low-rank structure, orthogonalize task vectors, and isolate dominant singular components to mitigate task interference.
Optimization-based Methods: Frame the merging procedure as a parameter-space optimization problem, typically minimizing layer-wise interference metrics over task vectors using gradient descent.
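
The linear interpolation family above can be illustrated with a minimal NumPy sketch. It assumes each model is a plain dictionary of per-layer arrays; the function names, the `scale` coefficient, and the toy values are hypothetical choices for illustration, not the paper's implementation.

```python
import numpy as np

def task_vector(finetuned, base):
    """Per-layer parameter difference between an expert and the shared base."""
    return {name: finetuned[name] - base[name] for name in base}

def weight_average(experts):
    """Weight Averaging: element-wise mean of the expert parameters."""
    return {name: np.mean([e[name] for e in experts], axis=0) for name in experts[0]}

def task_arithmetic(base, experts, scale=1.0):
    """Task Arithmetic: add the scaled sum of task vectors back onto the base.

    `scale` is an assumed merging coefficient, not a value from the paper.
    """
    vectors = [task_vector(e, base) for e in experts]
    return {name: base[name] + scale * sum(v[name] for v in vectors) for name in base}

# Toy example: one 2x2 "layer" and two experts (hypothetical values).
base = {"layer0": np.zeros((2, 2))}
expert_a = {"layer0": np.array([[1.0, 0.0], [0.0, 0.0]])}
expert_b = {"layer0": np.array([[0.0, 0.0], [0.0, 2.0]])}

merged_avg = weight_average([expert_a, expert_b])
merged_ta = task_arithmetic(base, [expert_a, expert_b], scale=0.5)
```

With two experts and `scale=0.5`, task arithmetic coincides with weight averaging, which is why the two methods are grouped in the same family.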

A novel contribution here is a theoretical upper bound relating merging loss to the learning rate and number of fine-tuning iterations ($\mathcal{O}(\eta T)$), formalizing the notion that excessive fine-tuning increases parameter drift and complicates task vector integration.
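
A rough sketch of how a bound of this form can arise is given below. It is not the paper's proof and rests on assumptions that do not appear in this summary: plain SGD with learning rate $\eta$ for $T$ steps, stochastic gradients $\boldsymbol{g}_{i,t}$ bounded in norm by a hypothetical constant $G$, a merged vector of the task-arithmetic form with coefficient $\lambda \ge 0$ over $K$ experts, and a $\mathcal{C}_i$-Lipschitz loss $\mathcal{L}_i$.

```latex
\begin{aligned}
\boldsymbol{\tau}_i &= \boldsymbol{\theta}_i - \boldsymbol{\theta}_0
  = -\eta \sum_{t=1}^{T} \boldsymbol{g}_{i,t},
\qquad
\|\boldsymbol{\tau}_i\| \le \eta \sum_{t=1}^{T} \|\boldsymbol{g}_{i,t}\| \le \eta T G, \\
\boldsymbol{\tau}_m &= \lambda \sum_{j=1}^{K} \boldsymbol{\tau}_j,
\qquad
\|\boldsymbol{\tau}_m - \boldsymbol{\tau}_i\|
  \le \lambda \sum_{j=1}^{K} \|\boldsymbol{\tau}_j\| + \|\boldsymbol{\tau}_i\|
  \le (\lambda K + 1)\,\eta T G, \\
\left|\mathcal{L}_i(\boldsymbol{\theta}_0 + \boldsymbol{\tau}_m) - \mathcal{L}_i(\boldsymbol{\theta}_0 + \boldsymbol{\tau}_i)\right|
  &\le \mathcal{C}_i \,\|\boldsymbol{\tau}_m - \boldsymbol{\tau}_i\|
  \le \mathcal{C}_i (\lambda K + 1)\,\eta T G
  = \mathcal{O}(\eta T).
\end{aligned}
```

Under these assumptions, the gap between the merged model's loss and the expert's loss on task $i$ grows at most linearly in $\eta T$, which matches the intuition that longer or more aggressive fine-tuning makes experts harder to merge.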

Figure 1: Task vector magnitude distribution in InternVL2.5 reveals right-skewed fine-tuning changes, vital for successful model merging.

Further, empirical analysis using Frobenius norms per layer corroborates that effective merging requires closely clustered expert models in parameter space. For low-rank adaptation approaches like LoRA, task vectors cluster in restricted subspaces, amplifying the need for rank-aware merging strategies.
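
The per-layer measurement described above can be sketched as follows. The dictionary layout, layer names, and perturbation size are assumptions for illustration; the final check uses the identity that the squared Frobenius norm equals the sum of squared singular values.

```python
import numpy as np

def layer_frobenius_norms(finetuned, base):
    """Frobenius norm of the task vector at each layer."""
    return {
        name: float(np.linalg.norm(finetuned[name] - base[name], ord="fro"))
        for name in base
    }

rng = np.random.default_rng(0)
base = {f"layer{i}": rng.normal(size=(4, 3)) for i in range(2)}
# Hypothetical expert: the base plus a small perturbation per layer.
expert = {name: w + 0.01 * rng.normal(size=w.shape) for name, w in base.items()}

norms = layer_frobenius_norms(expert, base)

# Sanity check: squared Frobenius norm equals the sum of squared singular values.
tau = expert["layer0"] - base["layer0"]
sigma = np.linalg.svd(tau, compute_uv=False)
fro_sq = float(np.linalg.norm(tau, ord="fro") ** 2)
sv_sq = float(np.sum(sigma ** 2))
```

Comparing `norms` across candidate experts gives a quick, data-free indication of how far each one has drifted from the base.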

OptMerge Methodology: Robust Optimization of Task Vectors

The key contribution is OptMerge, designed to address the instability and noise amplification observed in existing data-free model merging workflows. Two instantiations are described:

Full Fine-Tuned Models: OptMerge executes SVD on centered task vectors, truncating noise-dominated singular components. The optimization loss leverages only dominant components, analogous to PCA, thereby preserving shared knowledge and reducing destructive interference.
LoRA Fine-Tuned Models: OptMerge circumvents null-space gradient sparsity by initializing the merged vector with the average of the task vectors and switching to SGD, whose implicit regularization constrains the optimization. Truncated SVD is applied directly to each task vector, without centering, which keeps vector norms stable and avoids shortcuts that inflate parameter magnitudes.
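
The truncation step in both variants can be sketched with NumPy as below. This is an illustration under assumptions rather than the authors' code: the rank `k`, the helper names, and the toy data are hypothetical, and because this summary does not say how the subtracted mean is used afterwards, the sketch simply returns it.

```python
import numpy as np

def truncate_svd(matrix, k):
    """Best rank-k approximation of a matrix via truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]

def denoise_task_vectors(task_vectors, k, center=True):
    """Optionally subtract the mean task vector, then keep the top-k
    singular components of each (centered) task vector."""
    mean = np.mean(task_vectors, axis=0) if center else np.zeros_like(task_vectors[0])
    return [truncate_svd(tau - mean, k) for tau in task_vectors], mean

# Toy data: rank-1 "signal" plus small noise for each of three tasks.
rng = np.random.default_rng(0)
signals = [np.outer(rng.normal(size=6), rng.normal(size=5)) for _ in range(3)]
task_vectors = [s + 1e-3 * rng.normal(size=s.shape) for s in signals]

# LoRA-style variant: direct truncation without centering.
denoised, _ = denoise_task_vectors(task_vectors, k=1, center=False)
# Full fine-tuning variant: center first, then truncate.
centered_denoised, mean = denoise_task_vectors(task_vectors, k=1, center=True)
```

Keeping only the leading components is the same move as keeping principal components in PCA: directions with small singular values are treated as noise and discarded.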

Figure 2: When optimizing the interference loss, merged task vectors can take degenerate shortcuts by increasing their norm—OptMerge regularizes this optimization.

Ablation analysis shows that initializing merge vectors and applying low-rank truncation offer a 4.65% average performance boost over previous state-of-the-art optimization-based merging algorithms.
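
To make the roles of initialization and the null space concrete, here is a toy sketch of optimizing a merged vector against an interference loss. The loss form $\sum_i \|(\boldsymbol{\tau}_m - \boldsymbol{\tau}_i)\boldsymbol{\tau}_i^\top\|_F^2 / \|\boldsymbol{\tau}_i\|_F^2$ is an assumption made for illustration (this summary only says the loss is defined over task vector interactions), and the update is plain full-batch gradient descent with a hypothetical learning rate and step count.

```python
import numpy as np

def interference_loss(tau_m, task_vectors):
    """Assumed loss: sum_i ||(tau_m - tau_i) tau_i^T||_F^2 / ||tau_i||_F^2."""
    return sum(
        np.linalg.norm((tau_m - t) @ t.T, "fro") ** 2 / np.linalg.norm(t, "fro") ** 2
        for t in task_vectors
    )

def interference_grad(tau_m, task_vectors):
    """Analytic gradient of the assumed loss with respect to tau_m."""
    return sum(
        2.0 * (tau_m - t) @ t.T @ t / np.linalg.norm(t, "fro") ** 2
        for t in task_vectors
    )

def merge(task_vectors, lr=0.1, steps=200):
    """Gradient descent starting from the averaged task vector."""
    tau_m = np.mean(task_vectors, axis=0)
    for _ in range(steps):
        tau_m = tau_m - lr * interference_grad(tau_m, task_vectors)
    return tau_m

# Toy LoRA-like setting: three rank-2 task vectors of shape (4, 8).
rng = np.random.default_rng(0)
task_vectors = [rng.normal(size=(4, 2)) @ rng.normal(size=(2, 8)) for _ in range(3)]

init = np.mean(task_vectors, axis=0)
merged = merge(task_vectors)
loss_before = interference_loss(init, task_vectors)
loss_after = interference_loss(merged, task_vectors)

# Directions orthogonal to every task vector's row space receive no gradient.
_, _, vt = np.linalg.svd(np.vstack(task_vectors))
null_basis = vt[6:]  # 3 tasks x rank 2 = 6 spanned directions out of 8
null_grad = interference_grad(init, task_vectors) @ null_basis.T
```

In this toy setting `null_grad` is numerically zero, so whatever the merged vector contains in those directions at initialization is never changed by the optimizer. That is one way to see why the starting point matters for low-rank task vectors.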

Benchmark Construction and Evaluation Protocol

The authors introduce a model merging benchmark covering five MLLM capabilities: VQA, Geometry, Chart, OCR, and Grounding, each backed by at least 100k public samples for supervised fine-tuning. Two base models (InternVL2.5 for instruction-following and Qwen2-VL-7B for domain-general tasks) are selected, with both LoRA and full fine-tuning variants.

Downstream evaluation benchmarks are chosen to target specific subtasks rather than general holistic capability assessments, enabling nuanced analysis post-merging. Modality merging further extends the approach to include vision, audio, and video-language encoders, demonstrating the capacity for integrating tri-modal information streams.

Empirical Findings and Strong Numerical Claims

Experimental results yield several substantive findings:

Model merging can outperform expert models and mixture multi-task training. For example, the OptMerge-merged Qwen2-VL achieves geometry scores of 51.05 and 40.79, compared to individual experts' 42.50 and 28.95, and chart scores of 79.76 against a 61.08 baseline.
Optimal merging does not coincide with maximum fine-tuning steps: excessive fine-tuning leads to degraded merging performance due to parameter drift, as visualized by convergence plots and method comparisons (Figures 3 and 4).
Modality merging achieves higher average accuracy (up to 67.00%) than naively composing or averaging activations across independently trained vision/audio/video models.
OptMerge maintains parameter norm stability, preventing collapse in merged LLMs and successfully incorporating knowledge from disparate domains and modalities.

Figure 3: CLIP accuracy across eight datasets converges with training; fine-tuning does not guarantee better merging.

Figure 4: Across eight datasets, merging accuracy rises then falls with more fine-tuning, contradicting standard assumptions about model quality.

Computational cost analysis shows substantial savings: for 7B models, merging completes in ≤4 hours with ≤22GB of memory, compared to ~25 hours and >240GB for multi-task mixture training.
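
As a quick check on the size of these savings, the ratios implied by the quoted figures can be worked out directly. The numbers below are the bounds stated above, so the results are approximate lower bounds on the savings.

```python
# Figures quoted above for 7B models.
merge_hours, merge_mem_gb = 4, 22      # upper bounds for model merging
train_hours, train_mem_gb = 25, 240    # approximate figures for mixture training

time_ratio = train_hours / merge_hours        # 6.25
memory_ratio = train_mem_gb / merge_mem_gb    # about 10.9
```

On these figures, merging is roughly 6x faster and needs over 10x less memory than mixture training.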

Practical and Theoretical Implications

OptMerge demonstrates that data-free merging of expert MLLMs can outstrip multi-task mixture training in both effectiveness and computational tractability. This offers a flexible paradigm for continual capability integration and decentralized development, as open-source communities routinely contribute domain-specialized checkpoints. The benchmark provides a robust platform for fair, replicable evaluation of future model merging algorithms.

Theoretical results clarify the constraints introduced by parameter drift, informing best practices: select expert models with minimal fine-tuning divergence from base parameters, and prioritize less aggressively fine-tuned models for merging.

On a broader scope, this encourages the pursuit of omni-modal LLMs assembled without re-training, where capabilities are modular, extensible, and privacy-preserving. Open questions remain regarding generalization on multilingual data, further scaling, and reasoning-focused MLLM merging.

Conclusion

OptMerge unifies MLLM capabilities and modalities through principled, data-free model merging, introducing a stable optimization approach that denoises task vectors, regularizes parameter norms, and consistently achieves superior downstream performance. The main takeaways include the efficacy of low-rank approximation, initialization strategies in optimization, and the importance of judicious expert model selection. Future directions involve scaling to larger checkpoints, deeper investigations into reasoning and multilingual integration, and enhanced benchmarks for evaluating omni-model alignment.

Explain it Like I'm 14

What is this paper about?

This paper looks at a clever way to build powerful AI models without doing lots of expensive training. The idea is called “model merging.” Instead of training one giant model on everything, you take several smaller “expert” models (each good at a specific skill) and combine them into one model that can do many things. The paper focuses on merging Multimodal LLMs (MLLMs) — AI that can understand text plus other inputs like images, audio, and video.

What questions does the paper try to answer?

Can we reliably merge different expert AI models into a single, more capable model without using training data?
What’s a fair way to test and compare merging methods for multimodal models?
How does the amount of fine-tuning (how much the expert models were changed from the original base model) affect the success of merging?
Can merging help combine different kinds of inputs (like images, sounds, and videos) into one “omni” model?
Is there a new merging method that works better than existing ones?

How did the researchers approach the problem?

Think of each expert model like a teammate with a special skill: one is great at answering questions about pictures (VQA), another understands charts, another reads text inside images (OCR), and another can point to objects in images (grounding). The goal is to combine their skills without retraining everything from scratch.

Here’s the approach in everyday terms:

Base model and expert models:
- Start with a base model (the original). Each expert is the base model that was tweaked (fine-tuned) for a specific task.
- A “task vector” is the difference between the expert model and the base model — like a recipe listing the changes that made the expert good at its specialty.
Benchmark they built:
- They created a clear test setup with five skills: VQA, Geometry (math diagrams), Chart understanding, OCR-based Q&A, and Grounding (finding objects in images).
- They gathered large, public datasets (100,000+ examples per skill) to fine-tune experts and to evaluate merging fairly.
- They also tried merging different modalities (vision-language, audio-language, video-language) into one “omni” model.
Merging methods they evaluated:
- Linear mixing: Simply add or average the “recipes” (task vectors).
- Sparsification: Drop redundant changes to reduce conflicts.
- SVD (low-rank) techniques: Compress changes to keep only the most important directions (like summarizing a long essay into key points).
- Optimization methods: Treat merging like a puzzle and use algorithms to minimize conflicts between tasks.
Their new method: OptMerge
- It cleans up noisy parts of the task vectors and focuses on the most important changes using SVD (a mathematical way to find key directions in data).
- It initializes the merged model smartly and uses stable optimization (like careful, steady adjustments) so the model doesn’t “explode” or lose its language skills.
- It adapts differently depending on how the experts were fine-tuned:
  - Full fine-tuning: Many parameters changed, so OptMerge denoises these changes.
  - LoRA fine-tuning: Only small, low-rank add-ons were used, so OptMerge keeps updates balanced and stable.
A key insight (simple theory):
- The success of merging depends on how much the expert models drifted from the base. If you use a big learning rate or train for too long, the experts move far away and become harder to merge.
- Small changes are easier to combine (but might slightly lower the expert’s solo performance). It’s about finding the balance.

What did they find and why does it matter?

Merging works — often better than training on mixed data:
- Across tasks like geometry, charts, OCR, and grounding, the merged models matched or beat models trained on combined datasets.
- Their OptMerge method achieved the best average results, with about a 2.48% improvement over a strong baseline.
Merging different modalities helps:
- Combining vision, audio, and video models led to better results than using any single modality alone.
- The merged “omni” model even outperformed some methods that combine modalities at inference time.
It’s fast and cheaper:
- Merging took hours and used much less GPU memory compared to days-long training runs with huge memory demands.
- No training data is needed for merging — you just need the model files.
Real-world test:
- They merged actual community models from Hugging Face (made by different people for different tasks and languages).
- The merged models were more robust and performed better on average than the individual experts.
Important tip:
- Don’t over-fine-tune the experts if you plan to merge them later. Models that stayed closer to the base were easier to merge and gave better combined performance.

What’s the impact?

Practical: Teams can build specialized models separately, then combine them later — saving time, money, and data access hassles.
Scalable: You can quickly create a strong multitask or multimodal model without retraining on massive datasets.
Community-friendly: Open-source models from different creators can be merged into better systems.
Future of “omni” AI: Merging is a promising path to AI that understands text, images, audio, and video together — like a single assistant that can watch a clip, listen to a sound, and read on-screen text.
Guidance for developers: If you plan to merge, keep fine-tuning mild and clean (limit parameter drift). OptMerge shows how to denoise and stabilize merges for best results.

Quick recap

Goal: Combine multiple expert MLLMs into one powerful model without retraining.
Solution: A new benchmark, tests of 10 merging methods, and OptMerge — a stable, denoising, optimization-based merger.
Results: Merging often beats mixed-data training, works across modalities, saves compute, and handles real community models.
Impact: Faster, cheaper, and more collaborative AI development — moving toward truly multimodal “omni” models.

Glossary

Activation averaging: Averaging intermediate neural activations from different models or modalities during inference to compose capabilities. "NaiveMC~\citep{chen2024model} performs simple activation averaging, while DAMC~\citep{chen2024model} decouples parameters during training to reduce modal interference."
Adam optimizer: A stochastic gradient-based optimizer with adaptive learning rates and momentum used for parameter optimization. "Using the Adam optimizer, we obtain the merged vector $\boldsymbol{\tau}_{m,l}$ , which minimizes interference with task vectors on multiple tasks"
Audio-LLM: A model that processes audio inputs and generates or understands text, typically with an audio encoder and an LLM. "The audio-LLM adopts BEATs-Iter3+~\citep{chen2023beats} as the audio encoder, with a Q-Former as the connector."
Audio-VQA: Audio-Visual Question Answering; tasks requiring reasoning over visual and audio signals in videos. "For Omni-LLMs, we assess Audio-VQA, which requires multimodal understanding and spatiotemporal reasoning about visual objects, sounds, and their relationships in videos."
CLIP-ViT-L-336px: A specific vision encoder architecture from CLIP using a large ViT at 336px resolution. "The vision-LLM uses CLIP-ViT-L-336px~\citep{radford2021learning} as the image encoder, paired with an MLP projection as the connector."
Connector: A projection or module that maps encoder outputs into the LLM’s input space for multimodal integration. "The vision-LLM uses CLIP-ViT-L-336px~\citep{radford2021learning} as the image encoder, paired with an MLP projection as the connector."
Data-free methods: Merging approaches that do not require access to training or evaluation data to combine models. "Data-free methods merge fine-tuned models without requiring additional data."
DARE: A merging technique that drops and rescales task vectors to reduce interference and redundancy. "DARE~\citep{yu2024language} randomly drops redundant task vectors and rescales the remaining ones to mitigate parameter interference."
Decorrelation: Reducing correlations between task-specific parameter changes to mitigate interference during merging. "It then reduces task interference through decorrelation."
Dynamic merging (aka MoE-like methods): Composition approaches that route inputs to specialized modules at inference, typically requiring routers and larger storage. "Dynamic merging (aka MoE-like methods)~\citep{tang2024merging,huang2024emr,lu2024twin} requires the dynamic loading of task-specific modules based on test inputs, involving training routers or prior knowledge."
Frobenius norm: A matrix norm defined as the square root of the sum of the squares of all entries; here used for task vector magnitudes. "The Frobenius norm equals the sum of squared singular values $\|\boldsymbol{\tau}_{i,l}\|_F^2 = \sum_{j=1}^{r}\sigma_j^2$ ."
Implicit regularization: The tendency of certain optimization procedures (like SGD) to favor solutions with particular generalization properties without explicit penalties. "Notably, SGD provides implicit regularization~\citep{smith2021on,wang2022does}, constraining task vector optimization and navigating flat regions induced by null spaces."
Iso-C: An isotropic merging method that flattens singular spectra to improve alignment of components across tasks. "Iso-C~\citep{marczak2025no} proposes an isotropic merging framework that flattens the singular value spectrum of task matrices, and enhances alignment between singular components of task-specific and merged matrices."
LanguageBind: A video encoder framework for mapping video inputs into the LLM space. "The video-LLM employs LanguageBind~\citep{zhu2023languagebind} as the video encoder."
Layer-wise interference: Conflict between task vectors at a given layer that can degrade multi-task performance after merging. "They define layer-wise interference between the merged vector and task vector as $\boldsymbol{\tau}_{m,l}-\boldsymbol{\tau}_{i,l}$ for task $i$ at layer $l$ ."
Linear connectivity: A property where two models can be connected by a straight-line (linearly interpolated) path of low loss in parameter space, implying mergeability. "The small task vector magnitudes suggest that fine-tuned models and base models exist in adjacent regions of the loss landscape with linear connectivity~\citep{wu2023pi}, facilitating effective model merging~\citep{ortiz-jimenez2023task}."
Linear interpolation methods: Techniques that combine models or task vectors by arithmetic operations like averaging and summation. "Linear interpolation methods: Weight Averaging~\citep{wortsman2022model} simply averages the weights of models fine-tuned on different tasks."
Linear subspace: A vector space structure indicating task vectors lie in a low-dimensional linear manifold related to training data. "WUDI Merging~\cite{cheng2025whoever} proves that task vectors $\boldsymbol{\tau}$ form an approximate linear subspace of the fine-tuning data $\boldsymbol{x}$ ."
Lipschitz continuous: A function whose differences are bounded by a constant times the input difference; used in loss bounds. "The loss on task $i$ is denoted by $\mathcal{L}_i(\Theta)$ , which is $\mathcal{C}_i$ -Lipschitz continuous."
LoRA: Low-Rank Adaptation; fine-tuning method that inserts low-rank updates into pretrained weights. "We choose two types of vision-LLMs: InternVL2.5 and Qwen2-VL, providing both LoRA and full fine-tuning checkpoints."
Loss landscape: The surface of the optimization objective over the parameter space, influencing mergeability and training behavior. "The small task vector magnitudes suggest that fine-tuned models and base models exist in adjacent regions of the loss landscape with linear connectivity~\citep{wu2023pi}, facilitating effective model merging~\citep{ortiz-jimenez2023task}."
Low-rank approximation: Representing a matrix with few singular components to reduce noise and redundancy. "To address this issue, we propose reducing inter-task interference through low-rank approximation."
Mixture training: Joint training on multiple datasets/tasks simultaneously, often used as a baseline for merged performance. "Our empirical results suggest that model merging can outperform mixture training"
Modality encoder: A specialized encoder for a particular input modality (vision, audio, video) that feeds into an LLM. "Moreover, most existing MLLMs specialize in dual modalities, and incorporating new modality encoders requires re-training on new modality-text data."
Multimodal LLMs (MLLMs): LLMs extended with non-text encoders and training to process multiple modalities. "Recently, Multimodal LLMs (MLLMs) that extend LLMs with broader capabilities through large-scale multimodal training have gained traction."
Null space: The subspace in which a linear operator maps vectors to zero; gradients vanish along these directions in low-rank settings. "When optimizing $\boldsymbol{\tau}_{m,l}$ , gradients become effective only in directions corresponding to non-zero singular values of $\boldsymbol{\tau}_{i,l}$ , while approaching zero in other directions (null space)."
Omni model alignment: Aligning a single model across many modalities and capabilities to function as an omni-model. "Our empirical results suggest that model merging can outperform mixture training, offering a viable path to omni-model alignment and a scalable approach to developing MLLMs with reduced computational cost and training time."
Omni-LLM: A model that unifies multiple modalities (vision, audio, video) via a shared language backbone. "Moreover, we explore how model merging can combine different modalities (e.g., vision-language, audio-language, and video-LLMs), moving toward the Omni-LLM."
Optimization-based methods: Merging techniques that explicitly optimize a merged task vector via gradient descent on a defined loss. "Optimization-based methods: WUDI Merging~\cite{cheng2025whoever} proves that task vectors $\boldsymbol{\tau}$ form an approximate linear subspace of the fine-tuning data $\boldsymbol{x}$ ."
Orthogonal matrices: Matrices whose columns (and rows) are orthonormal, often used to decorrelate or rotate parameter spaces. "The method seeks orthogonal matrices $V_{\bot}$ and $U_{\bot}$ to reconstruct the parameters of the merged model."
Orthogonalization: Making vectors or components orthogonal to reduce interference, typically via rotations or projections. "TSV merging excels in modality merging because its orthogonalization mitigates modal conflicts"
PCA (Principal Components Analysis): A dimensionality reduction technique selecting components with maximal variance. "By truncating singular values, we preserve critical features $V_{1:k}^{\top}$ , which is similar to selecting principal components in Principal Components Analysis (PCA)~\citep{abdi2010principal}."
Parameter drift: The degree to which fine-tuning moves parameters away from the base model, affecting mergeability. "proving that merging performance is influenced by the learning rate and iterations, which control the extent of parameter drift."
Q-Former: A transformer-based module that learns queries to extract features from encoders for LLMs. "The audio-LLM adopts BEATs-Iter3+~\citep{chen2023beats} as the audio encoder, with a Q-Former as the connector."
Router: A learned or heuristic component that selects expert modules in dynamic/MoE merging. "Dynamic merging (aka MoE-like methods)~\citep{tang2024merging,huang2024emr,lu2024twin} requires the dynamic loading of task-specific modules based on test inputs, involving training routers or prior knowledge."
Singular value spectrum: The distribution of singular values across components, reflecting rank and energy of a matrix. "Iso-C~\citep{marczak2025no} proposes an isotropic merging framework that flattens the singular value spectrum of task matrices"
Singular values: Non-negative values from SVD indicating component magnitudes along principal directions. "The Frobenius norm equals the sum of squared singular values $\|\boldsymbol{\tau}_{i,l}\|_F^2 = \sum_{j=1}^{r}\sigma_j^2$ ."
Sparsification-based methods: Approaches that prune or sparsify task vectors to reduce redundancy and interference before merging. "Sparsification-based methods: Ties-Merging~\citep{yadav2023ties} combines steps like trimming, parameter sign determination, and disjoint merging to produce the $\boldsymbol{\tau}_m$ ."
Spectral structure: The pattern of singular values and vectors (the spectrum) that affects how SVD-based methods behave. "SVD-based methods are sensitive to the spectral structure of task vectors."
SVD (Singular Value Decomposition): A matrix factorization into orthogonal matrices and singular values used for low-rank modeling. "Next, we perform SVD to isolate core task-specific knowledge from noise present in the top and lower singular vectors"
Supervised fine-tuning (SFT): Training a pretrained model on labeled data to specialize its behavior. "For effective supervised fine-tuning, we gather at least 100k public dataset samples for each task"
Task Arithmetic: A linear interpolation method that sums task vectors to build a multi-task model. "Task Arithmetic~\citep{ilharcoediting} computes task vectors $\boldsymbol{\tau}_i = \boldsymbol{\theta}_i - \boldsymbol{\theta}_0$ for individual tasks"
Task vector: The parameter difference between a fine-tuned model and its base, representing task-specific updates. "Task vectors contain significant redundancy and noise, leading to mutual interference during merging."
Test-time adaptation: Adjusting model parameters using unlabeled test data to improve performance under distribution shifts. "Test-time adaptation~\citep{yang2024adamerging,yang2024representation,daheim2024model} assumes access to unlabeled test datasets"
TIES Merging: A sparsification-based method that trims and merges task vectors with sign handling to reduce conflicts. "Ties-Merging~\citep{yadav2023ties} combines steps like trimming, parameter sign determination, and disjoint merging to produce the $\boldsymbol{\tau}_m$ ."
Truncated SVD: Approximating a matrix by keeping only the top-k singular components to reduce rank and noise. "We apply a direct low-rank approximation to $\boldsymbol{\tau}_{i,l}$ using truncated $\mathrm{SVD}(\boldsymbol{\tau}_{i,l}) \approx U_{1:k}\Sigma_{1:k}V_{1:k}^\top$ without centering."
TSV Merging: An SVD-based method that quantifies and reduces singular task interference via orthogonalization/decoupling. "TSV Merging~\citep{gargiulo2024task} quantifies task-specific feature overlap in weight space by measuring the singular task interference of $\boldsymbol{\tau}_i$ ."
Video-LLM: A model that encodes video inputs and interfaces with an LLM for text understanding and generation. "The video-LLM employs LanguageBind~\citep{zhu2023languagebind} as the video encoder."
Vision-LLM: A model that combines a vision encoder with an LLM to process images and text jointly. "We choose two types of vision-LLMs: InternVL2.5 and Qwen2-VL, providing both LoRA and full fine-tuning checkpoints."
Weight Averaging: Averaging parameters of multiple fine-tuned models to obtain a single merged model. "Weight Averaging~\citep{wortsman2022model} simply averages the weights of models fine-tuned on different tasks."
WUDI Merging: An optimization-based method that minimizes layer-wise interference using task vectors to implicitly leverage training data. "WUDI Merging~\cite{cheng2025whoever} proves that task vectors $\boldsymbol{\tau}$ form an approximate linear subspace of the fine-tuning data $\boldsymbol{x}$ ."
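
As an illustration of the sparsification entries in this glossary, the drop-and-rescale idea behind DARE can be sketched as follows. The `1 / (1 - drop_rate)` factor is the rescaling commonly associated with DARE and is not spelled out in the entry above; the function name, drop rate, and toy task vector are hypothetical.

```python
import numpy as np

def dare(tau, drop_rate, rng):
    """Zero each entry with probability drop_rate and rescale the
    surviving entries by 1 / (1 - drop_rate)."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    keep = rng.random(tau.shape) >= drop_rate
    return np.where(keep, tau / (1.0 - drop_rate), 0.0)

rng = np.random.default_rng(0)
tau = np.ones((100, 100))  # hypothetical task vector
sparse_tau = dare(tau, drop_rate=0.9, rng=rng)
```

The rescaling keeps the expected value of each entry unchanged, so the sparse task vector approximates the original on average while most of its entries are zero.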
