
OptMerge: Post-Hoc Fusion for Models & Code

Updated 7 December 2025
  • OptMerge is a framework of algorithms that post-hoc merges specialized neural models, optimizers, and compiler functions without accessing original training data.
  • It leverages techniques such as curvature-aware aggregation, saliency-based sparsification, and low-rank projections to minimize destructive interference and enhance performance.
  • The approach enables scalable, data-free integration across multimodal models and optimization pipelines, yielding improved generalization and resource efficiency.

OptMerge refers to a family of algorithms and frameworks that merge different sources of neural network knowledge, optimization logic, or executable code through post-hoc fusion of independently optimized entities. It encompasses multiple streams of research, notably model merging for LLMs and multimodal models, optimizer amalgamation, and compiler-level global function merging. These methodologies share the goal of synthesizing a superior or more broadly capable artifact from specialized constituents, prioritizing data-free, scalable, and robust integration.

1. Model Merging: Theoretical Principles and Objectives

Model merging, central to most uses of OptMerge, focuses on fusing several expert neural models, each fine-tuned or otherwise specialized, into a unified model that consolidates their capabilities. Given a shared backbone $\theta_0$ and a set of expert checkpoints $\{\theta_k\}_{k=1}^K$, the merged model $\theta_m$ is constructed by combining the task vectors $\tau_k = \theta_k - \theta_0$ in a principled fashion to produce $\theta_m = \theta_0 + \tau_m$. This approach forgoes retraining and obviates access to original fine-tuning data, making it attractive for both resource efficiency and privacy (Wei et al., 26 May 2025, Wang et al., 17 Feb 2025, Mahdavinia et al., 14 Sep 2025).

The objectives of OptMerge-style merging are:

  • Combine and retain the specialized skills of different experts.
  • Minimize destructive interference or knowledge overwriting.
  • Achieve generalization comparable to or surpassing multi-task training.
  • Scale to multiple modalities or disparate task distributions.
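As a concrete illustration of the task-vector arithmetic above, the following minimal Python sketch merges experts by uniform averaging of their task vectors. It assumes ordinary PyTorch state_dicts with floating-point weights; the function and variable names are illustrative, not drawn from any of the cited papers, and uniform averaging stands in for the more principled combination rules discussed in the following sections.

```python
import torch

def merge_task_vectors(base_state, expert_states, alpha=1.0):
    """Data-free merge: theta_m = theta_0 + alpha * mean_k(theta_k - theta_0).

    base_state and expert_states are ordinary state_dicts (name -> tensor).
    """
    merged = {}
    for name, theta0 in base_state.items():
        # Task vector for each expert: its deviation from the shared backbone.
        task_vectors = [expert[name] - theta0 for expert in expert_states]
        tau_m = torch.stack(task_vectors).mean(dim=0)
        merged[name] = theta0 + alpha * tau_m
    return merged
```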

2. Curvature- and Saliency-Aware Weight Space Fusion

Recent OptMerge variants leverage second-order information and parameter saliency to guide the merging process. Two key strategies emerge:

2.1 Curvature-Aware Aggregation

Optimization Trajectory Aware (OTA) Merging utilizes optimizer second-moment statistics $v_{\tau,i}$ (the Adam "exp_avg_sq" terms) to form a diagonal curvature proxy $C_\tau = \mathrm{diag}(v_\tau)$. This curvature quantifies the local sensitivity of the loss with respect to parameter changes, and thus informs the aggregation weighting. OTA solves

$$\Delta w_{\rm merged} = \arg\min_{\Delta} \sum_{\tau=1}^{T} \|\Delta - \Delta w'_\tau\|_{C_\tau}^2,$$

yielding the closed-form

$$\Delta w_{\rm merged} = \left(\sum_{\tau=1}^{T} C_\tau\right)^{-1} \sum_{\tau=1}^{T} C_\tau\, \Delta w'_\tau,$$

enabling elementwise, curvature-weighted updates that mitigate task interference (Mahdavinia et al., 14 Sep 2025).
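The closed form above translates directly into an elementwise weighted average. The sketch below is an illustrative PyTorch rendering (the name `ota_merge` and the `curvatures` argument are assumptions, not the authors' code), using each expert's Adam second-moment tensors as the diagonal curvature proxy:

```python
import torch

def ota_merge(base_state, expert_states, curvatures, eps=1e-12):
    """Closed-form curvature-weighted merge of task vectors.

    curvatures[k][name] is a per-parameter proxy C_k for expert k
    (e.g. Adam's 'exp_avg_sq'); expert deltas are averaged elementwise
    with weights proportional to their curvature, per the closed form above.
    """
    merged = {}
    for name, theta0 in base_state.items():
        num = torch.zeros_like(theta0)
        den = torch.zeros_like(theta0)
        for expert, curv in zip(expert_states, curvatures):
            delta = expert[name] - theta0          # task vector Δw'_τ
            c = curv[name]                          # diagonal curvature C_τ
            num += c * delta
            den += c
        merged[name] = theta0 + num / (den + eps)  # (Σ C_τ)^(-1) Σ C_τ Δw'_τ
    return merged
```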

2.2 Saliency-Based Sparsification

Methods like Fast Fisher Grafting (FFG) and Optimal Brain Iterative Merging (OBIM) compute per-parameter saliency using empirical Fisher or Hessian-diagonal proxies,

$$s_{\tau,i} = (\Delta w_{\tau,i})^2\, v_{\tau,i} \quad\text{or}\quad s_i = \tfrac{1}{2}\, h_{ii}\, \delta_i^2,$$

and apply masks that keep only the most influential weights per expert. OBIM further enforces mutually exclusive ownership across experts: each weight is allocated to the model maximizing $s_i$, thus avoiding destructive averaging of conflicting updates (Wang et al., 17 Feb 2025, Mahdavinia et al., 14 Sep 2025).
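A simplified sketch of this idea, combining FFG-style saliency sparsification with OBIM-style mutually exclusive ownership, is shown below. The implementation details (function name, `keep_ratio` threshold, use of second-moment tensors as the saliency proxy) are illustrative assumptions, not code from either paper.

```python
import torch

def disjoint_saliency_merge(base_state, expert_states, curvatures, keep_ratio=0.1):
    """Saliency masking with mutually exclusive ownership across experts.

    Saliency s_{k,i} = (Δw_{k,i})^2 * v_{k,i}; only the top `keep_ratio`
    fraction of each expert's entries survive its mask, and each parameter
    is then assigned to the single expert with the highest saliency.
    """
    merged = {}
    for name, theta0 in base_state.items():
        deltas = torch.stack([e[name] - theta0 for e in expert_states])
        sal = deltas.pow(2) * torch.stack([c[name] for c in curvatures])

        # Per-expert sparsification: keep only the most salient entries.
        k = max(1, int(keep_ratio * deltas[0].numel()))
        for j in range(deltas.shape[0]):
            thresh = sal[j].flatten().topk(k).values.min()
            keep = sal[j] >= thresh
            deltas[j] = deltas[j] * keep
            sal[j] = sal[j] * keep

        # Disjoint ownership: each weight goes to the expert with max saliency.
        owner = sal.argmax(dim=0, keepdim=True)
        merged[name] = theta0 + deltas.gather(0, owner).squeeze(0)
    return merged
```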

3. Low-Rank and Structured Merging Approaches

OptMerge and related frameworks exploit the empirical observation that expert fine-tuning updates (task vectors) are intrinsically low-rank and highly structured:

  • Low-rank projections via SVD (e.g., in vision-language fusion tasks) denoise task vectors before merging, preserving only directions supported by multiple experts (Wei et al., 26 May 2025); see the sketch after this list.
  • Block-structured sparsity emerges naturally in FFG and OBIM, where nonzero weights concentrate in critical attention/value channels or early embedding layers, yielding implicit rank reduction (Mahdavinia et al., 14 Sep 2025).
  • For memory efficiency, curvature proxies may be compressed using factorized (e.g., rank-1) approximations without appreciable performance drops (Mahdavinia et al., 14 Sep 2025).
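A minimal sketch of the SVD-based denoising referenced in the first bullet above, applied to a single 2-D task vector (e.g., a linear layer's delta); the rank and the function name are illustrative choices, not values from the cited work.

```python
import torch

def lowrank_denoise(task_vector_2d, rank=16):
    """Project a 2-D task vector onto its top-`rank` singular directions,
    discarding low-energy components before merging."""
    U, S, Vh = torch.linalg.svd(task_vector_2d, full_matrices=False)
    r = min(rank, S.numel())
    return (U[:, :r] * S[:r]) @ Vh[:r, :]
```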

Table: Comparison of Representative OptMerge-Based Model Merging Algorithms

| Algorithm | Key Technique | Handling of Task Interference |
| --- | --- | --- |
| OTA + FFG (Mahdavinia et al., 14 Sep 2025) | Curvature-weighted merge, Fisher-based sparsity masking | Curvature-guided weighting, structured saliency masks |
| OptMerge (MLLM) (Wei et al., 26 May 2025) | SVD denoising, WUDI objective, SGD stability | Low-rank deconfounding, careful initialization |
| OBIM (Wang et al., 17 Feb 2025) | Saliency-based iterative masks, layerwise MSE saliency | Disjoint allocation (no averaging), mutually exclusive merge |

4. Optimizer Amalgamation: Unifying Multiple Update Rules

In "Optimizer Amalgamation," OptMerge denotes a meta-learning framework where a "student" optimizer PϕP_\phi is trained to distill and blend the strengths of multiple "teacher" optimizers {Tk}\{T_k\} (Huang et al., 2022). The amalgamation objective combines a meta-loss that rewards fast progress on the optimizee with a distillation loss penalizing distance between the student and each teacher's parameter trajectory: L(ϕ)=Lmeta(ϕ)+αLamalg(ϕ).\mathcal{L}(\phi) = \mathcal{L}_\text{meta}(\phi) + \alpha \mathcal{L}_\text{amalg}(\phi). Amalgamation mechanisms include mean/sum, min-max, or a learned "soft-gate" that convexly selects teacher updates at each step and distills the resultant trajectory into PϕP_\phi. Meta-training stability is enhanced by Gaussian or adversarial perturbations in the student’s weight space, reducing meta-variance without degrading performance.

Empirical results demonstrate that student optimizers trained via optimal choice amalgamation with random perturbation ("OptMerge-Choice") can outperform all teacher optimizers and prior learned optimizer baselines across diverse tasks (Huang et al., 2022).

5. Application to Multimodal and Modular Models

OptMerge is leveraged to merge both intra-modality capabilities (e.g., VQA, geometry, OCR) and inter-modality experts (vision-language, audio-language, video-language) into a unified multimodal LLM (Wei et al., 26 May 2025). The benchmark covers both full fine-tuning and LoRA/adapter-based merging, spanning combinations across five vision-language domains as well as audio and video connectors.

Key observations:

  • Modality merging yields merged performance exceeding any constituent expert, due to complementary knowledge—e.g., merged models achieve 67.00% on AVQA/MUSIC-AVQA, surpassing vision-only (63.16%), audio-only (37.75%), or video-only (64.11%) models.
  • Ablations show significant cumulative improvements from better initialization and low-rank denoising in the OptMerge algorithm.

This suggests that data-free post-hoc merging can approach and in some cases exceed joint multi-task training, with vastly reduced resource requirements.

6. Compiler-Level OptMerge: Optimistic Global Function Merging

Beyond neural models, "OptMerge" also describes an optimistic global function merger for code-size reduction in compilers (Lee et al., 2023). Here, per-function summaries (stable hashes, parameterizable operand maps) are recorded and exchanged across compilation units, allowing modules to independently instantiate merged, parameterized template functions. Mismatches are avoided by only emitting merged functions and trampoline thunks, never rewriting original call sites.
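The following toy Python sketch illustrates only the summary-and-grouping step described above (the real merger operates on compiler IR; the hashing scheme here is a deliberately simplified, hypothetical stand-in for stable hashes over parameterizable operand maps).

```python
import hashlib
from collections import defaultdict

def stable_hash(instructions):
    """Hash a function's instruction stream with constant operands abstracted
    away, so structurally identical functions that differ only in those
    operands hash to the same value (toy stand-in for the real IR summary)."""
    h = hashlib.sha256()
    for opcode, operands in instructions:
        h.update(opcode.encode())
        h.update(b"<param>" * len(operands))   # abstract parameterizable operands
    return h.hexdigest()

def group_merge_candidates(functions):
    """functions: name -> [(opcode, operands), ...]; returns hash -> names.
    Groups with more than one member can share a parameterized template,
    reached from the original entry points via small trampoline thunks."""
    groups = defaultdict(list)
    for name, instrs in functions.items():
        groups[stable_hash(instrs)].append(name)
    return {h: names for h, names in groups.items() if len(names) > 1}
```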

Combined with global function outlining, OptMerge provides up to 3.5% additional binary size reduction on top of 17.3% from global outlining in mobile applications, with negligible build-time overhead.

The approach is robust to distributed and cached builds, as it never assumes IR stability and always leaves the original function intact.

7. Limitations, Practical Considerations, and Future Directions

Limitations

  • All OptMerge-family merging methods assume a common backbone initialization and identical model architectures.
  • Masking and low-rank heuristics require careful calibration to avoid catastrophic knowledge erasure or underspecified merges.
  • Most current benchmarks use English-centric, relatively small models (<7B parameters); generalization to large multilingual or highly divergent architectures is open.
  • Saliency-based methods (e.g., OBIM) require small validation sets per expert for optimal performance, though forward-only and label-free passes suffice.

Future Trajectories

  • Extension to larger, more diverse expert sets and mixed architectures.
  • Application of learned profitability/importance heuristics, possibly via meta-learning or AI-guided parametric policies.
  • Combinations of OptMerge-based model merging with dynamic routing, mixture-of-experts, or lifelong continual learning paradigms.
  • Integration of low-rank, structured sparsification at the compiler level and synergy with advanced outliner/merger pipelines.

OptMerge, in its various guises, forms a unifying principle in post-hoc model, optimizer, and code merging, balancing expressivity, computational efficiency, and robustness across application domains (Mahdavinia et al., 14 Sep 2025, Wei et al., 26 May 2025, Huang et al., 2022, Lee et al., 2023, Wang et al., 17 Feb 2025).
