
Predictive Merge Operator Selection

Updated 16 January 2026
  • The paper presents a predictive merge operator selection framework (SimMerge) that eliminates the need for exhaustively testing all model combinations, thereby reducing computational costs.
  • It leverages both probe-based signals (e.g., next-token KL divergence, activation cosine similarity) and structural weight metrics to accurately predict operator performance and merge order.
  • Empirical results demonstrate that SimMerge achieves superior gap-closure and risk mitigation, generalizing well to larger models and diverse tasks compared to fixed operators.

Predictive merge operator selection is an approach aimed at optimizing the composition of multiple LLMs by algorithmically choosing the merge operator, involved model subsets, and merge order based on predictive signals rather than expensive post-hoc evaluations. This methodology enables scalable model merging—producing a single checkpoint that integrates knowledge across diverse specializations—while controlling computational and human-in-the-loop costs. The predictive approach contrasts sharply with the traditional “merge-and-evaluate” paradigm, where all candidate combinations are tested exhaustively to identify optimal configurations.

1. Problem Formulation and Limitations of Merge-and-Evaluate

Given a catalog of $k$ domain-specialized LLM checkpoints $\{m_1,\dots,m_k\}$, each parameterized by $\theta(m)\in\mathbb{R}^d$, the task is to merge a selected subset $S$ into a new model $\tilde m$ that performs robustly across tasks $\mathcal{T}$. The conventional procedure (“merge-and-evaluate”) involves selecting $S$, an ordering $\pi$ over $S$, and a sequence of merge operators $\mathbf{o}$. The resulting search space comprises $|S|! \times |\mathcal{O}|^{|S|-1}$ configurations, each requiring one or more resource-intensive downstream evaluations. This combinatorial explosion rapidly makes brute-force search intractable as $|S|$ and $|\mathcal{O}|$ grow (Bolton et al., 14 Jan 2026).

Predictive selection addresses this limitation by eliminating the need to merge and evaluate every configuration. Instead, the optimal combination is predicted upfront using cheap, task-agnostic similarity features computed prior to merging.
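To make the combinatorial growth concrete, the configuration count from the formula above can be computed directly. This is a minimal sketch; the operator count of 3 (e.g., Linear, SLERP, TIES) is taken from the operators named later in this article:

```python
from math import factorial

def merge_and_evaluate_configs(subset_size: int, num_operators: int) -> int:
    """Number of merge-and-evaluate configurations for a subset S:
    |S|! orderings times |O|^(|S|-1) operator choices, one per pairwise step."""
    return factorial(subset_size) * num_operators ** (subset_size - 1)

# With 3 candidate operators, the search space grows rapidly with subset size:
for k in (2, 3, 4, 5):
    print(k, merge_and_evaluate_configs(k, 3))
# 2 -> 6, 3 -> 54, 4 -> 648, 5 -> 9720
```

Each of these configurations would require at least one downstream evaluation under merge-and-evaluate, which is exactly the cost predictive selection avoids.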

2. Overview of the Predictive Merge Selection Pipeline (SimMerge)

The SimMerge framework operationalizes predictive merge operator selection via the following steps:

  1. Probe Collection: For each task $t$, sample an unlabeled probe set $\mathcal{P}_t$ from the task’s input distribution.
  2. Feature Extraction: Compute a similarity feature vector $x(m_a, m_b, t) \in \mathbb{R}^m$ for each ordered model pair and task, aggregating both functional (probe-driven) and structural (weight-based) features.
  3. Operator Selector Training: Offline, for observed triples $(m_a, m_b, t)$ and each operator $o \in \mathcal{O}$, perform the merge and measure downstream performance, yielding $(x, o^*)$ pairs for supervised learning.
  4. Multiway Plan Scorer: Collect per-step features for multi-model merge sequences, using these to regress final merged model performance.
  5. Deployment: At inference, apply trained selectors to predict merge operators and merge sequences for arbitrary catalog subsets and previously unseen tasks, without remerging or reevaluating alternatives.

The core workflow is summarized by the algorithmic sequence in Table 1.

| Pipeline Step | Inputs | Output/Action |
| --- | --- | --- |
| Probe Collection | Tasks $\mathcal{T}$, checkpoints | Probes $\mathcal{P}_t$ |
| Feature Extraction | Model pairs, tasks, probes | Similarity feature vectors |
| Selector Training | Features, evaluated merges | Predictive operator selector |
| Plan Scoring | Multiway feature concatenations | Merge order predictions |
| Deployment | Arbitrary model subsets, tasks | Executed merges by prediction |

3. Similarity Signals for Predictive Selection

Predictive merge operator selection leverages two primary categories of similarity signals:

Functional (Probe-Based) Features:

  • Next-token KL Divergence: $\mathrm{KL}_{\mathrm{mean}}(a, b, t)$ is computed by averaging per-position KL divergences between predicted token distributions on probe examples under teacher forcing.
  • Activation Cosine Similarity: Cosine similarity is computed between flattened hidden activations for corresponding layers across models, aggregated over probes.
  • Attention-Pattern Cosine: Cosine similarity between attention matrices across corresponding heads and layers.
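As a minimal sketch of the first two probe-based signals, assuming per-position logits and hidden activations have already been collected from both models on a probe example (function names and array shapes are illustrative, not from the paper):

```python
import numpy as np

def mean_next_token_kl(logits_a: np.ndarray, logits_b: np.ndarray) -> float:
    """Mean per-position KL(P_a || P_b) between next-token distributions
    under teacher forcing; logits arrays have shape (positions, vocab)."""
    pa = np.exp(logits_a - logits_a.max(-1, keepdims=True))
    pa /= pa.sum(-1, keepdims=True)
    pb = np.exp(logits_b - logits_b.max(-1, keepdims=True))
    pb /= pb.sum(-1, keepdims=True)
    return float(np.mean(np.sum(pa * (np.log(pa) - np.log(pb)), axis=-1)))

def activation_cosine(h_a: np.ndarray, h_b: np.ndarray) -> float:
    """Cosine similarity between flattened hidden activations of one layer."""
    a, b = h_a.ravel(), h_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In practice these would be averaged over all probes in $\mathcal{P}_t$ and, for the activation signal, over corresponding layers.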

Structural (Weight-Based) Features:

  • Weight Cosine: $\cos_W(a, b)$, the cosine similarity of flattened parameter vectors (optionally by layer).
  • $\ell_2$ Distance: $d_W(a, b) = \|\theta_a - \theta_b\|_2$, with per-layer and aggregate summaries.
  • Weight Norms: Norm statistics for each model.

All sequence-derived metrics (e.g., across layers or heads) are summarized by mean, median, and quantiles, yielding a fixed-length vector $x(m_a, m_b, t)$ per model pair and task (Bolton et al., 14 Jan 2026).
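The structural features and their fixed-length summarization might be sketched as follows, assuming each model is given as a list of matched per-layer weight arrays (the exact quantile set and feature ordering are assumptions for illustration):

```python
import numpy as np

def structural_features(params_a: list, params_b: list) -> np.ndarray:
    """Per-layer weight cosine and l2 distance between matched tensors,
    summarized across layers into a fixed-length feature vector."""
    cos, dist = [], []
    for wa, wb in zip(params_a, params_b):
        a, b = wa.ravel(), wb.ravel()
        cos.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        dist.append(np.linalg.norm(a - b))

    def summarize(values):
        # mean, median, and quantiles: fixed-length regardless of layer count
        v = np.asarray(values)
        return [v.mean(), np.median(v), *np.quantile(v, [0.25, 0.75])]

    return np.array(summarize(cos) + summarize(dist))
```

Weight-norm statistics and the probe-based signals would be summarized the same way and concatenated into the final vector $x(m_a, m_b, t)$.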

4. Architecture of the Predictive Selection Models

SimMerge uses supervised architectures tailored to both pairwise and multiway merge planning.

  • Pairwise Operator Selector: A two-layer MLP with ReLU activations receives the feature vector (plus an optional task embedding $c(t)$), predicts operator probabilities via softmax, and selects the optimal merge operator,

$$\hat{o} = \arg\max_{o} p_o.$$

Training is performed with cross-entropy loss against the observed best-performing operator for each feature-target triplet.

  • Multiway Plan Scorer: For multi-step merges, feature blocks for each pairwise step (including propagated similarity metrics for intermediate merges) are concatenated and scored via an MLP, whose output regresses the empirical performance of the resulting merge sequence.
  • Extension to Multiway (without Retraining): The encoder, trained solely on 2-way data, scales to $k$-way plans at test time by leveraging similarity-metric propagation using convexity and triangle-inequality bounds.
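A minimal forward-pass sketch of the pairwise selector, with weights passed in explicitly (the operator list, layer sizes, and function name are illustrative; training against the cross-entropy loss is omitted):

```python
import numpy as np

OPERATORS = ["linear", "slerp", "ties"]  # example operator set

def select_operator(x, W1, b1, W2, b2):
    """Two-layer ReLU MLP over the similarity feature vector x, softmax over
    operators, argmax selection (o_hat = argmax_o p_o)."""
    h = np.maximum(0.0, x @ W1 + b1)    # hidden layer with ReLU
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max())
    p /= p.sum()                        # operator probabilities via softmax
    return OPERATORS[int(np.argmax(p))], p
```

A task embedding $c(t)$ would simply be concatenated onto `x` before the first layer.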

An online bandit variant is also supported: at each step, a contextual bandit interprets pairwise similarity features as round context, applying Thompson Sampling (LinTS) for arm (operator) selection. The feature map is frozen post-warmup, allowing for continual adaptation to novel tasks, models, and operators without full retraining.
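The bandit variant can be sketched as a standard linear Thompson Sampling loop over operators, with the pairwise similarity feature vector as the round context. This is a generic LinTS implementation under those assumptions, not the paper's exact variant:

```python
import numpy as np

class LinTS:
    """Linear Thompson Sampling over merge operators (arms): one Bayesian
    linear reward model per arm; context = pairwise similarity features."""

    def __init__(self, n_arms: int, dim: int, v: float = 1.0, seed: int = 0):
        self.A = [np.eye(dim) for _ in range(n_arms)]    # precision matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # reward-weighted sums
        self.v = v
        self.rng = np.random.default_rng(seed)

    def select(self, x: np.ndarray) -> int:
        """Sample a parameter vector per arm and pick the highest-scoring arm."""
        scores = []
        for A, b in zip(self.A, self.b):
            mu = np.linalg.solve(A, b)
            theta = self.rng.multivariate_normal(mu, self.v**2 * np.linalg.inv(A))
            scores.append(x @ theta)
        return int(np.argmax(scores))

    def update(self, arm: int, x: np.ndarray, reward: float) -> None:
        """Standard Bayesian linear-regression update for the chosen arm."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

Freezing the feature map after warmup, as described above, means only these cheap per-arm updates run online as new tasks and operators arrive.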

5. Empirical Evaluation and Results

SimMerge was evaluated using “Command-A 7B” and “Command-A 111B” LLMs across Code, Math, Multilingual QA, and RAG domains, using catalogs of up to 85 unique 7B checkpoints and 18 unique 111B checkpoints. Pairwise, 3-way, and 4-way merges were considered, with baselines including fixed Linear, SLERP, and TIES operators.

Metrics include:

  • Normalized Improvements: $\Delta_{\mathrm{exp}}$ and $\Delta_{\mathrm{aux}}$ (performance relative to expert and auxiliary models, respectively).
  • Gap-Closed: $100 \times \frac{s(m,t) - s_{\mathrm{aux}}}{s_{\mathrm{exp}} - s_{\mathrm{aux}}}$, quantifying the proportion of the performance gap spanned by the merged model.
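The gap-closed metric is straightforward to compute from three scalar scores (the function name is illustrative):

```python
def gap_closed(s_merged: float, s_expert: float, s_aux: float) -> float:
    """Percentage of the expert-auxiliary gap spanned by the merged model:
    100 * (s(m,t) - s_aux) / (s_exp - s_aux)."""
    return 100.0 * (s_merged - s_aux) / (s_expert - s_aux)

# A merged model scoring 75 between an auxiliary at 60 and an expert at 80
# closes 75% of the gap; matching the expert closes 100%.
```

Values above 100% indicate the merge outperforms the expert; negative values indicate it falls below the auxiliary.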

Key findings:

  • Pairwise merges (7B): SimMerge closes 65.0% of the expert-auxiliary gap, outperforming Linear (41.8%) and all other fixed operators. For example, in Math, TIES achieves 83.3% gap-closed, while SimMerge reaches 94.3%. In all domains, SimMerge improves over auxiliaries by 29–34% (vs. 7–15% for fixed operators) and suffers markedly less degradation from experts (degrading by <3% vs. up to 10% for fixed).
  • Multiway merges (7B): For 3-way merges, SimMerge achieves $\Delta_{\mathrm{aux}} = +17.7\%$, outperforming Linear ($+13.5\%$). In 4-way merges, SimMerge also consistently achieves superior performance.
  • Merge order: Learned order via SimMerge increases gap closure by up to 47% relative to random order in Code.
  • Transfer to 111B: Selectors trained on 7B merges generalize to 111B without retraining, achieving $\Delta_{\mathrm{aux}} = +42.7\%$, substantially better than the best fixed operator, Linear ($+34.8\%$).
  • Online adaptation: The LinTS bandit variant rapidly matches near-oracle performance and outperforms uniform random selection by 10–20 percentage points on $\Delta_{\mathrm{aux}}$ and $\Delta_{\mathrm{exp}}$.

6. Analysis, Ablations, and Feature Insights

Ablations reveal:

  • Task Encoding: Including learned task embeddings $c(t)$ improves pairwise selector accuracy by ~2% across operator classes; e.g., Linear accuracy increases from 85.2% to 87.5%.
  • Feature correlations: Operator performance correlates with specific signal regimes:
    • High KL divergence: favors SLERP.
    • High weight cosine: favors TIES.
    • High $\ell_2$ distance: favors Linear.
  • Regime dependence: Percentile-sorted feature bins confirm that operator efficacy is regime-dependent, and negative-tail outcomes (catastrophic merges) are largely avoided by SimMerge’s instance-conditional routing.
  • Operator selection as risk mitigation: Fixed operators are susceptible to catastrophic performance in adversarial feature regimes, whereas SimMerge consistently routes away from these failure modes.
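The reported correlations could be caricatured as a rule-based router. Note that SimMerge's actual selector is a learned MLP, not a rule list, and the thresholds below are invented purely for illustration:

```python
def route_operator(kl_mean: float, weight_cos: float, l2_dist: float,
                   kl_thresh: float = 1.0, cos_thresh: float = 0.9,
                   l2_thresh: float = 10.0) -> str:
    """Toy router reflecting the reported signal-regime correlations;
    all thresholds are hypothetical, not from the paper."""
    if kl_mean > kl_thresh:
        return "slerp"   # high functional divergence favors SLERP
    if weight_cos > cos_thresh:
        return "ties"    # high weight cosine favors TIES
    if l2_dist > l2_thresh:
        return "linear"  # large l2 distance favors Linear
    return "linear"      # fallback default
```

The point of the learned selector is precisely that these regime boundaries are estimated from data per instance rather than hand-set.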

7. Broader Significance and Practical Implications

Predictive merge operator selection, as operationalized by SimMerge, enables robust scaling of model merging in large checkpoint catalogs and limited evaluation budgets. By eliminating the exhaustive merge-and-evaluate loop in favor of fast, feature-based selectors, this approach provides substantial gains in merge quality, extensibility to novel tasks and much larger models, and support for online adaptation as models and domains continually evolve (Bolton et al., 14 Jan 2026). This suggests predictive operator selection may be essential for scalable model composition in modern LLM workflows, particularly as the field trends toward modular, compositional, and continual learning regimes.
