SimMerge: Learning to Select Merge Operators from Similarity Signals
Abstract: Model merging enables multiple LLMs to be combined into a single model while preserving performance. This makes it a valuable tool in LLM development, offering a competitive alternative to multi-task training. However, merging can be difficult at scale: a successful merge requires choosing the right merge operator, selecting the right models, and merging them in the right order. This often leads researchers to run expensive merge-and-evaluate searches to select the best merge. In this work, we provide an alternative by introducing SimMerge, a predictive merge-selection method that selects the best merge using inexpensive, task-agnostic similarity signals between models. From a small set of unlabeled probes, we compute functional and structural features and use them to predict the performance of a given 2-way merge. Using these predictions, SimMerge selects the best merge operator, the subset of models to merge, and the merge order, eliminating the expensive merge-and-evaluate loop. We demonstrate that SimMerge surpasses standard merge operators on 2-way merges of 7B-parameter LLMs, and that it generalizes to multi-way merges and 111B-parameter LLM merges without retraining. Additionally, we present a bandit variant that supports adding new tasks, models, and operators on the fly. Our results suggest that learning how to merge is a practical route to scalable model composition when checkpoint catalogs are large and evaluation budgets are tight.
Explain it Like I'm 14
What is this paper about?
This paper is about a smart way to combine different specialized AI LLMs into one stronger model. Instead of blindly trying lots of ways to mix models (which is slow and expensive), the authors build a tool called SimMerge that quickly predicts the best way to merge models by checking how similar they are—before doing the merge.
The big idea in one line
SimMerge doesn’t invent a new mixing method; it learns when to use existing mixing methods by reading “similarity signals” between models, saving time and compute while keeping performance high.
What were the researchers trying to find out?
In simple terms, they wanted to answer three questions that come up when you have many fine‑tuned models (for code, math, languages, etc.) and you want one model that can do a lot:
- Which models should we combine?
- Which “mixing rule” (merge operator) should we use for each pair?
- If we combine more than two models, in what order should we merge them?
They asked: Can we predict these choices cheaply and accurately using signals that don’t require labeled data or lots of test runs?
How did they approach the problem?
Think of each model like a student with a specialty—one’s great at math, another at coding, another at reading many languages. Combining them is like creating a study group. But how you combine them matters: some methods blend better than others, and the order you add students to the group can affect the final result.
SimMerge works in three main steps:
- It measures similarity between models using quick, inexpensive checks:
- Functional similarity: Give both models the same small set of inputs (like a mini, unlabeled quiz) and compare their answers. If they answer similarly, they’re functionally close.
- Structural similarity: Compare the “shape” of their internal settings (their parameters/weights), a bit like seeing how similar two engines are on the inside.
- These measurements include distances and angles between weight vectors and how similar the models’ output distributions are—don’t worry about the math names; they’re just ways to measure “how alike” two models are.
- It trains a small predictor to choose a merge operator:
- Merge operators are different ways to mix two sets of model parameters (see the code sketch after this list), like:
- Linear interpolation: a simple halfway blend.
- SLERP: a “curved” blend that keeps the overall scale consistent.
- TIES: a rule that tries to reduce interference by aligning signs and trimming unhelpful parts.
- SimMerge learns, from past examples, which operator works best given the similarity signals for a model pair.
- It plans the order for multi‑model merges:
- Merging isn’t “order-insensitive.” Like cooking, adding ingredients in a different order can change the outcome.
- SimMerge scores different merge orders using the same similarity features and picks the best plan—without testing every possibility (a sketch of this appears at the end of this section).
- It also includes an online “bandit” version (think of it like trying options and learning from feedback) that adapts when new models or tasks appear.
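To make the three mixing rules listed above concrete, here is a minimal sketch of how each could combine a single weight tensor, assuming PyTorch tensors and an equal mixing weight (α = 0.5 by default). The TIES step is deliberately simplified, and the keep fraction and helper names are illustrative assumptions, not the paper's implementation.

```python
import torch

def linear_merge(w_a, w_b, alpha=0.5):
    """Plain interpolation: a halfway blend when alpha = 0.5."""
    return alpha * w_a + (1.0 - alpha) * w_b

def slerp_merge(w_a, w_b, alpha=0.5, eps=1e-8):
    """Spherical interpolation: blend along the arc between the two flattened
    weight vectors so the overall scale stays consistent."""
    a, b = w_a.flatten(), w_b.flatten()
    cos_omega = torch.dot(a / (a.norm() + eps), b / (b.norm() + eps))
    omega = torch.acos(torch.clamp(cos_omega, -1.0, 1.0))
    if omega.abs() < 1e-6:  # nearly parallel directions: fall back to linear
        return linear_merge(w_a, w_b, alpha)
    so = torch.sin(omega)
    mixed = (torch.sin(alpha * omega) / so) * a + (torch.sin((1.0 - alpha) * omega) / so) * b
    return mixed.reshape(w_a.shape)

def ties_style_merge(w_a, w_b, base, keep_frac=0.2):
    """Simplified TIES-like rule on one tensor: form task vectors (deltas from the
    shared base), keep only the largest-magnitude entries, elect a per-entry sign
    by total mass, drop entries that disagree with it, then average what is left."""
    deltas = [w_a - base, w_b - base]
    trimmed = []
    for d in deltas:
        k = max(int(keep_frac * d.numel()), 1)
        thresh = d.abs().flatten().topk(k).values[-1]
        trimmed.append(torch.where(d.abs() >= thresh, d, torch.zeros_like(d)))
    elected_sign = torch.sign(trimmed[0] + trimmed[1])   # sign carrying more total mass
    kept = [torch.where(torch.sign(d) == elected_sign, d, torch.zeros_like(d))
            for d in trimmed]
    counts = ((kept[0] != 0).float() + (kept[1] != 0).float()).clamp(min=1.0)
    return base + (kept[0] + kept[1]) / counts
```

Given a pair of models, SimMerge's job is then simply to predict which of these rules (always with equal mixing weight in the paper) will give the best merged model.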
Why this is efficient: All the similarity signals come from small unlabeled “probe” inputs and simple weight comparisons—no full training runs or large evaluations needed. That makes it much cheaper than brute-force “try everything and evaluate.”
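As a concrete illustration of those cheap signals, the sketch below computes a few pairwise features of the kind described above: an average KL divergence between the two models' next-token distributions on a shared unlabeled probe batch (functional), plus cosine similarity and Euclidean distance between the flattened weights (structural). It assumes Hugging Face-style causal LMs that share an architecture; the function names and the exact feature set are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def functional_features(model_a, model_b, probe_batches):
    """Average KL(p_a || p_b) of next-token distributions over unlabeled probes."""
    kls = []
    for batch in probe_batches:                      # batch: dict of input_ids, masks, ...
        logits_a = model_a(**batch).logits           # [batch, seq, vocab]
        logits_b = model_b(**batch).logits
        log_pa = F.log_softmax(logits_a, dim=-1)
        log_pb = F.log_softmax(logits_b, dim=-1)
        kl = (log_pa.exp() * (log_pa - log_pb)).sum(-1)   # KL per position
        kls.append(kl.mean())
    return {"probe_kl": torch.stack(kls).mean().item()}

@torch.no_grad()
def structural_features(model_a, model_b):
    """Cosine similarity and Euclidean distance between flattened parameters.
    Assumes both models share the same architecture and parameter ordering."""
    dots, norm_a, norm_b, sq_dist = 0.0, 0.0, 0.0, 0.0
    for (_, p_a), (_, p_b) in zip(model_a.named_parameters(),
                                  model_b.named_parameters()):
        a, b = p_a.flatten().float(), p_b.flatten().float()
        dots += torch.dot(a, b).item()
        norm_a += a.pow(2).sum().item()
        norm_b += b.pow(2).sum().item()
        sq_dist += (a - b).pow(2).sum().item()
    return {"weight_cosine": dots / ((norm_a ** 0.5) * (norm_b ** 0.5) + 1e-12),
            "weight_l2": sq_dist ** 0.5}
```

In SimMerge, features of this kind (together with activation- and attention-based ones) feed a small learned selector that predicts which merge operator will work best for the pair.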
What did they find?
Here are the main results and why they matter:
- It beats fixed rules for 2‑model merges:
- Across tasks like code, math, multilingual, and retrieval (RAG), SimMerge consistently chose better operators than always using the same rule.
- On average, it recovered 65% of the performance gap between off‑task models and the task expert, compared to 41.8% for the best single fixed operator. This means it got much closer to expert-level performance without expensive search.
- It works for merging 3 and 4 models too:
- Even though the selector was trained only on 2‑model cases, it generalized well to 3‑ and 4‑way merges by scoring merge plans from pairwise similarities.
- Merging more models naturally gets harder, but SimMerge still kept the best balance: a smaller drop from the expert model and a bigger gain over non-experts.
- Order matters—and SimMerge picks better orders:
- Using the same operators but a smarter order, SimMerge closed notably more of the expert gap than random orderings (e.g., +47% more on code, +21% on RAG). So choosing the right sequence really helps.
- It scales to much larger models without retraining:
- A selector trained on 7‑billion‑parameter models worked on 111‑billion‑parameter models and still outperformed fixed rules—showing strong transfer.
- It adapts online when things change:
- The bandit variant quickly learned to pick good operators as new tasks/models were added, getting close to an “oracle” that always knows the best choice.
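For readers curious what that online variant could look like mechanically, below is a minimal linear Thompson sampling (LinTS) sketch: one Bayesian linear model per merge operator, a posterior sample to rank operators for the current similarity-feature context, and a rank-one posterior update from the observed reward. The dimensions, prior scale, and reward values are illustrative assumptions; the paper's bandit additionally uses a learned (neural-linear) feature map and warm-starts from logged pairwise merges.

```python
import numpy as np

class LinTSOperatorSelector:
    """Linear Thompson sampling over merge operators (arms): one Bayesian linear
    regression per arm, fit on similarity-feature contexts and observed rewards."""

    def __init__(self, operators, dim, prior_var=1.0, noise_var=0.25):
        self.operators = list(operators)            # e.g. ["linear", "slerp", "ties"]
        self.noise_var = noise_var
        # Per-arm posterior in precision form: A = Sigma^{-1}, mean = A^{-1} b.
        self.A = {op: np.eye(dim) / prior_var for op in self.operators}
        self.b = {op: np.zeros(dim) for op in self.operators}

    def select(self, context, rng):
        """Sample a weight vector from each arm's posterior and play the argmax."""
        scores = {}
        for op in self.operators:
            A_inv = np.linalg.inv(self.A[op])       # Sherman-Morrison could avoid this
            w = rng.multivariate_normal(A_inv @ self.b[op], A_inv)
            scores[op] = float(context @ w)
        return max(scores, key=scores.get)

    def update(self, op, context, reward):
        """Rank-one posterior update for the arm that was actually played."""
        self.A[op] += np.outer(context, context) / self.noise_var
        self.b[op] += context * reward / self.noise_var


# Hypothetical usage: the context is the similarity-feature vector for a candidate
# pair, and the reward is the downstream utility of the merge the chosen operator produced.
rng = np.random.default_rng(0)
bandit = LinTSOperatorSelector(["linear", "slerp", "ties"], dim=8)
for _ in range(5):
    context = rng.normal(size=8)
    op = bandit.select(context, rng)
    bandit.update(op, context, reward=rng.uniform(0.0, 1.0))
```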
Why does this matter?
- Saves a lot of time and compute: Instead of merging and testing many combinations, SimMerge predicts good choices upfront using cheap signals.
- More reliable model building: As organizations collect many specialized models, they need fast, scalable ways to combine them. SimMerge makes that practical.
- Flexible and future‑proof: It can add new merge operators, tasks, or models without starting from scratch, especially with the online bandit version.
- Better performance with fewer surprises: Because it adapts per case, it avoids the “one rule fits all” pitfalls where a single method works great on one task but harms another.
In short, SimMerge shows that “learning how to merge” models—using simple similarity checks—can reliably create strong, multi‑skill AI systems without the usual trial‑and‑error grind.
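Finally, to make the order-planning step (step 3 above) concrete, here is a minimal sketch of how a learned plan scorer could be used: enumerate candidate orders and per-step operators for a small set of models, score each plan from (approximately propagated) pairwise features, and keep the best one, all without executing a single merge. The stub `plan_scorer` and the simple averaging used for feature propagation are placeholders for the paper's learned MLP scorer and its metric-propagation bounds.

```python
from itertools import permutations, product

OPERATORS = ["linear", "slerp", "ties"]

def propagate_features(feat_merge_c, feat_new_c):
    """Rough stand-in for metric propagation: approximate the features between a new
    intermediate merge and a remaining model c by averaging the features its two
    parents had with c (treating the merge as an equal-weight mix)."""
    return {k: 0.5 * (feat_merge_c[k] + feat_new_c[k]) for k in feat_merge_c}

def score_plan(order, operators, pair_feats, plan_scorer):
    """Score one sequential k-way plan: fold the models in `order` together,
    one operator per step, without ever executing a merge."""
    # Features between the running (intermediate) merge and each remaining model.
    feats_to = {m: pair_feats[frozenset((order[0], m))] for m in order[1:]}
    total = 0.0
    for step, (nxt, op) in enumerate(zip(order[1:], operators)):
        total += plan_scorer(feats_to[nxt], op)          # predicted utility of this step
        feats_to = {m: propagate_features(feats_to[m], pair_feats[frozenset((nxt, m))])
                    for m in order[step + 2:]}
    return total

def best_plan(models, pair_feats, plan_scorer):
    """Brute-force search over orders and per-step operators (fine for small k)."""
    best = None
    for order in permutations(models):
        for ops in product(OPERATORS, repeat=len(models) - 1):
            score = score_plan(order, ops, pair_feats, plan_scorer)
            if best is None or score > best[0]:
                best = (score, order, ops)
    return best

# Hypothetical usage with three toy models and a stub scorer that prefers
# functionally close pairs (lower probe KL) and nudges toward TIES:
pair_feats = {frozenset(p): {"probe_kl": kl}
              for p, kl in [(("math", "code"), 0.8),
                            (("math", "rag"), 1.5),
                            (("code", "rag"), 1.2)]}
plan_scorer = lambda feats, op: -feats["probe_kl"] + (0.1 if op == "ties" else 0.0)
print(best_plan(["math", "code", "rag"], pair_feats, plan_scorer))
```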
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, focused list of what remains missing, uncertain, or unexplored in the paper. Each item is specific and actionable for future researchers.
- Generalization beyond a shared base: The method is only evaluated on checkpoints sharing the same pretrained base and architecture (Command-A 7B/111B). It remains unclear how SimMerge behaves when merging across different base families, architectures, tokenizer vocabularies, or training pipelines (e.g., LLaMA/Mistral vs. Command-A; different RoPE settings; adapter-based models).
- Operator coverage is narrow: Selection is limited to {Linear, SLERP, TIES}. Performance and feasibility with additional operators (e.g., EMR, ZipIt, Fisher merging, permutation/alignment-based operators, LoRA adapter merges) are not tested. How does SimMerge scale (sample complexity, feature requirements, accuracy) as the number and diversity of operators increase?
- Mixing coefficient fixed at α=0.5: The approach does not tune α per merge, per layer, or per operator. What is the joint benefit of learning α alongside operator selection, and does per-layer α improve outcomes? How does operator choice interact with α and with per-module mixing?
- Intermediate-merge feature propagation accuracy: Multi-way plan scoring relies on approximate propagation of similarity metrics for intermediate merges (convex bounds, triangle inequality; a worked example appears after this list). The accuracy of these approximations vs. recomputation is not quantified, especially for non-linear operators (e.g., TIES sign selection, SLERP geometry) and α≠0.5.
- Scalability to large catalogs and k≫4: The method is only evaluated up to 4-way merges. For catalogs with hundreds or thousands of checkpoints, computing pairwise features is O(N²), and plan enumeration grows factorially. What selection heuristics, pruning strategies, or submodular formulations can make plan scoring scalable for large k?
- Probe-set dependence and robustness: Similarity features depend on unlabeled probes drawn from each task’s input distribution (10k tokens). Sensitivity to probe size, domain mismatch, selection bias, privacy constraints, and distribution shift is not characterized. What are the minimal probe sizes, cross-task probe reuse strategies, and robustness under noisy/misaligned probes?
- Multi-task objective weighting: Plan selection aggregates predicted utility via macro-averaging over tasks, which may favor majority tasks or dilute minority-task objectives. How should task weights or Pareto-optimal objectives be set, and can SimMerge support constrained optimization (e.g., minimum acceptable performance on specified tasks)?
- Feature ablation and interpretability: The paper lacks a thorough, public ablation of feature channels (logit KL, activation cosine, weight cosine, attention similarity, norms). Which features drive operator/order selection per domain, and how robust are selections to feature noise or adversarial perturbations?
- Per-layer/operator granularity: Operator choice is made per merge step, not per layer/module. Would per-layer operator selection (or operator mixtures within a step) yield better results, and can the similarity signals support that finer granularity?
- Handling permutation misalignment: The selection does not explicitly address neuron/permutation alignment between finetuned models. How does SimMerge perform when models are substantially permuted, and can alignment-aware features or pre-alignment steps improve selection?
- Bandit variant practical constraints: The online bandit requires downstream evaluation to obtain rewards, reintroducing evaluation costs. Open questions include handling delayed/noisy rewards, non-stationarity (task drift), cold-start for new operators/models, contextual covariate shift, and multi-objective rewards (across several tasks simultaneously).
- Transfer across model families and scales: While 7B→111B transfer on the same base is promising, broader transfer (to different families, smaller/larger capacities, sparse/mixture-of-experts, quantized or pruned models) remains untested.
- Task coverage gaps: Evaluations omit safety/toxicity, factuality, long-context, tool-use, dialog quality, calibration, and robustness (adversarial or OOD). How does merging affect these properties, and can features be augmented to anticipate safety/regulatory risks?
- Selection cost vs. merge-and-evaluate: The paper does not provide end-to-end time/memory benchmarks of the selection pipeline (feature computation + plan scoring) vs. standard merge-and-evaluate searches, especially at large N and k. What are the cost curves and practical speed-ups?
- Multi-way plan search strategy: The number of candidate plans scored at test time, sampling/enumeration policies, and their impact on quality vs. compute are not specified. What sampling strategies or learned proposal mechanisms best trade off performance and cost?
- Operator hyperparameters: Operators like TIES often have tunable hyperparameters (e.g., thresholds/masks). SimMerge does not select/tune operator-specific hyperparameters. Can selection extend to per-operator hyperparameter optimization without reintroducing expensive search?
- Reliability across seeds and settings: While task evaluations use three seeds, the sensitivity of selection outcomes to random seeds, temperature, decoding strategies, and feature computation settings is not analyzed. What reproducibility guarantees and confidence intervals can be provided?
- Functional features for long generation: Logit KL and short-probe activations may not capture long-form generation behaviors (e.g., multi-step reasoning chains, code execution correctness beyond pass@1). What functional metrics better reflect long-horizon generation and composability?
- Fairness and cross-lingual interference: Merging could disproportionately harm minority languages or underrepresented domains. There is no analysis of cross-lingual interference, bias amplification, or fairness impacts. How should selection incorporate fairness constraints or per-language safeguards?
- Public reproducibility: The paper references internal datasets and probe construction; code, data, and detailed protocols for feature computation, plan scoring, and bandit warm-start are not provided. Releasing artifacts would enable external validation and stress testing across diverse settings.
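To illustrate the kind of propagation referred to in the intermediate-merge item above: for a linear merge $\theta_m = \alpha\,\theta_a + (1-\alpha)\,\theta_b$ and any third model $\theta_c$, convexity and the triangle inequality give an exact bound,

\[
\|\theta_m - \theta_c\|_2 = \bigl\|\alpha(\theta_a - \theta_c) + (1-\alpha)(\theta_b - \theta_c)\bigr\|_2 \le \alpha\,\|\theta_a - \theta_c\|_2 + (1-\alpha)\,\|\theta_b - \theta_c\|_2 .
\]

For SLERP or TIES no such exact relation is available, so propagated features can only serve as heuristics; quantifying that approximation error is precisely the open question raised above.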
Practical Applications
Immediate Applications
The following applications can be deployed now, leveraging SimMerge’s similarity-driven operator and merge-order selection to compose fine-tuned LLM checkpoints without expensive merge-and-evaluate loops.
- **Model catalog auto-composer for MLOps**
- Description: Automatically select merge operators and orders to consolidate many domain-specialized checkpoints (e.g., code, math, multilingual, RAG) into a single, versatile model release.
- Sectors: Software, enterprise AI platforms
- Tools/workflows: Merge planner integrated with a checkpoint registry (e.g., internal catalog, Hugging Face Hub), pre-merge feature computation pipelines using 10k-token unlabeled probes per task, cached forward-pass outputs, one-shot merge execution
- Assumptions/Dependencies: Models share a common pretrained base/architecture; candidate operators available (Linear, SLERP, TIES); unlabeled probes representative of target tasks; licensing allows parameter-space composition; monitoring for performance regressions on new domains
- **Enterprise LLM consolidation (single model for multi-department use)**
- Description: Compose department-specific fine-tunes (support, code-assist, multilingual QA, retrieval) into a unified production model to simplify deployment, ops, and governance.
- Sectors: Enterprise software, customer support, knowledge management
- Tools/workflows: CI/CD pipeline step that computes similarity signals, selects a multiway plan (order + operator per step), executes merges, and validates with spot checks
- Assumptions/Dependencies: Non-associativity requires careful order selection; equal-weight merges assumed; some performance trade-offs vs. per-domain experts; downstream spot-evaluation remains advisable
- **RAG + instruction specialists merge for enterprise search**
- Description: Merge instruction-following and RAG specialists to improve retrieval-augmented answering without retraining multi-task models.
- Sectors: Information retrieval, enterprise search, compliance knowledge bases
- Tools/workflows: Probe-set construction from typical query distributions; similarity-feature caching per checkpoint; periodic recomposition as new RAG specialists are added
- Assumptions/Dependencies: RAG tasks require representative probes (queries and contexts); care with data privacy in probe selection; retrieval stack remains unchanged
- **Continuous task onboarding via bandit SimMerge**
- Description: Use the contextual bandit variant to select merge operators online when new tasks/models appear, reducing regret vs. random choice and avoiding full retraining.
- Sectors: MLOps, platform teams
- Tools/workflows: Neural-linear bandit on top of learned features; warm-start from logged pairwise merges; LinTS policy for online adaptation under partial feedback
- Assumptions/Dependencies: Logged full-information pairwise data available; limited operator set; partial-feedback regime (no counterfactuals); ongoing minimal downstream evaluation for rewards
- **Academic baseline composition for multi-task studies**
- Description: Replace multi-task training with low-cost composition of domain experts to create strong baselines across math, code, multilingual, and RAG.
- Sectors: Academia, research labs
- Tools/workflows: Reproducible pipelines (MergeKit + SimMerge); probe sets derived from public benchmarks; macro-averaged reporting across tasks
- Assumptions/Dependencies: Shared base checkpoints; benchmark-aligned probe sets; published artifacts allow parameter merging and re-distribution
- **Healthcare subdomain consolidation (clinical assistant)**
- Description: Merge fine-tunes from subdomains (radiology report summarization, oncology QA, general clinical notes) into a single assistant to reduce model sprawl.
- Sectors: Healthcare
- Tools/workflows: De-identified, unlabeled probe sets per subdomain; strict audit of downstream performance; human-in-the-loop validation for safety-critical tasks
- Assumptions/Dependencies: Regulatory compliance (HIPAA/GDPR); domain shifts may require task encodings; safety and bias evaluation mandatory; legal permissions to merge checkpoints from different sources
- **Multilingual tutor composition for education**
- Description: Merge math-reasoning and multilingual understanding checkpoints to deliver a multilingual, step-by-step tutoring assistant.
- Sectors: Education
- Tools/workflows: Task encodings for language coverage; plan scorer to choose order (to avoid high interference); periodic recomposition as curricula evolve
- Assumptions/Dependencies: Representative probes per language and subject; careful evaluation for low-resource languages; maintaining consistency in pedagogy tone/style
- **Finance coding assistant with domain QA**
- Description: Combine code-generation specialists with financial QA models to support analytics scripting and compliance-aware explanations.
- Sectors: Finance, FinTech
- Tools/workflows: Probe sets from typical notebooks/queries; operator selection to balance reasoning and code robustness; governance checks for compliance and hallucination risks
- Assumptions/Dependencies: Proprietary data cannot be used for probes without safeguards; license compatibility; performance monitoring on high-stakes tasks
- **Government/NGO low-compute composition**
- Description: Resource-efficient way for agencies to compose vendor-provided checkpoints into a single model without large evaluation budgets.
- Sectors: Public sector, policy
- Tools/workflows: Lightweight similarity-feature pipelines; merge plan selection using unlabeled task probes; minimal downstream validation
- Assumptions/Dependencies: IP/contractual rights to merge; clear documentation and audit trails; conservative deployment with staged rollouts
- **On-device/edge assistant consolidation**
- Description: Merge small specialized models (home automation, calendar, shopping) into a single on-device assistant to reduce memory and update complexity.
- Sectors: Consumer apps, IoT
- Tools/workflows: Tiny probe sets reflecting local usage; compute-efficient feature extraction; offline merge execution before OTA updates
- Assumptions/Dependencies: Shared base architecture across small models; memory constraints favor equal-weight merges; careful privacy handling for probes
Long-Term Applications
The following applications are feasible but require further research, scaling, or development, such as broader operator support, multi-modal generalization, safety instrumentation, and governance frameworks.
- **AutoMerge platform integrated with model registries and orchestration**
- Description: End-to-end “Merge as a Service” that scores candidate plans, selects orders/operators, manages probe sets, and produces versioned composite artifacts with lineage.
- Sectors: Software platforms, cloud AI
- Tools/products: Registry plugins (MLflow, Hugging Face Hub), UI for plan scoring, lineage tracking, cost/performance trade-off dashboards
- Assumptions/Dependencies: Rich operator catalogs (EMR, ZipIt, TIES variants), scalable feature caching, robust metric propagation for intermediate merges, enterprise-grade observability
- **Federated and cross-organization model composition**
- Description: Compose models across institutions (e.g., hospital consortiums, educational cooperatives) without sharing raw data, using agreed probe protocols.
- Sectors: Healthcare, education, public sector
- Tools/products: Privacy-preserving probe exchange, audit logs, differential privacy for probe outputs, secure parameter-space composition
- Assumptions/Dependencies: Legal frameworks for cross-entity parameter merging; privacy guarantees; alignment of base architectures; governance over attribution and liability
- **Safety-aligned composition (capability + alignment specialists)**
- Description: Merge capability experts with safety/alignment-tuned checkpoints while preserving guardrails, using similarity signals and safety-specific features.
- Sectors: All regulated domains
- Tools/products: Safety-aware plan scorer (features for toxicity, jailbreak resistance), evaluation harnesses for red-teaming, policy-driven merge approvals
- Assumptions/Dependencies: New similarity channels capturing safety metrics; formal risk thresholds; cross-domain validation; potential need for non-equal mixing or layer-selective merges
- **Multi-modal expansion (vision, speech, action) of similarity-driven merging**
- Description: Extend SimMerge to VLMs and speech models with modality-specific similarity features and operators.
- Sectors: Robotics, accessibility tech, media
- Tools/products: Modality-aware probe sets (images, audio), multi-modal feature encoders, operator generalizations beyond LLMs
- Assumptions/Dependencies: Research on cross-modal parameter alignment, permutation matching, and functional metrics; careful treatment of attention patterns across modalities
- **Robotics policy and language-conditioned controller composition**
- Description: Compose task-specific policies or language-conditioned controllers to expand skill repertoires without full multi-task retraining.
- Sectors: Robotics, industrial automation
- Tools/products: Policy-feature analogs (trajectory divergence, activation similarity), safe merge validators, sim-to-real test benches
- Assumptions/Dependencies: Operators and features adapted to policy networks; safety constraints (collision, compliance); verified order selection to avoid catastrophic interference
- **Energy and climate forecasting model composition**
- Description: Merge domain forecasters (demand, weather, grid operations) to improve holistic planning and anomaly handling.
- Sectors: Energy, utilities
- Tools/products: Time-series probes, cross-model feature channels; cost-aware plan selection to meet latency and memory budgets
- Assumptions/Dependencies: Extend similarity signals to time-series architectures; new operators for non-LLM models; strong validation on rare-event regimes
- **Cost/latency-aware merge planning for production SLAs**
- Description: Integrate resource constraints into plan scoring to choose merges that optimize accuracy and runtime/memory for specific deployment targets (edge vs. cloud).
- Sectors: Cloud, mobile, IoT
- Tools/products: Multi-objective plan scorer (accuracy, memory, latency), hardware-aware operators, deployment profilers
- Assumptions/Dependencies: Accurate cost models; operator-level performance profiles; potential non-equal α tuning; CI pipelines for A/B validation
- **Governance, audit, and IP policy for composed models**
- Description: Establish standards to document derivations, licenses, data-use constraints, and safety attestations for merged checkpoints.
- Sectors: Policy, legal, compliance
- Tools/products: Model lineage watermarks, license compatibility checkers, audit trails, conformance tests
- Assumptions/Dependencies: Consensus on attribution norms; tooling to track provenance through weight-space merges; sector-specific regulations (e.g., medical device AI)
- **District-/enterprise-scale education assistants across curricula**
- Description: Maintain hundreds of curriculum-specific fine-tunes and compose them into unified assistants per grade/subject with controlled interference via optimized orders.
- Sectors: Education
- Tools/products: Curriculum-aware probe libraries, plan selection services, teacher-in-the-loop evaluation portals
- Assumptions/Dependencies: Scaling plan scorer to large K; improved metric propagation; continual monitoring for pedagogical consistency and bias
- **Regulatory validation pipelines for safety-critical domains**
- Description: Formalize pre-merge similarity thresholds and post-merge validation protocols as risk indicators and certification steps.
- Sectors: Healthcare, finance, aviation
- Tools/products: Validation suites mapping similarity scores to expected risk, standardized reporting, external certification hooks
- Assumptions/Dependencies: Empirical links between similarity signals and failure modes; sector standards; independent auditing capacity
- **Training pipeline re-architecture (compose instead of multi-task train)**
- Description: Adopt merge-centric model development: train domain experts, then compose at scale to ship variants per market/task without re-training monoliths.
- Sectors: AI product companies, platforms
- Tools/products: AutoMerge schedulers, operator catalogs, routine recomposition workflows, artifact management and rollback
- Assumptions/Dependencies: Robustness under frequent recomposition; improved operators (associativity, interference control); systematic probe maintenance as tasks evolve
Notes on general feasibility:
- SimMerge currently assumes shared pretrained bases and architectures, equal-weight merges, and a fixed operator set; extensions may require new operators, non-equal mixing, or layer-selective strategies.
- Unlabeled probe sets are critical; their representativeness and privacy/safety implications must be addressed per domain.
- Non-associativity makes order selection central; multiway scoring relies on metric propagation approximations that should be validated in edge cases.
- Even with prediction-based selection, minimal downstream evaluation is recommended for high-stakes deployments.
Glossary
- Adam optimization: A gradient-based optimizer that adapts learning rates per parameter using estimates of first and second moments. "We use a two-layer network with ReLU activations and Adam optimization."
- Attention patterns: The structures of attention weights in transformer models that indicate how tokens attend to each other. "cosine similarity of weights and attention patterns"
- Auxiliary model: A checkpoint not fine-tuned on the target task, merged to broaden capabilities. "We refer to any model that is not fine-tuned on t as an auxiliary (off-domain) model for task t."
- Bayesian linear regression: A probabilistic linear model that maintains a posterior distribution over parameters, updated with observed data. "a Gaussian posterior maintained via Bayesian linear regression."
- Cauchy-Schwarz: An inequality in inner-product spaces used to bound similarities. "bounds derived from Cauchy-Schwarz."
- Context vector: The feature representation describing a merge step for decision-making. "We observe a context vector derived from pre-merge similarity features."
- Contextual bandit: A learning framework where an agent chooses actions based on context and observes only the chosen action’s reward. "We develop an online variant based on a contextual bandit that updates under partial feedback"
- Convexity: A property of functions or sets used to derive bounds and mixtures. "convexity of f-divergences yields upper bounds"
- Cosine similarity: A measure of angle-based similarity between vectors (e.g., weights or activations). "cosine similarities between intermediate activations"
- Counterfactual rewards: The unobserved rewards of actions not taken in bandit settings. "counterfactual rewards for unchosen operators are not observed."
- Cross-entropy loss: A classification loss measuring the divergence between predicted and true distributions. "and is trained with a cross-entropy loss to predict ."
- Distribution shift: A change in data or task distribution between training and deployment. "we introduce a distribution shift by adding a checkpoint trained on a different task that did not appear in the logged data."
- Exact match: An evaluation metric where predictions must exactly match the ground truth. "Math is evaluated on MATH and GSM8K using exact match."
- Equal-weight mixture: A combination of models or distributions with uniform mixing coefficients. "we treat the intermediate model as an equal-weight mixture of its two parents on the probe set"
- Euclidean distance: The L2 norm distance between parameter vectors. "Euclidean distance "
- f-divergences: A family of divergences measuring differences between probability distributions. "convexity of f-divergences yields upper bounds"
- GapClosed: A normalized metric indicating how much of the expert–auxiliary gap a merge recovers. "reported in terms of GapClosed."
- Instruction-following: Tasks where models must comply with natural-language instructions. "In the instruction-following setting used for the bandit experiments, we report scores on IFEval"
- Kullback–Leibler divergence: A measure of how one probability distribution diverges from another. "including the Kullback-Leibler divergence $D_{\mathrm{KL}}(p_a \,\|\, p_b)$ between predictive distributions"
- Layerwise: Operating per model layer rather than on the whole parameter vector at once. "We work with layerwise binary merge operators indexed by ."
- LinTS: Linear Thompson sampling; a Bayesian bandit algorithm that samples from arm posteriors to balance exploration and exploitation. "We consider both LinUCB and linear Thompson sampling (LinTS)"
- LinUCB: A linear upper-confidence-bound bandit algorithm that selects actions by optimistic estimates over uncertainty. "We consider both LinUCB and linear Thompson sampling (LinTS)"
- Logits: Pre-softmax model outputs representing unnormalized log probabilities. "KL divergence between model logits, cosine similarity of weights and attention patterns"
- Macro-averaging: Averaging metrics equally across tasks or classes to produce a single aggregate score. "using macro-averaging over ."
- Merge-and-evaluate search: An empirical procedure that tries many merge configurations and evaluates each downstream. "run expensive merge-and-evaluate searches to select the best merge."
- Merge operator: A rule that combines two models’ parameters into merged parameters. "Common merge operators are not associative"
- Merge order: The sequence in which models are merged, affecting non-associative operators’ outcomes. "For the merge order affects the result"
- Merge plan: An ordered sequence describing a k-way merge’s steps and operators. "a $k$-way merge plan is an ordered sequence"
- MLP: Multi-layer perceptron; a feedforward neural network used for scoring or classification. "In practice $f_{\mathrm{plan}}$ is an MLP"
- Multiway merging: Merging more than two models in sequence (e.g., 3-way, 4-way). "Predictive selection is especially valuable for multiway merging"
- Neural-linear: A bandit architecture combining neural feature maps with linear posterior models for arms. "We adopt a neural-linear design."
- Non-associative: An operation where grouping changes the result, so order matters. "Common merge operators are not associative"
- Oracle policy: An idealized policy that always picks the best action with full information. "an oracle that, for each round, plays the best operator in hindsight given full knowledge of all utilities."
- Parameter averaging: Combining model parameters by averaging, often used to merge checkpoints. "Early approaches such as simple parameter averaging and related weight-space operations"
- Parameter space: The high-dimensional space of model parameters in which merging operates. "composing models in parameter space"
- Pass@1: A code-generation metric measuring if the first attempt passes unit tests. "Code generation is evaluated with pass@1 on HumanEval_Python and MBPP+ benchmark"
- Permutation alignment: Aligning neuron permutations across models to facilitate merging. "subspace matching and permutation alignment"
- Probe set: A small unlabeled dataset drawn from a task’s input distribution used to compute similarity features. "we draw an unlabeled probe set "
- Probe tokens: The number of tokens in probe inputs used to compute functional similarity. "we use only 10000 probe tokens per task"
- RAG: Retrieval-augmented generation; models use external documents retrieved at inference to produce answers. "across multilingual, code, math, and RAG"
- Rank-one updates: Efficient posterior updates using low-rank corrections in Bayesian linear models. "posterior updates can be done with rank-one updates in time per round"
- ReLU activations: Rectified Linear Unit nonlinearity used in neural networks. "We use a two-layer network with ReLU activations"
- Regret: The performance difference between the learner and an oracle over time. "minimizes regret with respect to an oracle"
- Reward: The scalar utility observed after selecting a merge operator in bandit rounds. "observe a scalar reward "
- Sign-consistent merge: A merging method that enforces consistent weight signs across models. "Ties (trim, elect sign and merge) is a sign-consistent merge."
- SLERP: Spherical linear interpolation; merges parameters along a geodesic on a hypersphere. "spherical linear interpolation (SLERP)"
- Subspace matching: Aligning representational subspaces across models to enable more reliable merging. "subspace matching and permutation alignment"
- Task-agnostic: Methods or features not tailored to a specific task, usable across tasks. "task-agnostic similarity measurements"
- Thompson sampling: A Bayesian bandit strategy that samples from the posterior to select actions. "We consider both LinUCB and linear Thompson sampling (LinTS)"
- TIES merging: A merge operator that trims, elects a sign, and merges to reduce interference. "SimMerge selects among linear interpolation, spherical linear interpolation (SLERP), and TIES merging."
- Triangle inequality: A norm property used to bound distances when propagating metrics. "the triangle inequality gives"
- Unlabeled probes: Data without ground-truth labels used to compute pre-merge similarity signals. "From a small set of unlabeled probes"
- Warm-start: Initializing model parameters or posteriors using pre-collected data before online adaptation. "We initialize and warm-start it using the logged pairwise data"
- Weight-based measures: Similarity metrics computed directly on model parameters rather than outputs. "We then compute weight-based measures that compare models directly in parameter space"
- Win-rate: An evaluation metric indicating the percentage of test cases where a model’s output is judged better. "using win-rates and accuracy."