Chain of Merges (CoM): Methods & Applications
- Chain of Merges (CoM) is a versatile framework that applies sequential, structure-preserving merge operations to integrate model parameters, sampling algorithms, and combinatorial structures.
- Layer-wise neural network merging under CoM leverages auto-regressive updates and closed-form least-squares to maintain activation statistics and minimize covariate shift.
- Applications of CoM span LLM fusion, reversible MCMC redistricting, and lattice enumeration, emphasizing efficiency via local masking and exact, reversible computations.
Chain of Merges (CoM) refers to a class of methods and mathematical frameworks in which the merging of atomic units—whether weights in neural networks, combinatorial chains, or substructures in sampling algorithms—is performed as a sequence or hierarchy of merge operations. The “Chain of Merges” paradigm appears in diverse contexts, including LLM parameter fusion, layer-wise model aggregation, reversible Markov chain Monte Carlo (MCMC) algorithms for partition sampling, and the enumeration of combinatorial merges of ordered sets. Each usage exploits the principle of progressive, structure-preserving merges to solve complex problems of integration, sampling, or enumeration.
1. Layer-wise Model Merging: CoM for Deep Network Aggregation
The layer-wise Chain of Merges approach is motivated by the problem of merging multiple fine-tuned models with identical architectures into a single unified model, while minimizing the degradation resulting from cross-layer covariate shift (Buzzega et al., 29 Aug 2025). Prior activation-based merging methods treat each layer independently, leading to “merging covariate shift” (MCS): after merging early layers, their output distributions—serving as inputs to downstream layers—no longer match the unmerged statistics, compounding mismatches and impairing performance.
The CoM procedure addresses MCS by updating activation statistics in a strictly auto-regressive (causal) fashion throughout the merge process. At each layer $\ell$, the merged weights $W_\ell^\star$ are obtained via a closed-form least-squares regression that minimizes

$$\mathcal{L}_\ell = \sum_t \big\| W_\ell \hat{X}_\ell^{(t)} - Y_\ell^{(t)} \big\|_F^2,$$

where the $\hat{X}_\ell^{(t)}$ are activations produced from the (already) partially merged lower layers and the $Y_\ell^{(t)}$ are the corresponding layer outputs of each fine-tuned model $t$. The recursive update ensures that for every layer, the merging loss and Gram matrices are computed using the input distribution that will be encountered at inference. Sensitivity weighting, activation normalization, and Tikhonov regularization are optional enhancements that improve numerical conditioning and sample efficiency.
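The auto-regressive update can be sketched for a stack of linear layers as follows. This is a minimal illustration under simplifying assumptions: the function names and shapes are ours, a ReLU nonlinearity is assumed between layers, and the exact regression targets of the paper may differ.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def chain_of_merges(models, calib, ridge=1e-4):
    """Merge the linear layers of several fine-tuned models, CoM-style (sketch).

    models: list over tasks t; models[t] is a list of weight matrices (d_out x d_in).
    calib:  list over tasks t; calib[t] is a (d_in x n_t) batch of task-t inputs.
    ridge:  Tikhonov regularization on the Gram matrix (numerical conditioning).
    """
    n_layers = len(models[0])
    # Activations of each *original* model on its own data (regression targets) ...
    x_orig = [c.copy() for c in calib]
    # ... and of the partially *merged* network on the same data (regressors).
    x_merged = [c.copy() for c in calib]
    merged = []
    for l in range(n_layers):
        d_in = x_merged[0].shape[0]
        gram = ridge * np.eye(d_in)          # Tikhonov-regularized Gram matrix
        cross = np.zeros((models[0][l].shape[0], d_in))
        for t in range(len(models)):
            y_t = models[t][l] @ x_orig[t]   # what model t actually produced here
            gram = gram + x_merged[t] @ x_merged[t].T
            cross = cross + y_t @ x_merged[t].T
        w_star = cross @ np.linalg.inv(gram)  # closed-form least-squares merge
        merged.append(w_star)
        # Auto-regressive update: the next layer is fit against the statistics
        # it will actually see at inference, removing merging covariate shift.
        x_orig = [relu(models[t][l] @ x_orig[t]) for t in range(len(models))]
        x_merged = [relu(w_star @ x_merged[t]) for t in range(len(models))]
    return merged
```

Because the regressors are activations of the already-merged prefix, each layer's Gram matrix reflects the distribution the merged network encounters at test time, which is the causal property the ablations in the paper isolate.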
Empirically, CoM demonstrates substantial improvements: with as few as two samples per client, the merged models achieve 91.7% mean normalized accuracy on vision tasks (ViT-B/32), compared to 70.7% for the best prior baseline, and similarly significant gains in NLU domains. Ablation studies confirm the centrality of the auto-regressive update—removing this reduces performance by over 20 points (Buzzega et al., 29 Aug 2025). Theoretical analysis shows that only the causal CoM update eliminates MCS by maintaining globally consistent input/output statistics layer-wise.
2. Selective Merging in Chain-of-Thought LLMs: RCP-Merging
In the context of LLMs combining multi-step chain-of-thought (CoT) reasoning capability and strong domain expertise, the Chain of Merges philosophy is instantiated by RCP-Merging (Reasoning Capability as Prior Merging) (Yang et al., 5 Aug 2025). This approach fuses an LLM trained for long CoT reasoning (“reasoning model”) with a domain-specific LLM (“domain model”) while avoiding catastrophic forgetting (e.g., gibberish or output collapse):
- The reasoning model’s weights serve as a Bayesian prior, with the Fisher Information Matrix (FIM), computed on calibration data, quantifying each parameter’s importance for CoT reasoning.
- The domain model’s parameter importance is measured using loss gradients on domain data.
- For each parameter, a reasoning-preservation indicator and a domain-sensitivity score are combined into a conflict score. Only domain weights whose updates do not significantly interfere with high-FIM reasoning directions are merged, producing a binary mask used for the final merge.
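The mask construction can be sketched in a few lines of numpy. The quantile thresholds and the specific conflict rule below are illustrative assumptions, not the paper's exact formulas:

```python
import numpy as np

def rcp_merge_mask(theta_reason, theta_domain, fisher_reason, grad_domain,
                   fisher_quantile=0.8, sens_quantile=0.8):
    """Selective mask-based merge in the spirit of RCP-Merging (sketch).

    theta_*:       flat parameter vectors of the reasoning / domain model
    fisher_reason: diagonal Fisher estimate on reasoning calibration data
    grad_domain:   per-parameter loss-gradient magnitudes on domain data
    """
    # High Fisher value => parameter is important for chain-of-thought reasoning.
    protect = fisher_reason >= np.quantile(fisher_reason, fisher_quantile)
    # Large domain gradient => parameter carries domain expertise.
    sensitive = grad_domain >= np.quantile(grad_domain, sens_quantile)
    # Conflict: the domain update would overwrite a reasoning-critical direction.
    conflict = protect & sensitive
    # Merge only domain weights that are domain-sensitive *and* non-conflicting.
    mask = sensitive & ~conflict
    merged = np.where(mask, theta_domain, theta_reason)
    return merged, mask
```

The key design choice is the asymmetry: reasoning-critical coordinates are never overwritten, so the CoT backbone survives the merge even when domain gradients are large.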
This selective mask-based merge process preserves the backbone of CoT-relevant parameters while admitting essential domain contributions. Extensive benchmarking demonstrates that RCP-Merging achieves state-of-the-art combined performance, surpassing the best prior merging baselines by 9.5% in BioMedicine and 9.2% in Finance, while maintaining output diversity and fluency and avoiding collapse (Yang et al., 5 Aug 2025). The method is efficient, requiring only 100–500 unlabeled calibration samples for each of the domain and reasoning models, and is orders of magnitude cheaper than full joint fine-tuning.
3. Multi-Scale Merge-Split MCMC: CoM for Redistricting Plan Sampling
Chain of Merges is also foundational to the Multi-Scale Merge-Split Markov chain developed for sampling redistricting plans under population-balance and community-contiguity constraints (Autry et al., 2020). The state space consists of partitions of the base graph into districts, each represented by hierarchical spanning trees, and a set of linking edges connecting adjacent districts. The algorithm alternately:
- Merges two districts by joining their trees and resampling a hierarchical tree on the union (using random spanning tree techniques and multi-scale expansions).
- Splits the merged district along a “cuttable” edge, creating two new districts.
By maintaining a forest-of-hierarchical-trees state augmented with linking edges, the CoM sampler ensures each move is reversible and has a closed-form transition probability, remedying irreversibility and computational inefficiency in earlier ReCom-type algorithms. The multiscale hierarchy allows algorithmic cost to scale logarithmically in the base graph size. Mixing and scaling experiments confirm that the CoM Markov chain converges (under mild connectivity assumptions) and is computationally viable at the full precinct or even block level (Autry et al., 2020).
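To make the merge-split mechanics concrete, here is a toy sketch of a single proposal on a small graph: resample a uniform spanning tree of the merged district with Wilson's algorithm, then cut a population-balancing edge. The tolerance rule is a simplifying assumption, and the Metropolis-Hastings acceptance step and the multi-scale hierarchical trees of the actual sampler are omitted:

```python
import random
from collections import defaultdict

def wilson_tree(nodes, adj, rng):
    """Uniform spanning tree of the induced subgraph on `nodes` (Wilson's algorithm)."""
    nodes = sorted(nodes)
    node_set = set(nodes)
    in_tree = {nodes[0]}
    parent = {}
    for start in nodes[1:]:
        u = start
        while u not in in_tree:          # random walk; overwriting parent[u] on
            parent[u] = rng.choice([v for v in adj[u] if v in node_set])
            u = parent[u]                # revisits performs the loop erasure
        u = start
        while u not in in_tree:          # commit the loop-erased path
            in_tree.add(u)
            u = parent[u]
    return [(u, parent[u]) for u in parent]

def balanced_cuts(tree_edges, pop, tol):
    """Tree edges whose removal yields two parts of near-equal population."""
    tadj = defaultdict(set)
    for u, v in tree_edges:
        tadj[u].add(v)
        tadj[v].add(u)
    total = sum(pop[u] for u in tadj)
    cuts = []
    for (u, v) in tree_edges:
        seen, stack = {u}, [u]           # component containing u with (u, v) deleted
        while stack:
            w = stack.pop()
            for x in tadj[w]:
                if x not in seen and (w, x) != (u, v):
                    seen.add(x)
                    stack.append(x)
        side_pop = sum(pop[w] for w in seen)
        if abs(2 * side_pop - total) <= tol * total:
            cuts.append(((u, v), frozenset(seen)))
    return cuts

def merge_split(dist_a, dist_b, adj, pop, rng, tol=0.1):
    """Merge two adjacent districts, resample a spanning tree, and split it again."""
    union = dist_a | dist_b
    tree = wilson_tree(union, adj, rng)
    cuts = balanced_cuts(tree, pop, tol)
    if not cuts:
        return dist_a, dist_b            # no balanced cut: reject the proposal
    _, side = rng.choice(cuts)
    return set(side), union - side
```

In the actual sampler each proposal is accepted with a Metropolis-Hastings ratio that has a closed form in terms of the spanning-tree and cuttable-edge counts, which is what makes the chain reversible.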
4. Combinatorial Mergings of Chains: Enumeration and Lattice Structure
CoM also denotes a combinatorial structure: the set of all “proper mergings” of two disjoint chains into a single quasi-order that extends each chain’s original order without identification of elements across chains (Mühle, 2012). Formally, for chains $C_1$ and $C_2$, a proper merging can be encoded as a pair of relations satisfying precise product-containment and exclusivity conditions, guaranteeing the integrity of the original orderings of $C_1$ and $C_2$.
The set of proper mergings of an $m$-chain and an $n$-chain is in bijection with a family of bounded plane partitions, imposing a natural distributive lattice structure. Enumeration is given by MacMahon’s box-counting formula

$$\prod_{i=1}^{a}\prod_{j=1}^{b}\prod_{k=1}^{c}\frac{i+j+k-1}{i+j+k-2},$$

with box dimensions determined by the chain lengths.
Specializations recover Catalan-like and Narayana numbers. A distinguished subfamily of proper mergings corresponds bijectively to Galois connections between the two chains, and the enumeration of such connections is classical.
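MacMahon's box formula is easy to evaluate exactly with rational arithmetic. The helper name below is ours, and relating the box dimensions $(a, b, c)$ to the chain lengths is left to the cited enumeration:

```python
from fractions import Fraction

def macmahon_box(a, b, c):
    """Count plane partitions that fit in an a x b x c box (MacMahon's formula)."""
    prod = Fraction(1)
    for i in range(1, a + 1):
        for j in range(1, b + 1):
            for k in range(1, c + 1):
                prod *= Fraction(i + j + k - 1, i + j + k - 2)
    assert prod.denominator == 1       # the product is always an integer
    return prod.numerator
```

For example, a 2 x 2 x 2 box admits 20 plane partitions, and setting c = 1 recovers binomial coefficients, i.e., monotone lattice paths in an a x b grid.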
5. Theoretical Significance and Unifying Structural Principles
Despite their disparate applications, all CoM frameworks exhibit certain shared principles:
- Hierarchical/Sequential Merging: Each variant performs merges as a chain, where the output of one operation informs or constrains subsequent merges—whether recursively (neural network layers), structurally (spanning forests), or order-preservingly (combinatorial chains).
- Preservation of Structure: The merge process is designed to conserve key system properties—reasoning capability in LLM weights, activation/statistics alignment in deep models, contiguity and balance in spatial redistricting, and original chain orderings in poset enumeration.
- Efficiency via Locality/Masking: By restricting merges via masking (LLM setting), block-wise operations (multi-scale MCMC), or structural constraints (combinatorial bonds), the computational and informational overhead is sharply reduced compared to naive or global merges.
- Exactness, Reversibility, or Closed-form Computation: Many CoM algorithms possess exact update rules (least-squares in networks, closed-form Metropolis-Hastings transitions in MCMC, explicit enumeration in combinatorics), supporting stability, analytical tractability, and robustness.
The adoption of a Chain of Merges philosophy thus provides rigorous frameworks for integrating heterogeneous or hierarchical structures while retaining essential features, enabling new advances in model merging, graph partition sampling, and combinatorial enumeration.