
Split-then-Merge: Algorithmic Frameworks

Updated 29 November 2025
  • Split-then-Merge (StM) is a framework that partitions and combines data entities using criteria like density and likelihood to enable scalable and accurate modeling.
  • It is applied in unsupervised learning, trajectory clustering, and Bayesian inference, where explicit split and merge rules optimize computation and improve stability.
  • The methodology offers theoretical guarantees and enhanced mixing times, with demonstrated effectiveness in MCMC samplers, consensus protocols, and large-scale inference.

Split-then-Merge (StM) denotes a family of algorithmic frameworks and stochastic processes based on the fundamental operations of splitting and merging entities (clusters, trajectories, partitions, Markov chain states, or physical ensembles) in order to achieve robust inference, efficient computation, or accurate modeling of complex systems. This paradigm is widely utilized in unsupervised learning, probabilistic modeling, trajectory analysis, combinatorial optimization, consensus protocols, and random process theory. The canonical workflow alternates between splitting composite structures into more granular subunits and merging compatible subunits into coherent aggregates, leveraging problem-specific criteria (density, likelihood, similarity, etc.) to guide these transitions.

1. Mathematical Formulation and General Principles

Split-then-Merge frameworks are characterized by a state-space of composite entities (e.g., clusters, cycle partitions, trajectories), with dynamics orchestrated using explicit rules for split and merge events. The system’s evolution is either deterministic (e.g., sequential graph optimization), or stochastic (Markov chain or MCMC). At each step, two primitives are invoked:

Split: Partition an entity into sub-entities based on a criterion (density, anomaly, similarity, likelihood drop). Merge: Coalesce two (or more) entities if a compatibility criterion is met (density-connection, proximity, model evidence gain).

For example, in the Markovian split-and-merge process for ranked partitions $p=(p_1,p_2,\ldots)$ of unit mass (Ioffe et al., 2019):

  • Merge: Blocks $p_i$, $p_j$ coalesce at rate $p_i\,p_j$.
  • Split: Block $p_i$ splits into $u\,p_i$ and $(1-u)\,p_i$ (uniform $u\in[0,1]$) at rate $p_i^2$.

The infinitesimal generator $L$ acts on test functions $f$ by:

$$L\,f(p) = \sum_{i<j} p_i\,p_j\,[f(M_{ij}p)-f(p)] + \sum_i p_i^2 \int_{0}^{1} [f(S_i^u p) - f(p)]\,du$$
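The jump dynamics defined by this generator can be simulated directly. Below is a minimal Gillespie-style sketch; the function name `stm_step` and the list-based representation of the partition are illustrative choices, not from the papers:

```python
import random

def stm_step(p, rng=random):
    """One jump of the split-and-merge chain on a mass partition p (sums to 1).

    Merge rate for blocks (i, j): p_i * p_j; split rate for block i: p_i^2,
    with a uniform split point u, matching the generator above.
    """
    n = len(p)
    merges = [(p[i] * p[j], ("merge", i, j)) for i in range(n) for j in range(i + 1, n)]
    splits = [(p[i] ** 2, ("split", i)) for i in range(n)]
    events = merges + splits
    total = sum(rate for rate, _ in events)
    # Select an event with probability proportional to its rate.
    x = rng.random() * total
    for rate, ev in events:
        x -= rate
        if x <= 0:
            break
    if ev[0] == "merge":
        _, i, j = ev
        p = [v for k, v in enumerate(p) if k not in (i, j)] + [p[i] + p[j]]
    else:
        _, i = ev
        u = rng.random()
        p = [v for k, v in enumerate(p) if k != i] + [u * p[i], (1 - u) * p[i]]
    return sorted(p, reverse=True)  # keep the partition ranked
```

Starting from a single unit block and iterating `stm_step` produces a ranked random partition whose mass is conserved at every jump.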

Analogous split-merge primitives operate in discrete clustering and inference frameworks, typically guided by density-based or model-evidence-based rules.

2. StM in Trajectory Clustering

In stable trajectory clustering (Rahmani et al., 30 Apr 2025), the StM paradigm is applied to piecewise-linear trajectories $\pi_k = (\overrightarrow{P_k^1 P_k^2}, \ldots)$, which are decomposed into line segments for clustering:

Split phase: For each cluster at time $i-1$, if its line segments violate DBSCAN density-connectedness at time $i$, the cluster is split according to a local density criterion:

  • A segment $l$ is a core segment if $|N_\varepsilon(l)|\ge\mathrm{MinLns}$;
  • a split occurs if density-reachability fails between subsets, and the resulting sub-clusters are labeled as "dense" or "low-density".

Merge phase: After splitting, any two clusters $c_a$, $c_b$ are merged if they contain at least one pair of segments satisfying $\mathrm{dist}(l,l')\le\varepsilon$.
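The merge phase amounts to a pairwise density check over segment clusters. The sketch below is a simplified illustration: a midpoint distance stands in for the paper's line-segment distance, and the names `merge_clusters` and `midpoint_dist` are hypothetical:

```python
import math

def midpoint_dist(seg_a, seg_b):
    """Euclidean distance between segment midpoints; segments are
    ((x1, y1), (x2, y2)) pairs. A placeholder for the paper's segment distance."""
    (ax1, ay1), (ax2, ay2) = seg_a
    (bx1, by1), (bx2, by2) = seg_b
    return math.hypot((ax1 + ax2 - bx1 - bx2) / 2, (ay1 + ay2 - by1 - by2) / 2)

def merge_clusters(clusters, eps, dist):
    """Merge any two clusters containing at least one segment pair within eps,
    repeating until no further merge applies (the merge phase above)."""
    clusters = [list(c) for c in clusters]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if any(dist(l, m) <= eps for l in clusters[a] for m in clusters[b]):
                    clusters[a].extend(clusters.pop(b))  # coalesce cluster b into a
                    merged = True
                    break
            if merged:
                break
    return clusters
```

The outer loop restarts after every merge so that transitively connected clusters end up in a single aggregate.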

Clustering histories $C_{hist}[i]$ are maintained for each interval, enabling whole-trajectory clustering (trajectories share membership in all intervals) or sub-trajectory clustering (sliding window over $C_{hist}$). Stability is handled by post-processing: trajectories labeled as outliers due to temporary deviations are reassessed via the mean absolute deviation; those with sufficiently small deviation counts and magnitudes are re-merged to their nearest cluster.

Algorithmic complexity scales as $O(n^2\,m)$ for whole-trajectory clustering, and $O((m/S)\,n^2\,W)$ for sub-trajectory analysis.

3. Split-Merge Algorithms in Bayesian Inference and Clustering

StM algorithms are fundamental to advanced MCMC samplers for Bayesian mixture models and nonparametric inference:

3.1 Particle Gibbs Split-Merge MCMC

The PGSM sampler (Bouchard-Côté et al., 2015) constructs split-merge transitions via a conditional SMC sweep over potential partitions: two anchor points are sampled; all data in their clusters is considered for split/merge by a particle Gibbs scan satisfying detailed balance without requiring complex Metropolis-Hastings ratios. Forward kernels implement partition proposals; merges occur directly and splits use a reference particle path and adaptive resampling.

3.2 StM in Dirichlet Process and Hierarchical Models

Split-merge MCMC algorithms accelerate mixing for Dirichlet processes and HDP topic models (Wang et al., 2012, Duan et al., 2018, Peixoto, 2020) by proposing moves that reassign all points in a candidate cluster (split) or combine clusters (merge), using model evidence and prior factors to construct valid acceptance ratios. For hierarchical Dirichlet processes, splits and merges operate at the table/topic assignment level, with restricted Gibbs sweeps and sequential allocation to obtain proposal probabilities:

  • Split acceptance:

$$A_{\mathrm{split}} = \min\left\{1,\ \frac{p(c')\,L(c')\,q(c'\to c)}{p(c)\,L(c)\,q(c\to c')}\right\}$$
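In practice this ratio is evaluated in the log domain to avoid numerical underflow of likelihoods. A minimal sketch, with illustrative argument names (log prior, log likelihood, and log proposal probabilities for the current state $c$ and the proposed split $c'$):

```python
import math

def split_accept_prob(logp_new, loglik_new, logq_rev,
                      logp_old, loglik_old, logq_fwd):
    """Metropolis-Hastings acceptance probability for a split proposal c -> c':

        A = min(1, [p(c') L(c') q(c'->c)] / [p(c) L(c) q(c->c')])

    All arguments are log-domain quantities; clamping the log ratio at zero
    before exponentiating avoids overflow for strongly favored proposals.
    """
    log_ratio = (logp_new + loglik_new + logq_rev) - (logp_old + loglik_old + logq_fwd)
    return math.exp(min(log_ratio, 0.0))
```

The same function serves for merge proposals by swapping the roles of the forward and reverse proposal probabilities.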

3.3 Scaling Up StM via Locality-Sensitive Sampling

MinHash-based proposals (Luo et al., 2018) enable informative and efficient split-merge MCMC moves at sublinear computational cost, exploiting collision probability calculations for cluster separation. This achieves favorable scaling on extremely large datasets, yielding convergence speedup factors of up to $6\times$ compared to restricted-Gibbs approaches.
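The collision-probability idea can be illustrated with a basic MinHash estimator of Jaccard similarity between clusters viewed as item sets. This sketch assumes, for illustration only, that Python's built-in `hash` with random salts is an adequate hash family; it is not the paper's implementation:

```python
import random

def minhash_signature(items, num_hashes, seed=0):
    """MinHash signature of a set: for each salted hash function, record the
    minimum hash value over the set's items. The fraction of colliding slots
    between two signatures estimates the Jaccard similarity of the sets,
    which informs cheap split/merge candidate selection."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard(A, B) as the collision rate of the two signatures."""
    hits = sum(a == b for a, b in zip(sig_a, sig_b))
    return hits / len(sig_a)
```

Because signatures are short and comparable in constant time, similarity queries avoid the pairwise scans that make naive merge-candidate search quadratic.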

4. Split-then-Merge for Trajectory and Formation Planning

The StM paradigm is employed in multi-agent and robot formation path planning (Pereyra et al., 2019, Pereyra et al., 2019). Here, the problem involves finding optimal paths for a robot formation across a graph, enabling splits and merges to minimize cost subject to physical, collision, and synchronization constraints.

States encode not only node position but also robot count, and each edge is associated with a cost vector $c^e = (c^e_1,\ldots,c^e_R)$. As robots traverse the graph, sub-formations may split to take distinct branches; upon reconverging, merge operations synchronize movements and update cost labels. Determinism and completeness are ensured; all splits and merges are explicitly tracked, and the extended Dijkstra algorithm is guaranteed optimal.
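The count-dependent edge costs can be illustrated with a Dijkstra search over (node, robot-count) states. This is only the cost-vector relaxation step: the sketch keeps the formation intact and omits the papers' split and merge branching, and all names are illustrative:

```python
import heapq

def formation_dijkstra(graph, source, target, robots):
    """Shortest path for a formation of `robots` robots.

    `graph[u]` maps a neighbor v to a cost vector (c_1, ..., c_R); moving
    r robots along edge (u, v) costs c_r. States are (node, robot count),
    so the same graph supports different formation sizes.
    """
    dist = {(source, robots): 0.0}
    heap = [(0.0, source, robots)]
    while heap:
        d, u, r = heapq.heappop(heap)
        if (u, r) == (target, robots):
            return d
        if d > dist.get((u, r), float("inf")):
            continue  # stale heap entry
        for v, costs in graph.get(u, {}).items():
            nd = d + costs[r - 1]  # cost of moving r robots over this edge
            if nd < dist.get((v, r), float("inf")):
                dist[(v, r)] = nd
                heapq.heappush(heap, (nd, v, r))
    return float("inf")
```

In the full algorithm, additional transitions would spawn sub-formation states (splits) and recombine them (merges), with the label structure tracking which robots travel together.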

5. StM in Dictionary Learning and Matrix Problems

The split-then-merge approach supports scalable learning in high-dimensional settings:

Dictionary learning: (Mukherjee et al., 2014) Large data $Y$ is split into $L$ blocks, each block is used to train a local dictionary; then the local dictionaries are merged into a global dictionary via a second-stage sparse learning step. The methodology reduces memory footprint and per-iteration computational requirements with minimal decrease in representation accuracy and denoising performance, leveraging a compositional sparsity lemma.
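The two-stage structure can be sketched as follows. Truncated SVD deliberately stands in for both the paper's per-block dictionary learning and the second-stage sparse learning step, so this shows only the split/merge data flow, not the actual method:

```python
import numpy as np

def split_then_merge_dictionary(Y, num_blocks, local_atoms, global_atoms, seed=0):
    """Two-stage dictionary learning sketch.

    Split: partition the columns of Y into blocks and learn a small local
    dictionary per block (here: leading left singular vectors).
    Merge: pool all local atoms and learn a global dictionary from them
    (here: another truncated SVD).
    """
    rng = np.random.default_rng(seed)
    cols = rng.permutation(Y.shape[1])
    blocks = np.array_split(cols, num_blocks)
    local = []
    for idx in blocks:
        U, _, _ = np.linalg.svd(Y[:, idx], full_matrices=False)
        local.append(U[:, :local_atoms])       # local dictionary for this block
    pooled = np.concatenate(local, axis=1)     # input to the merge stage
    U, _, _ = np.linalg.svd(pooled, full_matrices=False)
    return U[:, :global_atoms]                 # global dictionary
```

The key point is that each block is processed independently (bounding memory per step), and only the compact local dictionaries, not the raw data, enter the merge stage.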

Generalized eigenvalue problems: (Liu et al., 3 Jul 2025) StM accelerates convergence of first-order optimization for GEP via difference-based objective minimization. At each iteration, a quadratic surrogate is split (curvature majorization) and then merged (closed-form update), enabling line-search-free gradient descent that avoids saddle points and outperforms classical eigenvalue solvers in runtime and stability.

6. StM as a Framework for Evaluation and Markov Processes

6.1 Split-Merge for Comparing Clusterings

The StM framework for clustering comparison (Xiang et al., 2012) models cluster relationships as bipartite graphs, decomposing comparison into split and merge subcomponents. Split subgraphs dissect true clusters; merge subgraphs aggregate predicted clusters. Componentwise scores (e.g., conditional entropy-based) deliver normalized, strictly monotonic measures of similarity with guaranteed [0,1] range, outperforming classical indices (Rand, NMI, etc.) in monotonicity and normalization across divergent clusterings.
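A rough stand-in for such componentwise scores is the pair of conditional entropies computed from paired labelings: one term penalizes true clusters being split across predictions, the other penalizes predictions merging distinct true clusters. This is an illustration of the split/merge decomposition, not the paper's exact bipartite-graph normalization:

```python
import math
from collections import Counter

def conditional_entropy(labels_a, labels_b):
    """H(A | B) in bits for two paired label sequences."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    marg_b = Counter(labels_b)
    h = 0.0
    for (a, b), c in joint.items():
        h -= (c / n) * math.log2(c / marg_b[b])
    return h

def split_merge_scores(true_labels, pred_labels):
    """Return (split error, merge error): H(pred|true) is nonzero when a true
    cluster is split across predictions; H(true|pred) is nonzero when a
    prediction merges several true clusters. Both are 0 for identical
    clusterings."""
    return (conditional_entropy(pred_labels, true_labels),
            conditional_entropy(true_labels, pred_labels))
```

Separating the two directions makes the failure mode legible: a pure over-segmentation has zero merge error but positive split error, and vice versa.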

6.2 Canonical Split-and-Merge Markov Process

The stationary cycle-length process of the random stirring model on the lattice torus converges to the canonical infinite-dimensional Markovian StM process (Ioffe et al., 2019). Splits and merges of cycle blocks are realized via jump rates proportional to the block sizes; the invariant distribution is the Poisson–Dirichlet law $\mathrm{PD}(1)$, identified with stick-breaking or Poisson-point constructions. The process is reversible, and serves as a scaling limit in numerous probabilistic models, including random permutations and quantum spin systems.
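The Poisson–Dirichlet law can be sampled via the stick-breaking construction mentioned above: break off a $\mathrm{Beta}(1,\theta)$ fraction of the remaining stick at each step, then rank the pieces. A minimal sketch, truncating once the residual stick is negligible:

```python
import random

def poisson_dirichlet_sample(theta=1.0, tol=1e-10, rng=random):
    """Approximate sample from PD(theta) via stick-breaking.

    Each step removes a Beta(1, theta) fraction of the remaining stick
    (Uniform(0, 1) when theta = 1, as for PD(1)); ranking the pieces turns
    the stick-breaking (GEM) sequence into a ranked partition.
    """
    pieces, remaining = [], 1.0
    while remaining > tol:
        v = rng.betavariate(1.0, theta)
        pieces.append(v * remaining)
        remaining *= 1.0 - v
    return sorted(pieces, reverse=True)
```

The truncation at `tol` discards only a residual mass below that threshold, so the returned pieces sum to 1 up to the tolerance.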

7. Extensions, Variants, and Applications

StM methods are further extended in hierarchical stochastic block models (Peixoto, 2020), layer-aware generative video composition (Kara et al., 25 Nov 2025), and self-contained consensus reconfiguration protocols (Xiong et al., 21 Apr 2025). The video composition framework employs unsupervised split (foreground/background segmentation) and merge (diffusion-based composition) stages, with specialized losses for identity and harmony preservation; the Raft protocol leverages epoch-based split and merge steps with joint and constituent quorums for cluster reconfiguration, ensuring safety and liveness.

The split-then-merge paradigm’s flexibility enables design of robust, scalable, and interpretable algorithms across clustering, trajectory analysis, large-scale inference, combinatorial optimization, and consensus systems. Theoretical guarantees and empirical benchmarks consistently reveal improved mixing times, stability, and computational efficiency over single-site or monolithic approaches.
