Binary Splitting Algorithms
- Binary splitting algorithms are recursive methods that partition datasets into two subgroups using objective functions to minimize loss or detect changes.
- They are applied in sequential data segmentation, adaptive group testing, and clustering by efficiently selecting split points based on loss reduction or statistical discrepancies.
- The algorithm exhibits near-optimal performance, with candidate-evaluation complexity ranging from O(NK) in the worst case to O(N log K) in the best case, ensuring practical scalability.
Binary splitting algorithms constitute a family of recursive procedures that partition a set, sequence, or population into subgroups via successive binary divisions, often according to an objective function or task-specific criterion. The two most prominent scenarios employing binary splitting are: (1) greedy binary segmentation of sequential data (used in changepoint detection, decision trees, etc.), and (2) combinatorial group testing (identification of defectives in a population with minimal tests). In recent years, binary splitting has further been extended and applied to clustering of functional data as well as to non-adaptive group testing, with algorithmic innovations focused on efficiency, theoretical guarantees, and memory usage (Hocking, 2024, Yao et al., 2023, Price et al., 2020, Chakrabarty et al., 15 Jul 2025).
1. Classical Binary Segmentation in Sequential Data
Binary segmentation operates on a one-dimensional sequence of N ordered observations, iteratively seeking up to K split points to minimize a segment-wise loss (e.g., squared error, Poisson loss). Each iteration selects and splits the current "best" segment, that is, the one yielding the maximal decrease in the loss criterion. To ensure practical segment sizes, a minimum segment length m is enforced throughout.
The algorithm maintains a priority queue of splittable segments, each with its optimal split and associated loss decrease. At each iteration, it:
- Removes the segment with the maximal potential loss decrease.
- Searches for the optimal split point c* for that segment by evaluating all admissible candidate splits (there are t − 2m + 1 of them in a segment of length t).
- Splits the segment at c*, handles its children according to segment length and potential for further splitting, and updates the container.
- Stops when K splits have been performed or no segments remain splittable.
High-level pseudocode:
```
Input: data x[1..N], max splits K, min length m, loss ℓ
Initialize container C = { [1..N] with its best-split score }
I ← 0
while I < K and C not empty:
    (score, segment [a,b]) ← pop_max(C)
    find c* ∈ {a+m … b−m} minimizing ℓ-split-loss([a,b], c)
    create children [a,c*], [c*+1,b]
    for each child of length ≥ 2m:
        compute its best-split loss and insert into C
    I ← I + 1
Output: set of split locations
```
The computational cost is driven by the number of candidate loss evaluations required at each stage (Hocking, 2024).
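A minimal runnable sketch of this procedure, assuming squared-error loss and a max-heap built on Python's `heapq` (function and variable names are illustrative, not taken from the cited implementation):

```python
import heapq
import itertools

def binary_segmentation(x, K, m=1):
    """Greedy binary segmentation with squared-error loss.
    Returns up to K split positions c, each meaning x[..:c] | x[c:..]."""
    N = len(x)
    # Prefix sums of values and squares give O(1) segment costs.
    S = [0.0] * (N + 1)
    S2 = [0.0] * (N + 1)
    for i, v in enumerate(x):
        S[i + 1] = S[i] + v
        S2[i + 1] = S2[i] + v * v

    def cost(a, b):  # sum of squared residuals of x[a:b] around its mean
        s = S[b] - S[a]
        return (S2[b] - S2[a]) - s * s / (b - a)

    def best_split(a, b):
        # Admissible splits keep both children at length >= m:
        # c in {a+m, ..., b-m}, i.e. t - 2m + 1 candidates for length t.
        best = None
        for c in range(a + m, b - m + 1):
            dec = cost(a, b) - cost(a, c) - cost(c, b)
            if best is None or dec > best[0]:
                best = (dec, c)
        return best

    tie = itertools.count()          # tie-breaker for heap comparisons
    heap = []                        # max-heap: store negated loss decrease
    first = best_split(0, N)
    if first is not None:
        heapq.heappush(heap, (-first[0], next(tie), 0, N, first[1]))

    splits = []
    while len(splits) < K and heap:
        _, _, a, b, c = heapq.heappop(heap)
        splits.append(c)
        for u, v in ((a, c), (c, b)):
            if v - u >= 2 * m:       # child is still splittable
                s = best_split(u, v)
                heapq.heappush(heap, (-s[0], next(tie), u, v, s[1]))
    return sorted(splits)
```

With m = 1 and squared-error loss this recovers classic mean-shift segmentation; swapping in another loss (e.g., Poisson) only changes `cost`.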
2. Complexity Analysis and Bounds
Binary segmentation's computational footprint is characterized by the number of candidate split-point evaluations. With minimum segment length m, let s(t) = t − 2m + 1 denote the number of admissible splits in a segment of length t.
- Worst-case complexity: when splits are constantly unbalanced, each split peels off a near-minimal segment, lengths shrink only slowly, and the evaluations sum to O(NK).
- Best-case complexity: achievable when splits are perfectly balanced, so that the 2^ℓ segments at depth ℓ of the split tree together contribute roughly N evaluations per level, yielding O(N log K) asymptotically.
- Space complexity: balanced splitting leaves up to O(K) splittable segments in the container at once; fully unbalanced splitting stores only O(1), since each split produces one child too short to split again.
Dynamic programming can be used to compute exact finite-sample counts, especially for benchmarking and empirical performance evaluation.
Synthetic sequences constructed to induce maximally balanced or unbalanced splits are used to validate tightness of the complexity analysis (Hocking, 2024).
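The balanced/unbalanced dichotomy can be checked with a small counting model (a sketch under the idealized assumptions that splitting a segment of length t costs s(t) = t − 2m + 1 evaluations and that the longest segment is always split next):

```python
def count_evals(N, K, m=1, balanced=True):
    """Idealized count of candidate-split evaluations for K splits of N
    points: balanced splits halve each segment; unbalanced splits peel
    off a minimal segment of length m each time."""
    evals = 0
    segs = [N]                      # lengths of splittable segments
    for _ in range(K):
        if not segs:
            break
        segs.sort()
        t = segs.pop()              # split the longest segment
        evals += t - 2 * m + 1      # s(t) candidate evaluations
        children = (t // 2, t - t // 2) if balanced else (m, t - m)
        segs.extend(c for c in children if c >= 2 * m)
    return evals
```

For N = 1024 and K = 255 this model gives 7937 evaluations when balanced (on the order of N log₂ K = 8192) versus 228480 when unbalanced (on the order of NK = 261120), matching the two asymptotic regimes.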
3. Binary Splitting in Adaptive and Non-Adaptive Group Testing
In group testing, binary splitting (often called the "binary splitting algorithm," or BSA) seeks to identify all d defective items from a population of n items using as few group tests as possible. Each test returns positive if the group contains at least one defective, and negative otherwise.
Binary Splitting (Adaptive):
- At each round, split the current candidate set into two halves and test one half.
- If negative, discard that half; if positive, recurse on that half.
- Once a singleton is isolated and tested positive, it is declared defective and removed.
- The sequence repeats until all defectives are found.
Pseudocode outline:
```
Procedure BINARY_SPLIT(A):
    DEFECTIVES ← ∅
    REMAIN ← A
    while REMAIN ≠ ∅ do
        X ← REMAIN
        while |X| > 1 do
            split X into X1, X2 of roughly equal size
            if TEST(X1) == POSITIVE then
                X ← X1
            else
                X ← X2
        end while
        if TEST(X) == POSITIVE then
            DEFECTIVES ← DEFECTIVES ∪ X
        end if
        REMAIN ← REMAIN \ (DEFECTIVES ∪ {tested negatives})
    end while
    return DEFECTIVES
```
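A runnable version of this outline, assuming a per-item oracle `is_defective` (the group test is simulated as an OR over the group; names are illustrative). Here negative halves are removed from the pool, and the isolated singleton needs no extra test because the search invariant guarantees it is defective:

```python
def binary_split_group_test(items, is_defective):
    """Adaptive binary splitting: test the remaining pool; while it is
    positive, binary-search one defective, discarding proven-clean halves.
    Returns (defectives, number_of_tests)."""
    def test(group):                 # positive iff any defective present
        return any(is_defective(i) for i in group)

    n_tests = 0
    defectives = []
    remain = list(items)
    while remain:
        n_tests += 1
        if not test(remain):         # pool proven clean: done
            break
        cleared = []                 # items proven defective-free
        lo, hi = 0, len(remain)      # invariant: remain[lo:hi] has a defective
        while hi - lo > 1:
            mid = (lo + hi) // 2
            n_tests += 1
            if test(remain[lo:mid]):
                hi = mid             # defective lies in the tested half
            else:
                cleared.extend(remain[lo:mid])
                lo = mid             # tested half is clean
        defectives.append(remain[lo])
        drop = set(cleared) | {remain[lo]}
        remain = [i for i in remain if i not in drop]
    return defectives, n_tests
```

Each defective costs roughly ⌈log₂ n⌉ tests plus one pool test, consistent with the O(d log(n/d)) bound up to constants.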
Key properties:
- Test complexity: roughly d log₂(n/d) + O(d) tests; that is, O(d log(n/d)).
- The number of stages equals the number of tests, as the algorithm is fully sequential.
- Theoretical guarantee: identifies all defectives with zero error, without requiring prior knowledge of d.
- Limiting factor: the number of stages grows linearly with the number of tests, so the adaptivity depth can reach O(d log(n/d)) (Yao et al., 2023).
Group Testing Extensions:
Non-adaptive binary splitting-inspired algorithms have been developed, leveraging randomized test design and hierarchical decoding to achieve O(d log n) scaling in test count and decoding time, with small probability of error. Storage-optimized versions use hashed group assignments at each recursive level (Price et al., 2020).
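For intuition about the non-adaptive setting, here is a minimal baseline using a random Bernoulli design with COMP-style decoding ("rule out anything seen in a negative test"). This is a deliberate simplification for illustration, not the hierarchical hashed scheme of Price et al.; the test-count formula and names are assumptions:

```python
import math
import random

def nonadaptive_group_test(n, defectives, d_est, seed=0):
    """Non-adaptive baseline: T random tests, each including every item
    independently with probability ~1/d_est; any item appearing in a
    negative test is ruled out.  Returns the surviving candidate set."""
    rng = random.Random(seed)
    T = int(3 * d_est * math.log(n)) + 1   # O(d log n) tests
    ruled_out = set()
    for _ in range(T):
        group = [i for i in range(n) if rng.random() < 1.0 / d_est]
        if not (set(group) & defectives):  # negative test: whole group clean
            ruled_out.update(group)
    return [i for i in range(n) if i not in ruled_out]
```

By construction a true defective never appears in a negative test, so the decoder can only err by keeping a few false positives; all tests are fixed in advance, so the adaptivity depth is 1.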
4. Recursive Binary Splitting in Clustering and Functional Data Analysis
For data clustering, recursive binary splitting has been adapted to partition sets of observations into clusters such that inter-cluster separation is maximized under a criterion such as the Maximum Mean Discrepancy (MMD).
- At each step, the current set is divided into two via a greedy binary-splitting procedure optimizing a size-weighted MMD of the form (n₁n₂/(n₁+n₂))·MMD²(S₁, S₂), where MMD denotes the kernel maximum mean discrepancy and n₁, n₂ are the sizes of subsets S₁ and S₂.
- When the number of clusters is unknown, the algorithm applies a data-driven "single cluster check" (SCC) at each division to determine whether ongoing splitting is warranted, based on the structure of the separation curve.
- For a specified number of clusters k, the algorithm alternates splitting and merging (based on MMD distances) to stabilize at precisely k clusters.
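The greedy weighted-MMD split can be sketched as follows for ordered one-dimensional samples, using a Gaussian (RBF) kernel and the size weighting n₁n₂/(n₁+n₂); the kernel choice, weighting, and names are illustrative assumptions, not the exact CURBS procedure:

```python
import math

def rbf(a, b, gamma=1.0):
    return math.exp(-gamma * (a - b) ** 2)

def mmd2(X, Y, gamma=1.0):
    """Biased (V-statistic) estimate of squared kernel MMD between X and Y."""
    def k_mean(A, B):
        return sum(rbf(a, b, gamma) for a in A for b in B) / (len(A) * len(B))
    return k_mean(X, X) + k_mean(Y, Y) - 2.0 * k_mean(X, Y)

def best_weighted_mmd_split(x, min_size=2, gamma=1.0):
    """Greedy binary split of the ordered sample x maximizing the
    size-weighted MMD (n1*n2/(n1+n2)) * MMD^2 over contiguous cuts."""
    best_score, best_cut = -1.0, None
    for c in range(min_size, len(x) - min_size + 1):
        n1, n2 = c, len(x) - c
        score = (n1 * n2 / (n1 + n2)) * mmd2(x[:c], x[c:], gamma)
        if score > best_score:
            best_score, best_cut = score, c
    return best_cut, best_score
```

The weighting penalizes very lopsided cuts, so a large MMD between a tiny fragment and the rest does not automatically win over a balanced, well-separated cut.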
Theoretical results in the oracle regime (component laws known) establish that, under mild regularity, the recursive binary splitting algorithm achieves perfect clustering (unknown number of clusters, CURBS-I) or the perfect order preserving (POP) property (fixed k, CURBS-II), at explicitly quantified computational cost (Chakrabarty et al., 15 Jul 2025).
5. Practical and Empirical Performance
Empirical studies on large real-world genomic datasets demonstrate that, when deployed in binary segmentation for changepoint detection, the observed scaling in candidate split computations is O(N log K), tightly matching the best-case theoretical behavior. This phenomenon persists even on sequences with highly heterogeneous structure and varying segment lengths. On synthetic "hard" instances constructed to realize worst-case complexity, empirical results confirm the tightness of the O(NK) upper bound. For group testing, binary splitting algorithms consistently match or approach information-theoretic lower bounds in the sparse regime (d ≪ n) (Hocking, 2024, Yao et al., 2023).
For clustering, recursive binary splitting using MMD distances yields near-perfect or perfect identification of clusters in simulated and real functional datasets across a range of models (location and scale differences), outperforming previous state-of-the-art competitors (Chakrabarty et al., 15 Jul 2025).
6. Algorithmic Variants and Comparison
Key algorithmic distinctions and variants observed in the literature include:
| Domain | Loss/objective | Adaptivity | Test/Comp. Complexity | Guarantees |
|---|---|---|---|---|
| Seq. data, changepoints | Segment-wise loss | Sequential | O(NK) worst / O(N log K) best | Empirically near best case |
| Group testing (adaptive) | Defective recovery | Fully adaptive | O(d log(n/d)) tests and stages | Zero error, unknown d |
| Group testing (non-adaptive) | Defective recovery | Non-adaptive | O(d log n) tests and decoding | Small error, low storage |
| Clustering (MMD) | Max. weighted MMD separation | Recursive | Quantified in oracle regime | Perfect clustering / POP (fixed k) |
Best-case complexity arises when splits are balanced (e.g., data with central ties or block-structured separations), while worst-case complexity occurs for highly imbalanced or adversarial data (Hocking, 2024, Yao et al., 2023, Chakrabarty et al., 15 Jul 2025). In group testing, newer algorithms such as diagonal splitting (DSA) achieve information-theoretically optimal test counts while substantially reducing adaptivity depth, at the cost of a small increase in total tests compared to BSA (Yao et al., 2023).
7. Theoretical and Practical Considerations
Binary splitting offers strong theoretical guarantees, often achieving near-optimal complexity in both synthetic and practical settings. Its adaptivity, greedy nature, and tree-structured behavior underpin its ubiquity in segmentation, group testing, and clustering tasks. The algorithm handles arbitrary loss functions and leverages priority containers or queues to efficiently manage split candidates.
Empirically, binary splitting frequently realizes best-case runtimes, with rare occurrences of worst-case scaling. Extensions leveraging hashing, randomized test assignment, and kernel-based splitting have broadened applicability to high-dimensional, functional, and streaming data domains.
For all main applications, no knowledge of the number of target segments, defectives, or clusters is required a priori, and the algorithms adapt online to data structure and complexity (Hocking, 2024, Yao et al., 2023, Price et al., 2020, Chakrabarty et al., 15 Jul 2025).