Non-Binary Dendrogram Algorithm
- Non-binary dendrogram algorithms are clustering methods that merge all tied clusters simultaneously, yielding unique and reproducible hierarchical structures.
- They extend classical Lance–Williams formulas to accommodate group merges and record fusion intervals, which faithfully capture within-cluster variability.
- These methods are widely applied in complex network analysis and genetic profiling to resolve ambiguities in data with many equidistant or tied observations.
Non-binary dendrogram algorithms, also referred to as variable-group agglomerative hierarchical clustering or multidendrograms, generalize classical hierarchical clustering by allowing the simultaneous merging of more than two clusters when ties occur in proximity data. Unlike pair-group (binary) methods, which merge one cluster pair per step and arbitrarily break ties, non-binary approaches construct unique, deterministic hierarchical structures even in the presence of multiple equidistant cluster pairs. This property is essential for reliable analysis of data sets—such as complex networks or genetic profiles—where discrete distance or similarity values induce frequent ties, resolving significant ambiguities inherent to standard methods (Gomez et al., 2012, Gomez et al., 2014, Fernández et al., 2023).
1. Mathematical Definition and General Formulation
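Let $\Omega = \{x_1, \ldots, x_n\}$ be the set of objects to be clustered. At each stage, the current partition consists of clusters $X_1, \ldots, X_k$, represented by a symmetric dissimilarity matrix $D$ with $D(X_i, X_j) = D(X_j, X_i) \ge 0$ and $D(X_i, X_i) = 0$.
The core innovation in non-binary algorithms is that for any set of clusters $\{X_i\}_{i \in I}$ that are mutually tied at the minimum inter-cluster distance, all are merged simultaneously into a supercluster $X_I = \bigcup_{i \in I} X_i$. For any two multi-indexed clusters $X_I$ and $X_J$, the inter-cluster distance is generalized via the “variable-group” Lance–Williams formula (Gomez et al., 2012, Fernández et al., 2023):
$$
\begin{aligned}
D(X_I, X_J) ={} & \sum_{i \in I} \sum_{j \in J} \alpha_{ij}\, D(X_i, X_j) + \sum_{i \in I} \sum_{\substack{i' \in I \\ i' > i}} \beta_{ii'}\, D(X_i, X_{i'}) + \sum_{j \in J} \sum_{\substack{j' \in J \\ j' > j}} \beta_{jj'}\, D(X_j, X_{j'}) \\
& + \delta \sum_{i \in I} \sum_{j \in J} \gamma_{ij} \left[ D_\max(X_I, X_J) - D(X_i, X_j) \right] \\
& - (1 - \delta) \sum_{i \in I} \sum_{j \in J} \gamma_{ij} \left[ D(X_i, X_j) - D_\min(X_I, X_J) \right]
\end{aligned}
$$
Here, $D_\min(X_I, X_J) = \min_{i \in I} \min_{j \in J} D(X_i, X_j)$ and $D_\max(X_I, X_J) = \max_{i \in I} \max_{j \in J} D(X_i, X_j)$ are, respectively, the minimal and maximal distances between the constituents of the two superclusters, as appropriate for the linkage definition. The parameters $\alpha_{ij}$, $\beta_{ii'}$, $\gamma_{ij}$, and $\delta$ are specified to recover the standard linkage criteria (single, complete, UPGMA, WPGMA, centroid, Ward) when only binary merges are performed.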
2. Algorithmic Procedure and Pseudocode
The non-binary (variable-group) agglomerative clustering proceeds iteratively:
- Initialization: Each object $x_i$ forms its own cluster $X_i = \{x_i\}$.
- Find global minimum: Compute $D_\mathrm{lower} = \min_{i < j} D(X_i, X_j)$.
- Detect ties: Identify all cluster pairs at distance $D_\mathrm{lower}$, construct a graph with an edge for each such pair, and find its connected components (tie components).
- Merge tied clusters: Merge all clusters in each connected component simultaneously into a supercluster. Components consisting of single clusters remain unmerged.
- Record fusion intervals: For each merge, record the fusion interval $[D_\mathrm{lower}, D_\max(X_I)]$, representing the spread of distances within the merged set immediately prior to fusion.
- Update distances: For each new supercluster and each remaining cluster, compute updated distances using the generalized Lance–Williams recurrence.
- Iterate: Repeat from the "Find global minimum" step until only one cluster remains.
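The tie-detection step can be sketched in Python as follows. This is a minimal illustration rather than code from any of the cited packages: the function name `tie_components`, the list-of-lists matrix layout, and the `tol` tolerance parameter are all assumptions introduced here. Connected components are found with a small union-find structure.

```python
from itertools import combinations


def tie_components(dist, tol=0.0):
    """Return (d_min, components) for a symmetric distance matrix.

    components lists the groups of cluster indices that are connected
    through pairs lying at the minimum distance. tol is a hypothetical
    tolerance for deciding that two distances are tied.
    """
    n = len(dist)
    d_min = min(dist[i][j] for i, j in combinations(range(n), 2))

    parent = list(range(n))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Each tied pair is an edge of the tie graph; union its endpoints.
    for i, j in combinations(range(n), 2):
        if dist[i][j] - d_min <= tol:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    # Singleton components are left unmerged, so they are not reported.
    return d_min, [g for g in groups.values() if len(g) > 1]
```

With distances `d(0,1) = d(1,2) = 1` and `d(0,2) = 2`, clusters 0 and 2 are not tied directly but fall into the same component through cluster 1, so all three would be merged together.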
A compact pseudocode representation (Fernández et al., 2023):
```
Input: distance matrix D among n objects.
clusters = [{x1}, ..., {xn}]
while len(clusters) > 1:
    d_min = min_{A≠B in clusters} D(A, B)
    TieGraph = graph on clusters with an edge (A, B) wherever D(A, B) == d_min
    for each connected component M of TieGraph:
        if len(M) > 1:
            supercluster = union(M)
            # intervals use the distances as they were before this iteration's update
            fusion_interval = [min_{A≠B ∈ M} D(A, B), max_{A≠B ∈ M} D(A, B)]
            remove every member of M from clusters
            add supercluster to clusters
    recompute all inter-cluster distances via the chosen linkage formula
```
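The pseudocode can be turned into a small runnable sketch. The Python version below is an illustration only, not the MultiDendrograms or mdendro implementation: the function name `multidendrogram`, the tuple layout of the result, and the restriction to single and complete linkage (passed as `min` or `max`) are assumptions made for brevity. Ties are detected by exact equality, which suits integer or coarse distances; real-valued data would likely need a precision tolerance.

```python
from itertools import combinations


def _find(parent, c):
    """Return the representative of c's tie component (with path compression)."""
    while parent[c] != c:
        parent[c] = parent[parent[c]]
        c = parent[c]
    return c


def multidendrogram(dist, linkage=max):
    """Variable-group clustering for complete (max) or single (min) linkage.

    Returns a list of merges (sorted member objects, lower, upper), where
    [lower, upper] is the fusion interval of the merge.
    """
    n = len(dist)
    clusters = {i: frozenset([i]) for i in range(n)}  # cluster id -> objects
    D = {frozenset((i, j)): dist[i][j] for i, j in combinations(range(n), 2)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        d_min = min(D.values())

        # Tie graph: union the endpoints of every pair at the minimum distance.
        parent = {c: c for c in clusters}
        for pair, d in D.items():
            if d == d_min:
                a, b = pair
                parent[_find(parent, a)] = _find(parent, b)
        groups = {}
        for c in clusters:
            groups.setdefault(_find(parent, c), []).append(c)

        # Merge every component with more than one cluster.
        new_clusters, constituents = {}, {}
        for members in groups.values():
            if len(members) > 1:
                upper = max(D[frozenset(p)] for p in combinations(members, 2))
                objects = frozenset().union(*(clusters[c] for c in members))
                merges.append((sorted(objects), d_min, upper))
                new_id, next_id = next_id, next_id + 1
            else:
                new_id, objects = members[0], clusters[members[0]]
            new_clusters[new_id] = objects
            constituents[new_id] = members

        # Update distances from the pre-merge distances of the constituents.
        new_D = {}
        for (p, mp), (q, mq) in combinations(constituents.items(), 2):
            new_D[frozenset((p, q))] = linkage(
                D[frozenset((a, b))] for a in mp for b in mq)
        clusters, D = new_clusters, new_D
    return merges
```

For the matrix `[[0, 1, 2, 5], [1, 0, 1, 5], [2, 1, 0, 5], [5, 5, 5, 0]]`, the first three objects form one tie component and merge in a single step with fusion interval `[1, 2]`; the fourth object then joins at height 5.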
3. Linkage Criteria and Their Variable-Group Extensions
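Common linkage methods, defined for binary merges, extend naturally to variable-group merges. The following table summarizes the formula for computing $D(X_I, X_J)$ for two superclusters $X_I$ and $X_J$, where $\lvert X_i \rvert$ is the number of objects in $X_i$, $\lvert I \rvert$ is the number of clusters merged into $X_I$, and sums over $i < i'$ and $j < j'$ run over pairs of constituents within $X_I$ and within $X_J$, respectively:
| Linkage Method | Formula for $D(X_I, X_J)$ | Notes |
|---|---|---|
| Single | $\min_{i \in I} \min_{j \in J} D(X_i, X_j)$ | Minimum distance |
| Complete | $\max_{i \in I} \max_{j \in J} D(X_i, X_j)$ | Maximum distance |
| UPGMA | $\frac{1}{\lvert X_I \rvert \lvert X_J \rvert} \sum_{i \in I} \sum_{j \in J} \lvert X_i \rvert \lvert X_j \rvert D(X_i, X_j)$ | Unweighted average: every object counts equally |
| WPGMA | $\frac{1}{\lvert I \rvert \lvert J \rvert} \sum_{i \in I} \sum_{j \in J} D(X_i, X_j)$ | Weighted average: every constituent cluster counts equally |
| Ward | $\frac{1}{\lvert X_I \rvert + \lvert X_J \rvert} \Big[ \sum_{i \in I} \sum_{j \in J} (\lvert X_i \rvert + \lvert X_j \rvert) D(X_i, X_j) - \frac{\lvert X_J \rvert}{\lvert X_I \rvert} \sum_{i < i'} (\lvert X_i \rvert + \lvert X_{i'} \rvert) D(X_i, X_{i'}) - \frac{\lvert X_I \rvert}{\lvert X_J \rvert} \sum_{j < j'} (\lvert X_j \rvert + \lvert X_{j'} \rvert) D(X_j, X_{j'}) \Big]$ | Centroid-based; assumes squared Euclidean distances |
| Centroid | $\frac{1}{\lvert X_I \rvert \lvert X_J \rvert} \sum_{i \in I} \sum_{j \in J} \lvert X_i \rvert \lvert X_j \rvert D(X_i, X_j) - \frac{1}{\lvert X_I \rvert^2} \sum_{i < i'} \lvert X_i \rvert \lvert X_{i'} \rvert D(X_i, X_{i'}) - \frac{1}{\lvert X_J \rvert^2} \sum_{j < j'} \lvert X_j \rvert \lvert X_{j'} \rvert D(X_j, X_{j'})$ | Distance between cluster means; assumes squared Euclidean distances |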
A multidendrogram may also be computed using the parametric $\beta$-flexible and versatile linkages (Fernández et al., 2023), which generalize the update rules to accommodate group merges.
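As an illustration of how the simpler linkages extend to group merges, the sketch below computes the distance between two superclusters from the distances between their constituents. The function name `supercluster_distance` and its argument layout are hypothetical. Only the four linkages that depend solely on cross-distances are covered; Ward and centroid linkage also need the within-supercluster distances and are omitted here.

```python
def supercluster_distance(cross, sizes_i, sizes_j, method):
    """Distance between superclusters X_I and X_J.

    cross[a][b] is the distance between constituent a of X_I and
    constituent b of X_J; sizes_i and sizes_j hold the number of
    objects in each constituent.
    """
    # Pair every cross-distance with the product of the constituent sizes.
    pairs = [(cross[a][b], sizes_i[a] * sizes_j[b])
             for a in range(len(sizes_i))
             for b in range(len(sizes_j))]
    if method == "single":
        return min(d for d, _ in pairs)
    if method == "complete":
        return max(d for d, _ in pairs)
    if method == "upgma":
        # Size-weighted mean, so that every object counts equally.
        return sum(d * w for d, w in pairs) / sum(w for _, w in pairs)
    if method == "wpgma":
        # Plain mean, so that every constituent cluster counts equally.
        return sum(d for d, _ in pairs) / len(pairs)
    raise ValueError(f"unsupported linkage: {method}")
```

For a single-object cluster at distances 2 and 4 from two constituents of sizes 3 and 1, UPGMA gives (3·2 + 1·4)/4 = 2.5 while WPGMA gives (2 + 4)/2 = 3.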
4. Handling of Ties and Fusion Intervals
Non-binary dendrogram algorithms explicitly address ties in proximity data. All clusters that are part of a connected "tie component" in the tie graph are merged together. The procedure records the fusion interval $[D_\mathrm{lower}, D_\max(X_I)]$ for every non-binary merge, capturing the extent of internal dissimilarity within the fused group. This interval is visually represented as a band at the corresponding dendrogram node.
Standard binary algorithms obscure the presence of ties, breaking them arbitrarily and thus producing different dendrograms for the same data depending on input order or tie-breaking rule. In contrast, multidendrograms guarantee uniqueness for a given distance matrix and linkage rule (Gomez et al., 2012, Gomez et al., 2014). Fusion intervals maintain information about within-group heterogeneity at merge events.
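The non-uniqueness of pair-group clustering can be reproduced on three objects. The sketch below is a deliberately naive binary complete-linkage routine written for this illustration (the function name and the `pick_last` switch are hypothetical); it breaks ties by taking either the first or the last minimal pair and returns the resulting tree as nested tuples.

```python
from itertools import combinations


def binary_complete_linkage(dist, labels, pick_last=False):
    """Pair-group complete linkage with an explicit tie-breaking rule."""

    def d(a, b):
        # Complete linkage: largest object-to-object distance.
        return max(dist[i][j] for i in a[0] for j in b[0])

    # Each cluster is (tuple of object indices, tree built so far).
    clusters = [((i,), labels[i]) for i in range(len(labels))]
    while len(clusters) > 1:
        pairs = list(combinations(range(len(clusters)), 2))
        d_min = min(d(clusters[p], clusters[q]) for p, q in pairs)
        tied = [(p, q) for p, q in pairs
                if d(clusters[p], clusters[q]) == d_min]
        # Arbitrary tie-breaking: exactly one tied pair is merged.
        p, q = tied[-1] if pick_last else tied[0]
        merged = (clusters[p][0] + clusters[q][0],
                  (clusters[p][1], clusters[q][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return clusters[0][1]


dist = [[0, 1, 2],
        [1, 0, 1],
        [2, 1, 0]]
tree_first = binary_complete_linkage(dist, "abc")
tree_last = binary_complete_linkage(dist, "abc", pick_last=True)
```

Here `tree_first` groups `a` with `b` before adding `c`, whereas `tree_last` groups `b` with `c` before adding `a`. A multidendrogram would instead merge all three objects in one step, since `a`–`b` and `b`–`c` are both at the minimum distance 1, and record the fusion interval `[1, 2]`.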
5. Computational Complexity and Practical Considerations
The variable-group algorithm requires, for each clustering step:
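- an $O(n^2)$ search for the minimum current distance among clusters,
- $O(\lvert V \rvert + \lvert E \rvert)$ time to find connected components in the tie graph, with $\lvert V \rvert \le n$ vertices and $\lvert E \rvert$ tied pairs,
- $O(\lvert I \rvert \lvert J \rvert + \lvert I \rvert^2 + \lvert J \rvert^2)$ time per supercluster distance update via the generalized Lance–Williams recurrence, where $\lvert I \rvert$ and $\lvert J \rvert$ count the constituent clusters of the two superclusters involved.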
Overall, the worst-case complexity matches standard agglomerative clustering at $O(n^3)$ time and $O(n^2)$ storage for $n$ objects (Gomez et al., 2012, Gomez et al., 2014). Optimizations such as priority queues, nearest-neighbor chains, and condensed storage arrays reduce practical computation, as demonstrated in implementations like MultiDendrograms (Java/C++) and the mdendro R package (Fernández et al., 2023). Empirical performance is frequently observed as time and space.
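Condensed storage keeps only the upper triangle of the symmetric distance matrix in a flat array, roughly halving memory. The sketch below shows one common row-major layout; the helper names are hypothetical, and the cited implementations may use a different layout.

```python
def to_condensed(dist):
    """Flatten the strict upper triangle of a square matrix, row by row."""
    n = len(dist)
    return [dist[i][j] for i in range(n) for j in range(i + 1, n)]


def condensed_index(i, j, n):
    """Position of the pair (i, j), i != j, in the condensed array."""
    if i == j:
        raise ValueError("the diagonal is not stored")
    if i > j:
        i, j = j, i
    # Rows 0..i-1 hold (n-1) + (n-2) + ... + (n-i) entries in total.
    return n * i - i * (i + 1) // 2 + (j - i - 1)
```

For $n$ objects the condensed array holds $n(n-1)/2$ entries instead of $n^2$.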
6. Application Examples and Visualization
Non-binary dendrogram methodology is particularly significant for data sets with discrete or coarse distance/similarity values, common in biological, chemical, or network contexts (Gomez et al., 2014). An example based on the soils data of Morgan (1995), discussed in Gomez et al. (2012), illustrates how traditional complete linkage clustering yields ambiguous binary dendrograms because of a three-way tie; multidendrograms resolve this by merging all three soils at the same height, producing a ternary node with a fusion interval.
Visualization of multidendrograms distinctly marks non-binary nodes and fusion intervals with horizontal or banded bars, accurately portraying simultaneous mergers and internal heterogeneity. This faithful display of the data's clustering structure reduces ambiguity and enhances interpretability.
7. Advantages, Interpretations, and Summary
Non-binary agglomerative clustering with multidendrograms provides several technical advantages:
- Deterministic output: Each distance matrix and linkage rule produces a unique dendrogram; no tie-breaking ambiguity.
- Faithful encoding of ties: Non-binary nodes denote true proximity structure; fusion intervals reflect heterogeneity.
- Generalization of all classical linkages: The variable-group Lance–Williams framework accommodates standard and parametric linkage rules.
- Improved interpretability: Especially for data with many ties (e.g., network modularity, genetic marker distances), the multidendrogram approach is critical for correct analysis (Fernández et al., 2023).
- Software support: Available in MultiDendrograms (Java/C++) and mdendro (R), with extended linkage methods and diagnostic measures such as the cophenetic correlation and the agglomerative coefficient.
A plausible implication is that in domains where the distance or similarity matrix exhibits frequent ties, the non-binary dendrogram algorithm should be preferred to ensure reproducibility and to avoid artifacts from arbitrary hierarchical splitting. This methodology has been instrumental in structural pattern analysis of complex systems (Gomez et al., 2014) and is applicable whenever non-uniqueness in hierarchical clustering poses challenges for scientific interpretation or inference.