Non-Binary Dendrogram Algorithm

Updated 10 November 2025
  • Non-binary dendrogram algorithms are clustering methods that merge all tied clusters simultaneously, yielding unique and reproducible hierarchical structures.
  • They extend classical Lance–Williams formulas to accommodate group merges and record fusion intervals, which faithfully capture within-cluster variability.
  • These methods are widely applied in complex network analysis and genetic profiling to resolve ambiguities in data with many equidistant or tied observations.

Non-binary dendrogram algorithms, also referred to as variable-group agglomerative hierarchical clustering or multidendrograms, generalize classical hierarchical clustering by allowing the simultaneous merging of more than two clusters when ties occur in proximity data. Unlike pair-group (binary) methods, which merge one cluster pair per step and arbitrarily break ties, non-binary approaches construct unique, deterministic hierarchical structures even in the presence of multiple equidistant cluster pairs. This property is essential for reliable analysis of data sets—such as complex networks or genetic profiles—where discrete distance or similarity values induce frequent ties, resolving significant ambiguities inherent to standard methods (Gomez et al., 2012, Gomez et al., 2014, Fernández et al., 2023).

1. Mathematical Definition and General Formulation

Let $\Omega = \{x_1, x_2, \ldots, x_n\}$ be the set of $n$ objects to be clustered. At each stage, the current partition consists of clusters $\{X_1, \ldots, X_m\}$, represented by a symmetric dissimilarity matrix $D(X_i, X_j)$ with $D(X_i, X_j) = D(X_j, X_i)$ and $D(X_i, X_i) = 0$.

The core innovation in non-binary algorithms is that for any set of clusters $\{X_{i_1}, \ldots, X_{i_p}\}$ that are mutually tied at the minimum inter-cluster distance, all are merged simultaneously into a supercluster $X_I = \bigcup_{i\in I} X_i$. For any two multi-indexed clusters $X_I$ and $X_J$, the inter-cluster distance is generalized via the “variable-group” Lance–Williams formula (Gomez et al., 2012, Fernández et al., 2023):

$$
\begin{aligned}
D(X_I, X_J) ={}& \sum_{i\in I}\sum_{j\in J} \alpha_{ij}\, D(X_i, X_j) + \sum_{i<i' \in I} \beta_{ii'}\, D(X_i, X_{i'}) + \sum_{j<j' \in J} \beta_{jj'}\, D(X_j, X_{j'}) \\
&+ \delta \sum_{i\in I}\sum_{j\in J} \gamma_{ij}\,[D_{\max}(X_I, X_J) - D(X_i, X_j)] \\
&- (1-\delta) \sum_{i\in I}\sum_{j\in J} \gamma_{ij}\,[D(X_i, X_j) - D_{\min}(X_I, X_J)]
\end{aligned}
$$

Here, $D_{\min}$ and $D_{\max}$ are, respectively, the minimal and maximal inter-object distances as appropriate for the linkage definition. The parameters $\alpha_{ij}$, $\beta_{ii'}$, $\gamma_{ij}$, and $\delta$ are specified to recover the standard linkage criteria (single, complete, UPGMA, WPGMA, centroid, Ward) when only binary merges are performed.
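
As a sanity check on the formula, the following short derivation shows one parameter choice under which it collapses to single and complete linkage. It assumes that $D_{\min}(X_I, X_J)$ and $D_{\max}(X_I, X_J)$ are the smallest and largest of the values $D(X_i, X_j)$ with $i \in I$, $j \in J$, and it takes $\alpha_{ij} = \gamma_{ij} = 1/(|I||J|)$ with all $\beta$ terms set to zero. This choice is worked out from the formula itself for illustration; the cited papers may tabulate the parameters differently.

```latex
\begin{aligned}
\delta = 0:\quad D(X_I, X_J)
  &= \frac{1}{|I||J|}\sum_{i\in I}\sum_{j\in J} D(X_i, X_j)
   - \frac{1}{|I||J|}\sum_{i\in I}\sum_{j\in J}\bigl[D(X_i, X_j) - D_{\min}(X_I, X_J)\bigr]
   = D_{\min}(X_I, X_J), \\
\delta = 1:\quad D(X_I, X_J)
  &= \frac{1}{|I||J|}\sum_{i\in I}\sum_{j\in J} D(X_i, X_j)
   + \frac{1}{|I||J|}\sum_{i\in I}\sum_{j\in J}\bigl[D_{\max}(X_I, X_J) - D(X_i, X_j)\bigr]
   = D_{\max}(X_I, X_J).
\end{aligned}
```

In both cases the averaged $D(X_i, X_j)$ terms cancel, leaving only the extreme value selected by $\delta$.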

2. Algorithmic Procedure and Pseudocode

The non-binary (variable-group) agglomerative clustering proceeds iteratively:

  1. Initialization: Each object $x_i$ forms its own cluster $X_i$.
  2. Find global minimum: Compute $D_\mathrm{lower} = \min_{i<j} D(X_i, X_j)$.
  3. Detect ties: Identify all cluster pairs $(X_i, X_j)$ at $D_\mathrm{lower}$, construct a graph with edges for each such pair, and find connected components (tie components).
  4. Merge tied clusters: Merge all clusters in each connected component simultaneously into a supercluster. Components consisting of single clusters remain unmerged.
  5. Record fusion intervals: For each merge, record the fusion interval $[D_\mathrm{lower}, D_\max(X_I)]$, representing the spread of distances within the merged set immediately prior to fusion.
  6. Update distances: For each new supercluster $X_I$ and each remaining cluster, compute updated distances using the generalized Lance–Williams recurrence.
  7. Iterate: Repeat steps 2–6 until only one cluster remains.
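
Steps 2 to 4 hinge on finding the connected components of the tie graph. A minimal Python sketch of that step is given here, using a small union-find structure. The function name `tie_components` and the `tol` parameter (a tolerance for deciding when two floating-point distances count as tied) are illustrative assumptions, not part of the cited algorithm descriptions.

```python
from itertools import combinations


def tie_components(dist, tol=0.0):
    """Group cluster indices into connected components of the tie graph.

    dist: square, symmetric list of lists with inter-cluster distances.
    tol:  hypothetical tolerance for treating a distance as tied with the minimum.
    Returns (d_lower, components); only components with more than one
    cluster are listed, since singletons are not merged.
    """
    m = len(dist)
    d_lower = min(dist[i][j] for i, j in combinations(range(m), 2))

    parent = list(range(m))  # union-find forest over cluster indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Every pair at the minimum distance is an edge of the tie graph.
    for i, j in combinations(range(m), 2):
        if dist[i][j] - d_lower <= tol:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(m):
        groups.setdefault(find(i), []).append(i)
    return d_lower, sorted(g for g in groups.values() if len(g) > 1)
```

Note that clusters 0 and 2 can end up in the same component even if they are not tied with each other directly, as long as both are tied with cluster 1.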

A compact pseudocode representation (Fernández et al., 2023):

Input: Distance matrix D among n objects.
Initialize clusters = [{x1}, ..., {xn}]
while len(clusters) > 1:
    d_min = min_{A ≠ B in clusters} D(A, B)
    TieGraph = build_graph_with_edges_for_D_equals_d_min(clusters, D)
    for each connected component M of TieGraph:
        if len(M) > 1:
            supercluster = union(M)
            fusion_interval = [min_{A,B ∈ M} D(A, B), max_{A,B ∈ M} D(A, B)]
            remove every member of M from clusters
            clusters.add(supercluster)
    Recompute all inter-cluster distances via chosen linkage formula
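
The pseudocode can be turned into a short runnable Python sketch. To stay compact, this version is restricted to single and complete linkage, for which the distance between two clusters can be computed directly from the object-level matrix. It also treats ties as exact equality of distances. The function name `multidendrogram` and the shape of its return value are assumptions made for this illustration and do not mirror the API of any existing package.

```python
from itertools import combinations


def multidendrogram(D, linkage=max):
    """Variable-group agglomerative clustering on an n x n distance matrix D.

    linkage: reducer over all object-to-object distances between two
             clusters (max = complete linkage, min = single linkage).
    Returns a list of merges, each as (members, (d_lower, d_upper)).
    """
    clusters = [frozenset([i]) for i in range(len(D))]
    merges = []
    while len(clusters) > 1:
        m = len(clusters)
        # Inter-cluster distances, recomputed from the object-level matrix.
        d = {(i, j): linkage(D[a][b] for a in clusters[i] for b in clusters[j])
             for i, j in combinations(range(m), 2)}
        d_min = min(d.values())

        # Union-find over clusters; edges join pairs tied at d_min.
        parent = list(range(m))

        def find(i):
            while parent[i] != i:
                i = parent[i]
            return i

        for (i, j), dij in d.items():
            if dij == d_min:
                parent[find(i)] = find(j)

        groups = {}
        for i in range(m):
            groups.setdefault(find(i), []).append(i)

        next_clusters = []
        for members in groups.values():
            if len(members) == 1:
                next_clusters.append(clusters[members[0]])  # not merged
                continue
            # Fusion interval: spread of distances inside the tie component.
            d_max = max(d[pair] for pair in combinations(members, 2))
            union = frozenset().union(*(clusters[i] for i in members))
            merges.append((sorted(union), (d_min, d_max)))
            next_clusters.append(union)
        clusters = next_clusters
    return merges
```

For four points on a line at positions 0, 1, 2 and 10, complete linkage merges the first three objects in a single step with fusion interval (1, 2), then joins the remaining object at height 10.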

3. Linkage Criteria and Their Variable-Group Extensions

Common linkage methods—defined for binary merges—extend naturally to variable-group merges. The following table summarizes the formula for computing D(XI,XJ)D(X_I, X_J) for two superclusters XIX_I and XJX_J:

| Linkage Method | Formula | Notes |
|---|---|---|
| Single | $\min_{i\in I,\, j\in J} D(X_i, X_j)$ | Minimum distance |
| Complete | $\max_{i\in I,\, j\in J} D(X_i, X_j)$ | Maximum distance |
| UPGMA | $\frac{1}{\lvert X_I\rvert \lvert X_J\rvert} \sum_{i\in I}\sum_{j\in J} \lvert X_i\rvert \lvert X_j\rvert\, D(X_i, X_j)$ | Unweighted avg. |
| WPGMA | $\frac{1}{\lvert I\rvert \lvert J\rvert} \sum_{i\in I}\sum_{j\in J} D(X_i, X_j)$ | Weighted avg. |
| Ward | $\frac{\lvert X_I\rvert \lvert X_J\rvert}{\lvert X_I\rvert + \lvert X_J\rvert} \lVert \mu_I - \mu_J \rVert^2$ | Centroid-based |
| Centroid | $\lVert \mu_I - \mu_J \rVert^2$ | Cluster means |
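
The first four rows of the table translate almost directly into code. The sketch below assumes the caller supplies the subcluster-to-subcluster distances as a nested list `sub_d` together with the object counts of each subcluster. The function name `group_distance` and its argument layout are hypothetical choices for this example.

```python
def group_distance(sub_d, sizes_i, sizes_j, method):
    """Distance between superclusters X_I and X_J.

    sub_d[a][b]: D(X_i, X_j) for the a-th subcluster of X_I and the
                 b-th subcluster of X_J.
    sizes_i, sizes_j: object counts |X_i| and |X_j| of those subclusters.
    """
    flat = [x for row in sub_d for x in row]
    if method == "single":
        return min(flat)
    if method == "complete":
        return max(flat)
    if method == "wpgma":
        # Every subcluster pair gets the same weight 1 / (|I| |J|).
        return sum(flat) / len(flat)
    if method == "upgma":
        # Subcluster pairs are weighted by the number of object pairs.
        total = sum(si * sj * sub_d[a][b]
                    for a, si in enumerate(sizes_i)
                    for b, sj in enumerate(sizes_j))
        return total / (sum(sizes_i) * sum(sizes_j))
    raise ValueError(f"unknown method: {method}")
```

The only difference between the two averages is the weighting: UPGMA weights each subcluster pair by $|X_i||X_j|$, so every original object contributes equally, while WPGMA gives every subcluster pair equal weight regardless of size.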

A multidendrogram may also be computed using parametric $\beta$-flexible and versatile linkage (Fernández et al., 2023), generalizing the update rules to accommodate group merges.

4. Handling of Ties and Fusion Intervals

Non-binary dendrogram algorithms explicitly address ties in proximity data. All clusters that are part of a connected "tie component" in the tie graph are merged together. The procedure records the fusion interval $[D_\mathrm{lower}, D_\max(X_I)]$ for every non-binary merge, capturing the extent of internal dissimilarity within the fused group. This interval is visually represented as a band at the corresponding dendrogram node.

Standard binary algorithms obscure the presence of ties, breaking them arbitrarily and thus producing different dendrograms for the same data depending on input order or tie-breaking rule. In contrast, multidendrograms guarantee uniqueness for a given distance matrix and linkage rule (Gomez et al., 2012, Gomez et al., 2014). Fusion intervals maintain information about within-group heterogeneity at merge events.
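
The order dependence of pair-group methods is easy to reproduce on a toy example. The sketch below runs ordinary binary complete linkage on three collinear points, with the tie-breaking rule passed in explicitly. The function `binary_complete_linkage` and its `tie_break` parameter are hypothetical names for this illustration only.

```python
from itertools import combinations


def binary_complete_linkage(D, tie_break):
    """Pair-group complete linkage; tie_break picks one of the tied pairs."""
    clusters = [frozenset([i]) for i in range(len(D))]
    merges = []
    while len(clusters) > 1:
        dist = {(a, b): max(D[i][j] for i in a for j in b)
                for a, b in combinations(clusters, 2)}
        d_min = min(dist.values())
        tied = [pair for pair, d in dist.items() if d == d_min]
        a, b = tie_break(tied)  # arbitrary choice among tied pairs
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(a | b), d_min))
    return merges


# Three points on a line at positions 0, 1, 2: pairs (0,1) and (1,2) are tied.
D = [[0, 1, 2],
     [1, 0, 1],
     [2, 1, 0]]
first = binary_complete_linkage(D, tie_break=lambda tied: tied[0])
last = binary_complete_linkage(D, tie_break=lambda tied: tied[-1])
```

Here `first` begins by merging objects 0 and 1, whereas `last` begins by merging objects 1 and 2, so the two dendrograms differ even though the input matrix is identical. A variable-group merge would instead fuse all three objects at once and record the fusion interval $[1, 2]$.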

5. Computational Complexity and Practical Considerations

The variable-group algorithm requires, for each clustering step:

  • $O(m^2)$ search for the minimum current distance among $m$ clusters,
  • $O(m + |E|)$ time to find connected components in the tie graph, with $|E| = O(m^2)$,
  • $O(|I||J|)$ time per supercluster distance update via the generalized Lance–Williams recurrence.

Overall, the worst-case complexity matches standard agglomerative clustering at $O(n^3)$ time and $O(n^2)$ storage for $n$ objects (Gomez et al., 2012, Gomez et al., 2014). Optimizations such as priority queues, nearest-neighbor chains, and condensed storage arrays reduce practical computation, as demonstrated in implementations like MultiDendrograms (Java/C++) and the mdendro R package (Fernández et al., 2023). Empirical performance is frequently observed as $O(n^2)$ time and $O(n^2)$ space.
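
To make the idea of condensed storage concrete, the sketch below stores only the strict upper triangle of a symmetric matrix in a flat list, row by row, and maps a pair $(i, j)$ to its position. This row-major layout is one common convention and is assumed here for illustration; the cited implementations may organize their arrays differently. The names `to_condensed` and `condensed_index` are hypothetical.

```python
def to_condensed(D):
    """Flatten the strict upper triangle of a square matrix, row by row."""
    n = len(D)
    return [D[i][j] for i in range(n) for j in range(i + 1, n)]


def condensed_index(n, i, j):
    """Position of the pair (i, j), i != j, in the condensed array."""
    if i == j:
        raise ValueError("diagonal entries are not stored")
    if i > j:
        i, j = j, i  # symmetry: only i < j is stored
    # Rows 0..i-1 hold n*i - i*(i+1)/2 entries; then offset within row i.
    return n * i - i * (i + 1) // 2 + (j - i - 1)
```

This stores $n(n-1)/2$ values instead of $n^2$, which roughly halves memory while keeping constant-time lookup.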

6. Application Examples and Visualization

Non-binary dendrogram methodology is particularly significant for data sets with discrete or coarse distance/similarity values, common in biological, chemical, or network contexts (Gomez et al., 2014). An example based on the soils data of Morgan (1995) (Gomez et al., 2012) illustrates how traditional complete linkage clustering yields ambiguous binary dendrograms due to three-way ties; multidendrograms resolve this by merging all three soils $\{3, 15, 20\}$ at the same height, producing a ternary node with a fusion interval.

Visualization of multidendrograms distinctly marks non-binary nodes and fusion intervals with horizontal or banded bars, accurately portraying simultaneous mergers and internal heterogeneity. This faithful display of the data's clustering structure reduces ambiguity and enhances interpretability.

7. Advantages, Interpretations, and Summary

Non-binary agglomerative clustering with multidendrograms provides several technical advantages:

  • Deterministic output: Each distance matrix and linkage rule produces a unique dendrogram; no tie-breaking ambiguity.
  • Faithful encoding of ties: Non-binary nodes denote true proximity structure; fusion intervals reflect heterogeneity.
  • Generalization of all classical linkages: The variable-group Lance–Williams framework accommodates standard and parametric linkage rules.
  • Improved interpretability: Especially for data with many ties (e.g., network modularity, genetic marker distances), the multidendrogram approach is critical for correct analysis (Fernández et al., 2023).
  • Software support: Available in MultiDendrograms (Java/C++), mdendro (R) with extended linkage and diagnostic measures such as cophenetic correlation, agglomeration coefficient, and more.

A plausible implication is that in domains where the distance or similarity matrix exhibits frequent ties, the non-binary dendrogram algorithm should be preferred to ensure reproducibility and to avoid artifacts from arbitrary hierarchical splitting. This methodology has been instrumental in structural pattern analysis of complex systems (Gomez et al., 2014) and is applicable whenever non-uniqueness in hierarchical clustering poses challenges for scientific interpretation or inference.
