MIA Clustering Algorithm
- The MIA clustering algorithm is a hierarchical method that quantifies similarity using mutual information in both probabilistic (Shannon) and algorithmic (Kolmogorov) contexts.
- It transforms MI into normalized distance and similarity metrics, enabling recursive cluster formation through an agglomerative approach.
- The method has practical applications in constructing phylogenetic trees in genomics and extracting fetal ECG signals in biomedical signal processing.
The Mutual Information Agglomerative (MIA) clustering algorithm, also known as Mutual Information Clustering (MIC), is an agglomerative hierarchical clustering method in which the similarity between objects and clusters is quantified using mutual information (MI). MIC/MIA exploits the grouping property of MI to construct clusters recursively, providing a unified framework for clustering in both Shannon (probabilistic) and Kolmogorov (algorithmic) information theory contexts. This approach supports applications ranging from sequence analysis in genomics to source separation in signal processing (Kraskov et al., 2008).
1. Theoretical Principles
Mutual information, in both Shannon and algorithmic formulations, measures the shared information between random variables or symbol strings. For discrete random variables X and Y, the Shannon MI is defined as
I(X,Y) = H(X) + H(Y) − H(X,Y),
where H(X) and H(Y) are the marginal entropies and H(X,Y) is the joint entropy. Conditional MI and MI for more than two variables are used in extended formulations.
The grouping property underpins the hierarchical aspect: for three objects X, Y, Z,
I((X,Y), Z) = I(X,Y,Z) − I(X,Y),
where I(X,Y,Z) = H(X) + H(Y) + H(Z) − H(X,Y,Z) is the multi-information; this recursively generalizes to arbitrary cluster merges.
The algorithmic MI (Kolmogorov variant) is approximated for strings x, y as
I_alg(x, y) ≈ K(x) + K(y) − K(xy),
where K(·) denotes Kolmogorov complexity, often estimated via compression.
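These identities can be checked numerically. The sketch below (an illustrative example, not from the source) estimates Shannon MI from discrete samples via I(X,Y) = H(X) + H(Y) − H(X,Y) and verifies the grouping property on toy data:

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy in bits of a sequence of hashable outcomes."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def mi(xs, ys):
    """I(X,Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def multi_info(xs, ys, zs):
    """Multi-information I(X,Y,Z) = H(X) + H(Y) + H(Z) - H(X,Y,Z)."""
    return (entropy(xs) + entropy(ys) + entropy(zs)
            - entropy(list(zip(xs, ys, zs))))

# Toy data: Y copies X (1 bit shared); Z is an independent alternating pattern.
xs = [0, 0, 1, 1] * 25
ys = xs[:]
zs = [0, 1] * 50

print(mi(xs, ys))  # -> 1.0 (Y determines X)

# Grouping property: I((X,Y), Z) = I(X,Y,Z) - I(X,Y)
lhs = mi(list(zip(xs, ys)), zs)
rhs = multi_info(xs, ys, zs) - mi(xs, ys)
print(abs(lhs - rhs) < 1e-12)  # -> True
```

Treating the pair (X, Y) as a single joint variable, exactly as a merged cluster is treated below, is what makes the grouping identity directly usable for hierarchical clustering.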
2. MI-Based Distance and Similarity Metrics
Clustering requires converting MI to a proximity measure. The raw MI-based "distance" is
D(X,Y) = H(X,Y) − I(X,Y),
which is symmetric and non-negative but unnormalized. Normalization yields metrics:
- d(X,Y) = 1 − I(X,Y)/H(X,Y)
- d′(X,Y) = 1 − I(X,Y)/max{H(X), H(Y)}
Both d and d′ are true metrics, satisfying d(X,X) = 0 and the triangle inequality.
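Both normalized distances can be computed directly from empirical entropies; as a sanity check (an illustrative sketch, not from the source), identical variables should be at distance 0 and independent ones at distance 1:

```python
import math
from collections import Counter

def H(samples):
    """Shannon entropy in bits of a sequence of hashable outcomes."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def mi_distances(xs, ys):
    """Return (d, d') where d = 1 - I/H(X,Y) and d' = 1 - I/max(H(X), H(Y))."""
    hx, hy = H(xs), H(ys)
    hxy = H(list(zip(xs, ys)))
    i = hx + hy - hxy
    return 1 - i / hxy, 1 - i / max(hx, hy)

xs = [0, 0, 1, 1] * 25
print(mi_distances(xs, xs))            # identical variables -> (0.0, 0.0)
print(mi_distances(xs, [0, 1] * 50))   # independent pattern -> (1.0, 1.0)
```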
For continuous variables, dimension-normalized similarities are introduced:
- s(X,Y) = I(X,Y)/(n_X + n_Y)
- s′(X,Y) = I(X,Y)/(n_X · n_Y), with n_X, n_Y denoting variable or cluster dimensionality.
3. Algorithmic Workflow
MIC/MIA is an agglomerative hierarchical clustering procedure:
- Initialization: Each object forms a singleton cluster.
- Proximity Calculation: Compute all pairwise proximities (distance d or similarity s).
- Shannon version: Estimate MI via discrete frequency counts or k-nearest-neighbor estimators for continuous data.
- Algorithmic version: Approximate MI by compression lengths, e.g., using XM or lpaq1 compressors.
- Iterative Agglomeration:
- Select cluster pair with minimal distance (or maximal similarity).
- Merge clusters, updating the dendrogram at a height equal to the joint MI or the pairwise merge distance.
- Update proximity matrix by recalculating distances between new cluster and all remaining clusters, treating each merged cluster as a single object via coordinate concatenation (Shannon) or string concatenation (algorithmic).
- Termination: When only two clusters remain, merge them to complete the hierarchy.
The following pseudocode summarizes the procedure:

    initialize each object as a singleton cluster
    compute the pairwise proximity matrix
    while more than one cluster remains:
        select the pair (A, B) with minimal distance (or maximal similarity)
        merge A and B by concatenation; record the merge height in the dendrogram
        recompute proximities between the merged cluster and all remaining clusters
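For the discrete Shannon case, the workflow can be made concrete in a few lines. The following minimal implementation (an illustrative sketch under simplifying assumptions, not the authors' code; it uses a naive O(N^3) pairwise search and the normalized distance d) merges clusters by per-sample coordinate concatenation and records merge heights:

```python
import math
from collections import Counter

def H(samples):
    """Shannon entropy in bits of a sequence of hashable outcomes."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def d(a, b):
    """Normalized MI distance d = 1 - I/H(A,B); a, b are lists of per-sample tuples."""
    joint = [x + y for x, y in zip(a, b)]
    hj = H(joint)
    i = H(a) + H(b) - hj
    return 1 - i / hj if hj > 0 else 0.0

def mic(objects):
    """Agglomerative MI clustering. Each object is a list of per-sample tuples;
    merging = coordinate concatenation. Returns the merge history [(i, j, height)]."""
    clusters = dict(enumerate(objects))
    merges, next_id = [], len(objects)
    while len(clusters) > 1:
        i, j = min(((i, j) for i in clusters for j in clusters if i < j),
                   key=lambda p: d(clusters[p[0]], clusters[p[1]]))
        height = d(clusters[i], clusters[j])
        merged = [x + y for x, y in zip(clusters[i], clusters[j])]
        del clusters[i], clusters[j]
        clusters[next_id] = merged
        merges.append((i, j, height))
        next_id += 1
    return merges

# Three binary channels: objects 0 and 1 are identical, object 2 is independent.
a = [(v,) for v in [0, 0, 1, 1] * 25]
c = [(v,) for v in [0, 1] * 50]
print(mic([a, list(a), c]))  # first merge joins the dependent pair (0, 1) at height 0.0
```

Because each merged cluster is again just a list of per-sample tuples, the same distance function applies unchanged at every level, which is exactly what the grouping property licenses.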
4. Illustrative Applications
MIC/MIA has demonstrated utility in both genomics and biomedical signal processing.
- Phylogenetic Tree Construction: Applied to mitochondrial DNA sequences of 34 mammalian species, each sequence is encoded as a symbol string. Clusters are formed via string concatenation, with MI estimated using compressed sequence lengths. The resulting rooted tree accurately reflects known primate, ferungulate, and other biological groupings. Internal node sequences represent supra-sequences generated by cluster concatenation, not actual ancestors (Kraskov et al., 2008).
- Fetal ECG Extraction: For an 8-channel ECG recording (pregnant woman, 500 Hz, 5 s), the data are delay-embedded (Takens' embedding) and ICA is performed by minimizing the sum of pairwise MIs between outputs via a k-NN estimator. MIC then clusters the resulting 24 least-dependent components using normalized MI similarities, separating two dominant clusters (maternal and fetal ECG). The fetal heartbeat is isolated by projecting the original data onto the fetal cluster subspace, robustly decoupling it despite overlap and noise.
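The compression-based algorithmic MI used in the phylogenetic application can be illustrated with a general-purpose compressor standing in for the XM/lpaq1 compressors of the source (zlib here, chosen only because it is in the Python standard library; it is a weaker compressor, so the estimates are cruder):

```python
import zlib

def K(s: bytes) -> int:
    """Crude upper bound on Kolmogorov complexity: compressed length in bytes."""
    return len(zlib.compress(s, 9))

def algorithmic_mi(x: bytes, y: bytes) -> int:
    """I_alg(x, y) ~ K(x) + K(y) - K(xy), all lengths from the compressor."""
    return K(x) + K(y) - K(x + y)

a = b"ACGT" * 200                                   # repetitive toy "sequence"
b = b"ACGT" * 200                                   # identical copy of a
c = bytes((i * 37 + 11) % 251 for i in range(800))  # unrelated byte pattern

# Shared structure between identical strings yields much higher algorithmic MI.
print(algorithmic_mi(a, b) > algorithmic_mi(a, c))  # -> True
```

In the clustering loop, string concatenation plays the same role that coordinate concatenation plays in the Shannon version: a merged cluster is represented by the concatenation of its members' sequences.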
5. Advantages and Limitations
MIC/MIA offers several conceptual and practical benefits:
- Unified Framework: Applicable to probabilistic (Shannon) and algorithmic (Kolmogorov) information types by virtue of the MI grouping property, allowing identical treatment of objects and clusters.
- Normalization: The normalized proximity measures d and d′ (and their dimension-normalized continuous analogues) mitigate bias from object size, cluster composition, and dimensionality.
- Conceptual Simplicity: Recursion exploits MI’s additivity, generating dendrograms where internal node heights reflect true joint MI or merge-distances.
Reported limitations include:
- MI Estimation Challenges: High-dimensional continuous data or short samples may yield imprecise MI estimates, resulting in dendrogram “glitches” (non-well-formed trees).
- Dependence on Compression: Algorithmic MI effectiveness is constrained by compressor fidelity; small strings or non-normal compressors may violate idempotency/symmetry.
- Scalability: The quadratic O(N²) time and space cost inherent in agglomerative proximity updating limits practicality for large datasets.
- Interpretation in Phylogenetics: Internal nodes in MIC-based phylogenies are phenetic constructs arising from data concatenation and not true lineage ancestors.
6. Practical Implementation Guidelines
Effective application of MIC/MIA requires attention to data and method specifics:
- For Shannon MI, use k-nearest-neighbor estimators with k selected according to bias/variance demands: small k in near-independence scenarios and large k for stable estimates when MI is large.
- Algorithmic MI should use normal compressors (XM, lpaq1), ensuring approximate idempotency and symmetry of compression lengths.
- Always normalize MI by joint entropy, cluster dimension, or maximum marginal entropy to ensure scale-independent clustering.
- Dendrograms may be plotted using joint MI (method 1) for well-formed trees or merge distance (method 2) for interpretable branch lengths, with possible well-formedness violations if MI is misestimated.
- Optionally, post-process MIC trees with topology refinement moves or optimized cluster representations, beyond the concatenation approach.
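The k-NN estimator recommended above can be sketched as follows; this is a brute-force version of the first Kraskov–Stögbauer–Grassberger variant, reporting MI in nats (the estimator is standard, but the implementation details here are an illustrative assumption, not the authors' code):

```python
import math
import random

def psi(n: int) -> float:
    """Digamma at a positive integer: psi(n) = -gamma + H_{n-1}."""
    return -0.5772156649015329 + sum(1.0 / k for k in range(1, n))

def ksg_mi(xs, ys, k=3):
    """KSG estimator (variant 1): I = psi(k) + psi(N) - <psi(nx+1) + psi(ny+1)>."""
    n = len(xs)
    acc = 0.0
    for i in range(n):
        # Distance to the k-th nearest neighbour in the joint space (max-norm).
        dists = sorted(max(abs(xs[i] - xs[j]), abs(ys[i] - ys[j]))
                       for j in range(n) if j != i)
        eps = dists[k - 1]
        # Count strictly-closer neighbours in each marginal space.
        nx = sum(1 for j in range(n) if j != i and abs(xs[i] - xs[j]) < eps)
        ny = sum(1 for j in range(n) if j != i and abs(ys[i] - ys[j]) < eps)
        acc += psi(nx + 1) + psi(ny + 1)
    return psi(k) + psi(n) - acc / n

random.seed(0)
xs = [random.random() for _ in range(200)]
ys = [random.random() for _ in range(200)]  # independent of xs

print(abs(ksg_mi(xs, ys)) < 0.5)  # near zero for independent samples -> True
print(ksg_mi(xs, xs) > 2.0)       # large when Y = X exactly -> True
```

A production implementation would replace the O(N²) neighbour searches with a k-d tree; the brute-force loops are kept here so the estimator's three counting steps are visible.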
MIC/MIA provides a conceptually unified, information-theoretic method for hierarchical clustering, leveraging the exact MI grouping property and normalization. Its documented application in genomics and biomedical signal separation demonstrates robust performance across distinct data regimes (Kraskov et al., 2008).