Hierarchical MI: Theory & Applications

Updated 18 April 2026

Hierarchical Mutual Information (HMI) is an information-theoretic framework that generalizes standard mutual information to analyze multi-level, hierarchically structured data.
It decomposes information recursively and employs normalized metrics to compare hierarchical partitions, enhancing clustering and representation learning.
HMI underpins practical algorithms in network analysis, deep graph representation, and multimodal fusion, offering refined insights into complex data dynamics.

Hierarchical Mutual Information (HMI) is a class of information-theoretic measures and methodologies that generalize and extend standard mutual information (MI) to the analysis, comparison, and optimization of data structures and models with hierarchical or multi-level organization. HMI provides a principled framework for quantifying shared information between hierarchically organized entities, for constructing information-theoretic clustering and representation learning algorithms, and for rigorously comparing hierarchical partitions or cluster trees extracted by community detection or clustering procedures. Modern HMI theory subsumes classical MI as a special case, admits recursive and level-wise decompositions, and has been instantiated in a diverse array of applications ranging from network science to multi-view learning, multimodal fusion, financial time series, and deep graph representation learning.

1. Formal Definitions and Theoretical Foundations

The foundational HMI framework was formalized to address the limitations of standard MI when applied to structured objects such as dendrograms, hierarchical community trees, or nested partitions. Given two hierarchical partitions (trees) $\mathcal{T}$ and $\mathcal{S}$ over a ground set $U$, HMI is defined recursively. Let each node $t$ of a tree have children $T_t$ and correspond to a subset $U_t \subseteq U$. The hierarchical mutual information between subtrees $\mathcal{T}^t$, $\mathcal{S}^s$ (rooted at nodes $t$, $s$) is given by

$$I(\mathcal{T}^t; \mathcal{S}^s) = I(T_t; S_s \mid t, s) + \sum_{a \in T_t} \sum_{b \in S_s} P(a, b \mid t, s)\, I(\mathcal{T}^a; \mathcal{S}^b)$$

where $I(T_t; S_s \mid t, s)$ is the conditional mutual information between the children of $t$ and $s$ restricted to $U_t \cap U_s$. Probability terms $P(a, b \mid t, s) = |U_a \cap U_b| / |U_t \cap U_s|$ denote the proportion of ground elements in both $U_a$ and $U_b$ within $U_t \cap U_s$ (Perotti et al., 2015, Perotti et al., 2020). For leaf nodes, the recursion terminates with $I(\mathcal{T}^t; \mathcal{S}^s) = 0$.

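In balanced trees, HMI has a level-wise decomposition: $I(\mathcal{T}; \mathcal{S}) = \sum_{l} I(\mathcal{T}_{l+1}; \mathcal{S}_{l+1} \mid \mathcal{T}_l, \mathcal{S}_l)$, where $\mathcal{T}_l$, $\mathcal{S}_l$ denote the sets of nodes at level $l$ in each tree (Perotti et al., 2020).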
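The recursive definition can be sketched in a few lines of Python: the HMI of two subtrees is the conditional mutual information between the children of their roots, plus the HMI of every pair of child subtrees weighted by the fraction of shared elements. The nested-list encoding (an innermost list of element labels is a leaf community) and the helper names `elements`, `is_leaf`, and `hmi` are assumptions made for illustration, not the reference implementation of Perotti et al. Natural logarithms are used, so values are in nats.

```python
from math import log

def elements(node):
    """Set of ground elements below a node of a nested-list tree."""
    if node and isinstance(node[0], list):
        out = set()
        for child in node:
            out |= elements(child)
        return out
    return set(node)

def is_leaf(node):
    """A leaf is a list of element labels (no nested lists)."""
    return not (node and isinstance(node[0], list))

def hmi(t, s):
    """Hierarchical mutual information (in nats) between subtrees t and s."""
    if is_leaf(t) or is_leaf(s):
        return 0.0  # the recursion terminates at leaves
    sets_t = [elements(a) for a in t]
    sets_s = [elements(b) for b in s]
    # n[i][j] = number of elements shared by child i of t and child j of s
    n = [[len(ua & ub) for ub in sets_s] for ua in sets_t]
    n_ts = sum(map(sum, n))
    if n_ts == 0:
        return 0.0
    row = [sum(r) for r in n]
    col = [sum(c) for c in zip(*n)]
    total = 0.0
    for i, a in enumerate(t):
        for j, b in enumerate(s):
            if n[i][j] == 0:
                continue
            p_ab = n[i][j] / n_ts
            # conditional MI between the children, restricted to shared elements
            total += p_ab * log(n[i][j] * n_ts / (row[i] * col[j]))
            # recursive contribution of the pair of child subtrees
            total += p_ab * hmi(a, b)
    return total

# Two hierarchies with the same four finest communities but different top-level splits.
T = [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
S = [[[0, 1], [4, 5]], [[2, 3], [6, 7]]]
print(hmi(T, T), hmi(T, S))
```

In this toy case `hmi(T, T)` equals $\ln 4$, while `hmi(T, S)` is zero even though both trees end in identical finest communities, because their top-level splits are statistically independent. A flat comparison of the finest partitions alone would not register that difference.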

Hierarchical entropy, joint entropy, conditional entropy, and variation of information (VI) all admit analogous recursive and level-wise generalizations.

2. Properties, Normalization, and Metric Extensions

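HMI inherits core nonnegativity and symmetry properties from Shannon MI: $I(\mathcal{T}; \mathcal{S}) \geq 0$ and $I(\mathcal{T}; \mathcal{S}) = I(\mathcal{S}; \mathcal{T})$ (Perotti et al., 2015, Perotti et al., 2020). Self-information $I(\mathcal{T}; \mathcal{T})$ defines the hierarchical entropy $H(\mathcal{T})$. A normalized HMI is constructed in direct analogy to the classical normalized mutual information: $\mathrm{NHMI}(\mathcal{T}; \mathcal{S}) = I(\mathcal{T}; \mathcal{S}) / \sqrt{H(\mathcal{T})\, H(\mathcal{S})}$ with $\mathrm{NHMI}(\mathcal{T}; \mathcal{T}) = 1$.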

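Hierarchical variation of information (HVI), defined as $V(\mathcal{T}; \mathcal{S}) = H(\mathcal{T}) + H(\mathcal{S}) - 2\, I(\mathcal{T}; \mathcal{S})$, is not strictly a metric due to potential violation of the triangle inequality for hierarchical partitions. A monotonic transformation

$$d(\mathcal{T}, \mathcal{S}) = f\big(V(\mathcal{T}; \mathcal{S})\big)$$

with an increasing function $f$ specified in the cited work, yields a bona fide metric over hierarchical partition space (Perotti et al., 2020).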

Random-coincidence bias (chance agreement) in HMI is addressed using an adjusted HMI (AHMI) which subtracts the expected value under random hierarchies, normalized by the maximal possible value given the entropies (Perotti et al., 2020).

3. Algorithms and Computational Methodologies

HMI has direct computational instantiations in several domains:

Hierarchical clustering (MIC): The Mutual Information Clustering (MIC) algorithm merges objects or clusters recursively using MI as a similarity measure and leverages the MI grouping property $I(X, Y, Z) = I(X, Y) + I((X, Y), Z)$ to compute cluster similarities exactly after each merge. Distance metrics are normalized forms of conditional entropy (e.g., $D(X, Y) = 1 - I(X, Y)/H(X, Y)$) (Kraskov et al., 2008). Algorithmic MI (Kolmogorov-based) can be used for sequences.
Bayesian clustering with dimensionality correction: Bayesian alternatives employ the log Bayes factor between models of dependence and independence for pairs/clusters, with asymptotic correction for the dimensionality of each cluster via a BIC-like penalty. This yields similarity scores proportional to empirical MI with explicit penalization, and the hierarchy construction is equipped with an automated stopping rule based on the sign of the log Bayes factor (Marrelec et al., 2015).
Graph representation learning: In hierarchical deep graph models, HMI is maximized between local (node/subgraph) and global (graph-level) representations at multiple coarsening levels. Mutual information is estimated via JSD or InfoNCE lower bounds between per-level local features and a global summary, enabling multi-scale representation learning (Ding et al., 2020).
Multi-view and multi-modal learning: HMI-based objectives jointly maximize MI at hierarchical levels, such as (1) inter-view MI for alignment and (2) cross-view MI for recovery in multi-view clustering with missing or misaligned views (Wang et al., 2023), or (1) inter-modality MI and (2) fusion-level MI in multimodal sentiment analysis (Han et al., 2021). Losses are constructed via contrastive or Barber–Agakov lower bounds.
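To make the MI-based clustering entry above concrete, here is a minimal sketch of agglomerative merging with the normalized distance $D(X, Y) = 1 - I(X, Y)/H(X, Y)$ on discrete data. It uses plug-in (frequency-count) entropies, which is an assumption of this sketch rather than the estimator used by Kraskov et al.; the function names `entropy`, `mi`, `mi_distance`, and `mic`, and the toy variables `x`, `y`, `z`, are hypothetical. Clusters are treated as joint variables, so the distance between clusters is recomputed from joint counts after each merge.

```python
from collections import Counter
from math import log

def entropy(*columns):
    """Plug-in joint Shannon entropy (nats) of one or more discrete sequences."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum(c / n * log(c / n) for c in Counter(joint).values())

def mi(xs, ys):
    """I(X, Y) between two groups of columns, each treated as one joint variable."""
    return entropy(*xs) + entropy(*ys) - entropy(*xs, *ys)

def mi_distance(xs, ys):
    """Normalized distance D = 1 - I(X, Y) / H(X, Y)."""
    h = entropy(*xs, *ys)
    return 1.0 - mi(xs, ys) / h if h > 0 else 0.0

def mic(columns):
    """Greedy agglomeration: repeatedly merge the pair of clusters with smallest D."""
    clusters = [(name, [col]) for name, col in columns.items()]
    merges = []
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: mi_distance(clusters[ij[0]][1],
                                                     clusters[ij[1]][1]))
        (na, ca), (nb, cb) = clusters[i], clusters[j]
        merges.append((na, nb))
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters = rest + [((na, nb), ca + cb)]  # merged cluster = joint variable
    return merges

# Toy data: y duplicates x, while z is independent of both.
x = [0, 0, 1, 1, 0, 0, 1, 1]
y = [0, 0, 1, 1, 0, 0, 1, 1]
z = [0, 1, 0, 1, 0, 1, 0, 1]
merges = mic({"x": x, "y": y, "z": z})
print(merges)

# Grouping property: I(X, Y, Z) = I(X, Y) + I((X, Y), Z)
total = entropy(x) + entropy(y) + entropy(z) - entropy(x, y, z)
print(total, mi([x], [y]) + mi([x, y], [z]))
```

With plug-in entropies the grouping property holds as an algebraic identity, so the two printed quantities agree up to floating-point error. With continuous data and estimated MI the identity is what allows cluster-level similarities to be built up consistently from lower-level terms.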

4. Application Domains

HMI and its algorithmic counterparts have been extensively applied:

Network science: Quantitative comparison of hierarchical community structures, robustness analysis, detection algorithm benchmarking, and tracking of temporal evolution of network modularity (Perotti et al., 2015).
Multivariate and time series analysis: Construction of hierarchical, informationally grounded networks via the mutual information rate (MIR), e.g., in financial time series. Lempel–Ziv complexity estimates MIR, which is then used in constructing minimal spanning trees and planar maximally filtered graphs, revealing nonlinear dependencies not detected by Pearson correlation (Fiedor, 2014).
Deep learning and multimodal fusion: Information-maximizing objectives for preserving and transferring task-relevant information in fusion or representation learning pipelines (Han et al., 2021, Ding et al., 2020).
Hierarchical reinforcement learning: HRL architectures use MI maximization between state-action-option tuples to discover diverse and structured behaviors, with option networks trained via advantage-weighted, importance-sampled mutual information terms (Osa et al., 2019, Azarafrooz et al., 2019).
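The information-maximizing objectives used in graph and multimodal models are usually trained through a contrastive lower bound on MI. As a hedged illustration, the sketch below evaluates the InfoNCE bound $I \geq \ln N - \mathcal{L}_{\mathrm{NCE}}$ from a matrix of critic scores, where entry $(i, j)$ scores local feature $i$ against summary $j$ and the diagonal holds the positive pairs. The function name `info_nce_bound` and the example score matrices are assumptions for illustration; in a real model the scores would come from a learned critic, and a hierarchical objective would sum such terms over levels.

```python
from math import exp, log

def info_nce_bound(scores):
    """InfoNCE lower bound on MI (nats) from an N x N score matrix.

    scores[i][i] is the score of the i-th positive pair; the other
    entries in row i act as negatives.
    """
    n = len(scores)
    loss = 0.0
    for i, row in enumerate(scores):
        m = max(row)  # subtract the row maximum for numerical stability
        log_z = m + log(sum(exp(v - m) for v in row))
        loss += log_z - row[i]  # cross-entropy of picking the positive
    loss /= n
    return log(n) - loss

uninformative = [[0.0] * 3 for _ in range(3)]
well_separated = [[10.0, 0.0, 0.0],
                  [0.0, 10.0, 0.0],
                  [0.0, 0.0, 10.0]]
print(info_nce_bound(uninformative), info_nce_bound(well_separated))
```

An uninformative critic gives a bound of zero, while a critic that cleanly separates positives from negatives approaches the ceiling of $\ln N$. That ceiling is one reason batch size matters when such bounds are used as training objectives.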

5. Empirical Findings and Comparative Performance

Empirical studies demonstrate the utility and interpretability of HMI:

Clustering fidelity: HMI-based methods outperform classical MI and correlation-based approaches in recovering ground-truth clusters in simulations with multidimensional data, including applications to functional MRI (Marrelec et al., 2015, Kraskov et al., 2008).
Robustness in networks: Normalized HMI decays smoothly under increasing noise or element rewiring, and estimates shared structure more sensitively at each hierarchical level than flat MI. Benchmarks indicate that methods like Infomap produce high HMI fidelity and consistency compared to hierarchical stochastic block model or recursive Louvain on both synthetic and empirical networks (Perotti et al., 2015).
Multi-view and multimodal learning: Ablation experiments show that both class-level (contrastive) and instance-level (dual-prediction or entropy minimization) HMI objectives are necessary for state-of-the-art clustering or fusion performance; dropping either objective results in measurable performance loss (Han et al., 2021, Wang et al., 2023).
Financial and dynamical systems: MIR-based networks using HMI uncover central nodes and hub-switching events not visible with correlations, indicating sensitivity to nonlinearities and dynamical structure (Fiedor, 2014).
Hierarchical reinforcement learning: Option policies discovered by MI maximization demarcate distinct behavioral modes, resulting in improved task performance and sample efficiency compared to baselines (Osa et al., 2019, Azarafrooz et al., 2019).

6. Limitations, Interpretability, and Open Directions

Despite its theoretical appeal, HMI is not without limitations. The hierarchical variation of information is not a metric unless transformed, and HMI can reflect random agreements in finite data, necessitating careful baseline adjustments (Perotti et al., 2020). Furthermore, accurate estimation in high dimensions is challenging, and normalization/adaptation techniques must be chosen with care to avoid size or dimensionality biases (Kraskov et al., 2008, Marrelec et al., 2015).

On the interpretability axis, methods that maximize HMI at multiple levels produce more structurally meaningful and interpretable latent representations in graph learning and clustering, as evidenced by visualizations of subgraph motifs and community structures (Ding et al., 2020, Perotti et al., 2015). Hierarchical information decomposition provides insights into the scale and localization of shared information between different structured models or datasets.

Open problems include a formal proof of HMI’s normalization bounds, efficient algorithms for large-scale trees or graphs, and a deeper understanding of HMI for trees of varying depth or with missing labels (Perotti et al., 2020).