Hierarchical ICL Maximization
- Hierarchical ICL Maximization is a clustering framework that integrates global ICL maximization via a hybrid genetic algorithm with hierarchical construction through Dirichlet regularization.
- The method optimizes the integrated classification likelihood to simultaneously determine the optimal partition and number of clusters using robust genetic operations.
- Empirical validations on diverse datasets demonstrate its capability to recover both fine- and coarse-grained structures, offering interpretable multi-scale representations.
Hierarchical ICL maximization is a model-based clustering framework that combines the discovery of cluster structure with model selection in discrete latent variable models (DLVMs) via the integrated classification likelihood (ICL) criterion. The approach is a two-phase procedure: (1) global maximization of the ICL using a hybrid genetic algorithm, yielding a partition and a number of clusters, and (2) construction of a multi-scale hierarchy by treating the Dirichlet prior parameter α as a regularizer and applying a log-linear merge rule. The method rigorously connects regularization, model-based clustering, and hierarchy construction, optimizing the exact integrated criterion directly over hard assignments.
1. The Integrated Classification Likelihood (ICL) in DLVMs
The ICL objective quantifies the plausibility of a clustering in a generative latent variable model. For data $X$ with latent hard assignments $Z$, cluster proportions $\Pi$, and within-cluster parameters $\theta$, the ICL is defined as:
$\mathrm{ICL}_{\mathrm{ex}}(Z; \alpha, \beta) = \log \int\!\!\int p(X \mid Z, \theta)\, p(\theta \mid \beta)\, p(Z \mid \Pi)\, p(\Pi \mid \alpha)\, d\theta\, d\Pi = \log p(X \mid Z, \beta) + \log p(Z \mid \alpha)$
where the first term is the integrated conditional log-likelihood and the second term follows from Dirichlet–multinomial conjugacy:
$\log p(Z \mid \alpha) = \log \Gamma(K\alpha) - \log \Gamma(n + K\alpha) + \sum_{k=1}^{K} \log \Gamma(n_k + \alpha) - K \log \Gamma(\alpha)$
with $n_k$ the size of cluster $k$ and $n = \sum_k n_k$.
This decomposition enables principled model selection (choice of the number of clusters K) and accounts for the Occam’s razor effect (penalizing over-complex partitions).
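As a concrete illustration, the Dirichlet–multinomial term above admits a closed form that can be evaluated directly; the following Python sketch (an illustration under the symmetric-Dirichlet assumption, not a reference implementation) computes $\log p(Z \mid \alpha)$ from a hard assignment vector:

```python
import numpy as np
from scipy.special import gammaln

def log_p_z_given_alpha(z, alpha):
    """log p(Z | alpha) for hard assignments z under a symmetric Dirichlet(alpha) prior."""
    z = np.asarray(z)
    n = z.size
    counts = np.bincount(z)          # cluster sizes n_k
    counts = counts[counts > 0]      # drop empty labels
    K = counts.size
    return (gammaln(K * alpha) - gammaln(n + K * alpha)
            + np.sum(gammaln(counts + alpha)) - K * gammaln(alpha))

# Example: 6 observations in 3 clusters, alpha = 1 (uniform prior on the proportions)
print(log_p_z_given_alpha([0, 0, 1, 1, 2, 2], alpha=1.0))
```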
2. Global ICL Maximization via Hybrid Genetic Algorithm
Standard greedy optimization of $\mathrm{ICL}_{\mathrm{ex}}$ is prone to sub-optimal local maxima. The hybrid genetic algorithm (GA) overcomes this by evolving a population of set partitions (not label vectors), which eliminates label-switching redundancy and enables an effective structural crossover:
- Population: Each individual is a partition $P$ of the observations into non-empty, disjoint blocks.
- Fitness: $\mathrm{fit}(P) = \mathrm{ICL}_{\mathrm{ex}}(P)$.
- Genetic operations:
- Selection: Rank-based.
- Crossover: Offspring blocks are all non-empty intersections of two parents.
- Mutation: Randomly splits one block.
- Local search: Greedy merges and swaps within the offspring, always increasing ICL.
- Complexity: Crossover, mutation, and the local swap/merge moves have low per-operation cost and are applied to the offspring only.
Pseudocode excerpt:
```
Initialize population P_1, …, P_V (seeded partitions refined by local swaps).
For g = 1 … G_max:
    Carry the best solution over unchanged (elitism).
    For each of the (V - 1) remaining slots:
        Crossover two selected parents → child partition.
        Apply local merge/split/swap moves that increase the ICL.
        Add the child to the next generation.
Return the best solution found.
```
This procedure jointly selects both the optimal partition and the number of clusters K.
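A minimal sketch of the structural crossover described above, using a list-of-frozensets representation for partitions (the representation and helper name are illustrative assumptions, not the authors' code):

```python
def crossover(parent_a, parent_b):
    """Child blocks are all non-empty pairwise intersections of the parents' blocks."""
    child = []
    for block_a in parent_a:
        for block_b in parent_b:
            inter = block_a & block_b
            if inter:                      # keep only non-empty intersections
                child.append(inter)
    return child

# Example: two partitions of {0, ..., 5}
pa = [frozenset({0, 1, 2}), frozenset({3, 4, 5})]
pb = [frozenset({0, 1}), frozenset({2, 3}), frozenset({4, 5})]
print(crossover(pa, pb))   # blocks {0,1}, {2}, {3}, {4,5}
```

The child is a common refinement of both parents, which is why the subsequent ICL-increasing merge/swap local search is needed to recover coarser, higher-scoring partitions.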
3. Hierarchical Construction Through Dirichlet Prior Regularization
Given an ICL-optimal partition, the second phase builds a hierarchy by leveraging the Dirichlet prior parameter α as a regularization knob controlling granularity:
- Linearized criterion: For small α,
$\mathrm{ICL}_{\mathrm{lin}}(Z, \alpha) = I(Z) + (K-1) \log \alpha$
where $I(Z)$ is the α-free intercept, $I(Z) = \lim_{\alpha \to 0}\big[\mathrm{ICL}_{\mathrm{ex}}(Z; \alpha, \beta) - (K-1)\log\alpha\big]$, collecting the integrated data term and the α-independent part of the Dirichlet–multinomial term.
- Greedy merge rule: At each step, merge the pair of clusters $(g, h)$ for which
$\alpha_{g,h} = \exp\big(\Delta I_{g,h}\big)$
is maximized, where $\Delta I_{g,h}$ is the change in intercept caused by merging $g$ and $h$. Because the merged partition has one cluster fewer, it dominates its parent in $\mathrm{ICL}_{\mathrm{lin}}$ exactly for $\alpha \le \alpha_{g,h}$, so the selected merge is the one with the largest dominance interval.
- Hierarchy construction: For decreasing α, update the partition Z^(k) and the threshold α^(k) using the above rule until a single cluster remains.
Hierarchy pseudocode:
```
Input: partition Z^(K), intercept I(·), α^(K) = 1.
For k = K … 2:
    For all pairs g < h: compute ΔI_{g,h} and α_{g,h}.
    Select (g*, h*) with maximal α_{g,h}.
    Merge g* and h*; update Z^(k-1) and α^(k-1).
Return the sequence of partitions and the α-path.
```
Non-dominant merges (where the new α increases) are pruned, producing an approximate Pareto frontier of optimal granularities.
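To make the merge rule concrete, here is a minimal Python sketch of the Phase II greedy construction; `intercept` is an assumed placeholder callable returning the α-free intercept $I(Z)$ for a partition given as a list of frozensets, and the log-α bookkeeping is a numerical convenience, not part of the method's specification:

```python
def build_hierarchy(partition, intercept):
    """Greedy alpha-path: repeatedly apply the merge with the largest dominance interval."""
    current = list(partition)
    path = [(current, 0.0)]                  # (partition, log alpha at which it becomes optimal); alpha^(K) = 1
    while len(current) > 1:
        best = None
        for i in range(len(current)):
            for j in range(i + 1, len(current)):
                merged = [b for k, b in enumerate(current) if k not in (i, j)]
                merged.append(current[i] | current[j])
                delta_I = intercept(merged) - intercept(current)   # change in intercept
                log_alpha = delta_I          # merged partition dominates for log(alpha) <= delta_I
                if best is None or log_alpha > best[0]:
                    best = (log_alpha, merged)
        log_alpha_star, current = best
        path.append((current, log_alpha_star))
    return path                              # nested partitions with their merge log(alpha) values
```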
4. Theoretical Interpretation and Regularization Path
The hierarchy induced by varying α corresponds to a multi-scale view of the data: high α yields many clusters; low α merges aggressively, yielding a coarser clustering. The linearized criterion is affine in $\log \alpha$, and the explicit α-path offers interpretable model selection and visualization possibilities (dendrograms with the merge values of $\log \alpha$ as heights; optimal leaf ordering via block-sum minimization).
A plausible implication is that the α-regularization perspective connects hierarchical clustering with Bayesian model selection explicitly, such that the entire sequence of clusterings from fine to coarse can be understood as traversing a regularization path in model space (Côme et al., 2020).
5. Empirical Validation and Performance
Extensive empirical evaluation demonstrates the superiority and robustness of hierarchical ICL maximization:
- SBM (stochastic block model) simulations: Hybrid GA recovers all underlying clusters in up to 93% of runs (NMI ≈ 1.0), outperforming greedy and EM/BIC/AIC approaches.
- Mixtures of multinomials: Hybrid GA attains the correct number of clusters more frequently and with higher ICL/NMI than standard methods.
- Real networks: On diverse graphs (Blogs, Books, Football, Jazz), the procedure consistently yields the highest $\mathrm{ICL}_{\mathrm{ex}}$ and the lowest variance across runs, even with small population sizes.
- French parliamentary co-voting: The method initially discovers fine block co-clusters, then hierarchical pruning reveals interpretable coarser structures aligned with real political groupings (NMI ≈ 0.8 with party labels).
- All cases: The hierarchy enables interpretation at multiple scales by extracting a nested set of partitions and their merge α-paths (Côme et al., 2020).
6. Practical Considerations and Generalization
Key hyperparameters include the population size V, the mutation rate, and the maximum number of generations G_max for the GA, as well as the initial value and update schedule for α in Phase II. The GA-based maximization is parallelizable and scalable thanks to its efficient partition representation and operations. Each merge in Phase II is computationally cheap, and the linearization in α is accurate for sufficiently small α.
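Purely as an illustration of how these knobs might be exposed, the names and default values below are hypothetical, not settings reported in the source:

```python
ga_config = {
    "population_size": 20,       # V: number of candidate partitions per generation (hypothetical)
    "max_generations": 50,       # G_max: upper bound on GA iterations (hypothetical)
    "mutation_rate": 0.1,        # probability of splitting a random block of an offspring (hypothetical)
}
hierarchy_config = {
    "alpha_init": 1.0,           # alpha^(K): starting Dirichlet parameter for Phase II
    "prune_non_dominant": True,  # keep only merges on the approximate Pareto frontier
}
```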
A plausible extension is that this framework applies to any DLVM with tractable integrated classification likelihood, including degree-corrected SBMs, mixtures of multinomials, and relational models with conjugate priors. The entire approach is model-agnostic within the family of tractable DLVMs, yielding both cluster structures and interpretable hierarchical relationships under the unified principle of ICL maximization (Côme et al., 2020).