Stability-Validated K-Prototypes Clustering
- The paper presents a stability-based method that evaluates clustering reproducibility by balancing global (between-cluster) and local (within-cluster) stability.
- It leverages the K-Prototypes algorithm to handle mixed numerical and categorical data through tailored perturbations and combined distance measures.
- Empirical results show that the Stadion criterion outperforms traditional indices, ensuring robust model selection while mitigating subcluster overfitting.
Stability-Validated K-Prototypes Clustering is a model selection approach for clustering mixed-type data, rooted in the principle that an optimal clustering should be reproducible under data perturbations and show no stable sub-partitions within its clusters. This methodology, formalized in the Stadion criterion, provides a rigorous internal validation mechanism, extending stability-based cluster validation specifically to the mixed-data regime addressed by the K-Prototypes algorithm. The validity of a clustering configuration is assessed as a trade-off: maximizing stability of the clustering as a whole across perturbations, while ensuring that individual clusters do not admit reproducible internal structure under the same perturbations (Mourer et al., 2020).
1. Theoretical Principle and Formulation
The foundation of the Stability-Validated K-Prototypes framework is the assertion that “a good clustering is one which (a) is stable under small perturbations of the data, and (b) within each cluster no further stable partition exists.” This two-part definition encapsulates both global reproducibility and local indivisibility.
Let $X = \{x_1, \ldots, x_n\}$ denote a dataset with mixed numerical and categorical attributes. For a given number of clusters $K$, K-Prototypes yields a partition $\mathcal{P}_K(X) = \{C_1, \ldots, C_K\}$, and a similarity function $s(\cdot, \cdot)$ quantifies the agreement between two labelings, typically the Adjusted Rand Index (ARI).
Between-Cluster Stability
Between-cluster stability is defined as the mean similarity between a reference clustering and clusterings of perturbed versions of the data:
$$\mathrm{Stab}_B(K) = \frac{1}{T} \sum_{t=1}^{T} s\big(\mathcal{P}_K(X), \mathcal{P}_K(\tilde{X}^{(t)})\big),$$
where $T$ is the number of perturbations and each $\tilde{X}^{(t)}$ is a perturbation of $X$ (additive noise on numerical features; bootstrap or value flips on categorical features).
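As an illustration of this term, the sketch below estimates between-cluster stability on purely numerical data, with scikit-learn's KMeans standing in for K-Prototypes; the function name, noise scale, and toy data are assumptions for the example, not from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def between_cluster_stability(X, k, T=5, eps=0.1, seed=0):
    """Mean ARI between the reference clustering and clusterings of
    T perturbed copies of X (additive Gaussian noise, scale eps).
    KMeans is a stand-in for K-Prototypes on numeric-only data."""
    rng = np.random.default_rng(seed)
    ref = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores = []
    for _ in range(T):
        X_pert = X + rng.normal(scale=eps, size=X.shape)
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pert)
        # Perturbations preserve sample identity, so labelings align by index.
        scores.append(adjusted_rand_score(ref, labels))
    return float(np.mean(scores))

# Two well-separated blobs: K = 2 should be highly stable under small noise.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(5, 0.2, (50, 2))])
stab2 = between_cluster_stability(X, k=2)
```

Because noise is added to copies of the same samples, the ARI can be computed directly on index-aligned labelings, matching the sample-correspondence convention described later.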
Within-Cluster Stability
For a set of candidate subcluster counts $K' \in \mathcal{K}'$ (e.g., $\mathcal{K}' = \{2, 3\}$), each cluster $C_k$ is re-clustered into $K'$ subclusters, both in the original data and under perturbations. For each $k$ and $K'$, within-cluster stability is
$$\mathrm{Stab}_W(k, K') = \frac{1}{T} \sum_{t=1}^{T} s\big(\mathcal{P}_{K'}(C_k), \mathcal{P}_{K'}(\tilde{C}_k^{(t)})\big),$$
where $\tilde{C}_k^{(t)}$ denotes the perturbed version of cluster $C_k$.
Aggregate within-cluster stability is the cluster-size-weighted average of these quantities over all $k$ and $K'$:
$$\mathrm{Stab}_W(K) = \sum_{k=1}^{K} \frac{|C_k|}{n} \cdot \frac{1}{|\mathcal{K}'|} \sum_{K' \in \mathcal{K}'} \mathrm{Stab}_W(k, K').$$
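A minimal sketch of the size-weighted within-cluster term, again with KMeans as a stand-in for K-Prototypes on numerical data (function and parameter names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def within_cluster_stability(X, labels, sub_ks=(2, 3), T=5, eps=0.1, seed=0):
    """Size-weighted mean stability of re-clustering each cluster
    with every candidate subcluster count in sub_ks."""
    rng = np.random.default_rng(seed)
    n = len(X)
    total = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        stabs = []
        for k_sub in sub_ks:
            if len(Xk) <= k_sub:          # too few points to split
                continue
            ref = KMeans(n_clusters=k_sub, n_init=10, random_state=0).fit_predict(Xk)
            for _ in range(T):
                Xp = Xk + rng.normal(scale=eps, size=Xk.shape)
                lab = KMeans(n_clusters=k_sub, n_init=10, random_state=0).fit_predict(Xp)
                stabs.append(adjusted_rand_score(ref, lab))
        if stabs:
            # Weight each cluster's mean sub-stability by its relative size.
            total += (len(Xk) / n) * float(np.mean(stabs))
    return total
```

A cluster that secretly contains two well-separated blobs yields a within-cluster stability near 1, which is exactly the situation the Stadion criterion penalizes.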
Stadion Criterion
The Stadion index is the stability difference to be maximized:
$$\mathrm{Stadion}(K) = \mathrm{Stab}_B(K) - \mathrm{Stab}_W(K).$$
The optimal number of clusters is the maximizer $K^{*} = \arg\max_{K} \mathrm{Stadion}(K)$.
2. Algorithmic Procedure
The procedure for Stability-Validated K-Prototypes is outlined as follows:
- Input: data $X$; candidate cluster counts $K \in \{1, \ldots, K_{\max}\}$; subcluster set $\mathcal{K}'$; number of perturbations $T$; noise-level grid $\Omega = \{\varepsilon_1, \ldots, \varepsilon_L\}$.
- For each $K \in \{1, \ldots, K_{\max}\}$ and each noise level $\varepsilon \in \Omega$:
- Generate perturbed realizations $\tilde{X}^{(1)}, \ldots, \tilde{X}^{(T)}$.
- Cluster each perturbed $\tilde{X}^{(t)}$ with K-Prototypes to obtain $\mathcal{P}_K(\tilde{X}^{(t)})$.
- Compute $\mathrm{Stab}_B(K)$ as the mean similarity (ARI) between $\mathcal{P}_K(X)$ and the $\mathcal{P}_K(\tilde{X}^{(t)})$.
- For every cluster $C_k$ and each $K' \in \mathcal{K}'$:
- Re-cluster $C_k$ and its perturbed versions into $K'$ subclusters.
- Compute $\mathrm{Stab}_W(k, K')$ and aggregate to obtain $\mathrm{Stab}_W(K)$.
- Compute the Stadion index for each $K$ and noise level.
- Aggregate the Stadion index over noise levels (max or mean).
- Return the $K$ maximizing the aggregated Stadion index, along with full stability paths for diagnostic visualization.
Pseudocode is provided in the source for a step-by-step workflow, including cluster restriction, weighting, and perturbation strategies.
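Since the full pseudocode lives in the source, the following is only a compact end-to-end sketch of the workflow on numerical data, with KMeans substituted for K-Prototypes and Stadion-max aggregation over noise levels; all names and default values are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stadion_select(X, k_range=(1, 2, 3, 4), sub_ks=(2,), T=5,
                   noise_levels=(0.1, 1.0), seed=0):
    """Pick the K maximizing the Stadion index, aggregating over
    noise levels by max (Stadion-max). KMeans stands in for
    K-Prototypes on numeric-only data."""
    rng = np.random.default_rng(seed)

    def cluster(data, k):
        # K = 1 (or too few points) means a single trivial cluster.
        if k == 1 or len(data) <= k:
            return np.zeros(len(data), dtype=int)
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)

    scores = {}
    for k in k_range:
        per_level = []
        for eps in noise_levels:
            ref = cluster(X, k)
            perts = [X + rng.normal(scale=eps, size=X.shape) for _ in range(T)]
            # Between-cluster stability: mean ARI vs. the reference.
            stab_b = np.mean([adjusted_rand_score(ref, cluster(Xp, k))
                              for Xp in perts])
            # Within-cluster stability: re-cluster each cluster with each k'.
            stab_w = 0.0
            for c in np.unique(ref):
                idx = ref == c
                vals = []
                for ks in sub_ks:
                    if idx.sum() <= ks:
                        continue
                    sub_ref = cluster(X[idx], ks)
                    vals += [adjusted_rand_score(sub_ref, cluster(Xp[idx], ks))
                             for Xp in perts]
                if vals:
                    stab_w += (idx.sum() / len(X)) * np.mean(vals)
            per_level.append(stab_b - stab_w)
        scores[k] = max(per_level)   # Stadion-max aggregation
    return max(scores, key=scores.get), scores
```

On two well-separated blobs, $K = 1$ is trivially stable globally but its single cluster admits a stable split, so the within-cluster penalty drives its Stadion score toward zero and $K = 2$ wins.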
3. Adaptation to Mixed-Type Data and Perturbation Schemes
K-Prototypes clustering is designed for mixed numerical and categorical datasets, using a combined distance:
$$d(x, y) = \sum_{j \in \mathcal{N}} (x_j - y_j)^2 + \gamma \sum_{j \in \mathcal{C}} \mathbb{1}[x_j \neq y_j],$$
where $\mathcal{N}$ and $\mathcal{C}$ index the numerical and categorical features and $\gamma$ is a fixed weight for categorical mismatches.
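The combined distance can be written down directly; the column-index arguments and the example value of $\gamma$ are illustrative:

```python
def kprototypes_distance(x, y, num_idx, cat_idx, gamma=1.0):
    """Squared Euclidean distance over numerical columns plus
    gamma times the count of categorical mismatches."""
    num = sum((x[j] - y[j]) ** 2 for j in num_idx)
    cat = sum(1 for j in cat_idx if x[j] != y[j])
    return num + gamma * cat

# 0.5**2 on the numeric part + 0.5 * one mismatch = 0.75
d = kprototypes_distance([1.0, 2.0, "red"], [1.5, 2.0, "blue"],
                         num_idx=[0, 1], cat_idx=[2], gamma=0.5)
```

Tuning $\gamma$ controls the relative influence of categorical mismatches against squared numerical differences, which is why numerical features are typically standardized first.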
Perturbation Strategy
- Numerical features: Add uniform or Gaussian noise, after feature scaling to zero mean and unit variance.
- Categorical features: Either bootstrap within-column (resampling with replacement), or independently flip values with small probability $p$, to jitter cluster boundaries.
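A sketch of such a mixed perturbation, assuming the numerical columns are already standardized; the helper name and defaults are hypothetical:

```python
import numpy as np

def perturb_mixed(X_num, X_cat, eps=0.1, p_flip=0.05, seed=0):
    """Gaussian noise (scale eps) on standardized numerical columns;
    each categorical cell is redrawn uniformly from its column's
    observed levels with probability p_flip."""
    rng = np.random.default_rng(seed)
    X_num_p = X_num + rng.normal(scale=eps, size=X_num.shape)
    X_cat_p = X_cat.copy()
    for j in range(X_cat.shape[1]):
        levels = np.unique(X_cat[:, j])
        flip = rng.random(X_cat.shape[0]) < p_flip
        X_cat_p[flip, j] = rng.choice(levels, size=int(flip.sum()))
    return X_num_p, X_cat_p
```

Note that a redrawn value may coincide with the original, so the effective flip rate is somewhat below `p_flip`; drawing from observed levels also keeps column marginals roughly intact.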
Similarity is evaluated via ARI on the induced labelings. When comparing reference and perturbed clusterings, ARI is computed by transferring cluster labels from perturbed samples back to the original observations using sample correspondence.
Alternative Similarity Measures
A cluster co-occurrence matrix $M$, with
$$M_{ij} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\big[\ell^{(t)}(x_i) = \ell^{(t)}(x_j)\big],$$
where $\ell^{(t)}$ assigns each point its cluster label in the $t$-th perturbed clustering, permits alternative definitions of stability, using within- and between-cluster co-occurrences.
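The co-occurrence matrix can be accumulated from the perturbed labelings in a few lines (a sketch; the vectorized pairwise comparison is an implementation choice, not from the paper):

```python
import numpy as np

def cooccurrence_matrix(labelings):
    """M[i, j] = fraction of labelings in which points i and j share
    a cluster; `labelings` is a (T, n) array of integer labels."""
    L = np.asarray(labelings)
    M = np.zeros((L.shape[1], L.shape[1]))
    for lab in L:
        # Outer comparison marks every pair assigned to the same cluster.
        M += lab[:, None] == lab[None, :]
    return M / L.shape[0]

M = cooccurrence_matrix([[0, 0, 1], [0, 1, 1]])
```

Entries near 1 indicate pairs that almost always cluster together; averaging such entries within versus across clusters yields the alternative stability scores mentioned above.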
4. Computational and Practical Considerations
The computational cost is dominated by repeated K-Prototypes clusterings across perturbations, noise levels, and candidate cluster counts: on the order of $L \cdot T \cdot K_{\max} \cdot (1 + |\mathcal{K}'|)$ clustering runs, each costing $O(nKI)$ with $I$ the number of iterations per clustering. In typical practice, $\mathcal{K}'$ contains 2–5 candidate subcluster counts and $T = 5$–10.
Empirical observations suggest that $T = 1$ or 2 already gives low-variance estimates, but $T = 5$–10 is robust. The noise-level grid should extend to a magnitude at which all meaningful structure is dissolved, so that at maximum noise $K = 1$ emerges as optimal.
Initialization procedures (e.g., greedy, multiple restarts) should be consistent across runs to mitigate non-deterministic cluster assignments. Perturbations of the categorical component (bootstrapping or flipping) must be balanced: if category level proportions are highly imbalanced, bootstrapping is recommended.
5. Empirical Performance and Validation Approaches
Mourer et al. (Mourer et al., 2020) evaluated Stadion on 73 numerically-typed datasets (Gaussian mixtures, shapes, UCI benchmarks), comparing over 20 internal validation indices. Stadion-max with additive noise consistently ranked among the top two methods, outperforming conventional indices such as Silhouette, Dunn, Calinski-Harabasz (CH), Gap, and prior stability-based techniques.
For mixed-type evaluation, the prescribed protocol is:
- Assemble or simulate benchmark datasets with ground-truth clusters, including both synthetic (Gaussian numericals + Dirichlet-noise categorical prototypes) and real (e.g., UCI Adult, Heart Disease datasets).
- Apply Stability-Validated K-Prototypes as described above.
- Compare the selected $K^{*}$ to the true number of clusters (win counts), and compute the ARI between the recovered partition and the ground-truth labels.
- Use baselines tailored to mixed data: adaptations of Silhouette with Gower distance, WB-index with Gower, categorical CH, EM-based BIC, mixed-data indices (e.g., E–CV index), or mixed-consensus clustering.
- Inspect stability paths plotted against noise to confirm the expected dissolution of subclusters and cluster merges as perturbation increases.
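The win-count and agreement step of this protocol reduces to a few lines; `evaluate_selection` and its input format are hypothetical conveniences, not from the source:

```python
from sklearn.metrics import adjusted_rand_score

def evaluate_selection(results, true_k, true_labels):
    """results: list of (selected_k, predicted_labels) pairs, one per
    dataset or run. Returns the win count (selected_k == true_k) and
    the mean ARI against the ground-truth labels."""
    wins = sum(k == true_k for k, _ in results)
    aris = [adjusted_rand_score(true_labels, labels) for _, labels in results]
    return wins, sum(aris) / len(aris)
```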
The expected empirical outcome is that Stadion-max remains a top-performing internal criterion for mixed data, contingent on an appropriate balance between numerical noise and categorical perturbation.
6. Connections, Limitations, and Perspectives
Stability-based validation circumvents the ill-defined objectives of unsupervised clustering by appealing to reproducibility under controlled perturbations. The Stadion criterion subsumes previous notions of stability but overcomes the inadequacy of classical stability for detecting underestimation of by penalizing within-cluster stability. This framework is model-agnostic, but its adaptation to K-Prototypes is central for practical handling of heterogeneous attributes.
A plausible implication is that the explicit modeling of both global and local stability can extend to other mixed-type clustering algorithms, provided suitable perturbation and similarity definitions. However, the computational overhead is substantial, and the performance is governed by the adequacy of perturbation schemes and similarity measures for both data types. The method is inherently internal and does not leverage possible external or semi-supervised information.
On balance, empirical results demonstrate top-tier performance for Stadion-max as an internal criterion, but its generalization to non-K-Prototypes algorithms and to very high-dimensional mixed data remains to be explored (Mourer et al., 2020).