
Stability-Validated K-Prototypes Clustering

Updated 10 January 2026
  • The paper presents a stability-based method that evaluates clustering reproducibility by balancing global (between-cluster) and local (within-cluster) stability.
  • It leverages the K-Prototypes algorithm to handle mixed numerical and categorical data through tailored perturbations and combined distance measures.
  • Empirical results show that the Stadion criterion outperforms traditional indices, ensuring robust model selection while mitigating subcluster overfitting.

Stability-Validated K-Prototypes Clustering is a model selection approach for clustering mixed-type data, rooted in the principle that an optimal clustering should be reproducible under data perturbations and show no stable sub-partitions within its clusters. This methodology, formalized in the Stadion criterion, provides a rigorous internal validation mechanism, extending stability-based cluster validation specifically to the mixed-data regime addressed by the K-Prototypes algorithm. The validity of a clustering configuration is assessed as a trade-off: maximizing stability of the clustering as a whole across perturbations, while ensuring that individual clusters do not admit reproducible internal structure under the same perturbations (Mourer et al., 2020).

1. Theoretical Principle and Formulation

The foundation of the Stability-Validated K-Prototypes framework is the assertion that “a good clustering is one which (a) is stable under small perturbations of the data, and (b) within each cluster no further stable partition exists.” This two-part definition encapsulates both global reproducibility and local indivisibility.

Let $X = \{x_1, \dots, x_N\} \subset \mathbb{R}^{p_{\mathrm{num}}} \times \{\text{categories}\}$ denote a dataset with mixed numerical and categorical attributes. For a given $K$, K-Prototypes yields a partition $\mathcal{C}_K = \{C_1, \dots, C_K\}$, and $s(\mathcal{C}, \mathcal{C}') \in [0, 1]$ quantifies the similarity between two labelings, typically using the Adjusted Rand Index (ARI).

Between-Cluster Stability

Between-cluster stability $\mathrm{Stab}_B$ is defined as the mean similarity between a reference clustering and clusterings of $D$ perturbed versions of the data:

$$\mathrm{Stab}_B(\mathcal{A}, X, K) = \frac{1}{D} \sum_{d=1}^{D} s\bigl(\mathcal{C}_K, \mathcal{C}_K^{(d)}\bigr)$$

where $\mathcal{C}_K^{(d)} = \mathcal{A}(X^{(d)}, K)$, and each $X^{(d)}$ is a perturbation of $X$ (additive noise on numerical features, bootstrapping or value flips on categorical features).
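
The computation can be sketched on purely numeric data, using scikit-learn's KMeans as a stand-in for K-Prototypes and Gaussian perturbations; the function name and parameter defaults here are illustrative, not from the source:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def between_cluster_stability(X, K, D=5, eps=0.1, seed=0):
    """Mean ARI between the reference clustering of X and clusterings
    of D additively perturbed copies (numeric features only)."""
    rng = np.random.default_rng(seed)
    ref = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
    scores = []
    for _ in range(D):
        Xd = X + rng.normal(0.0, eps, size=X.shape)  # additive Gaussian noise
        labels_d = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(Xd)
        # rows of Xd correspond one-to-one to rows of X, so labels align directly
        scores.append(adjusted_rand_score(ref, labels_d))
    return float(np.mean(scores))

# two well-separated Gaussian blobs: K = 2 should be highly stable
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
print(between_cluster_stability(X, K=2))  # close to 1.0
```

On well-separated structure, small perturbations leave the partition intact and $\mathrm{Stab}_B$ approaches 1.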

Within-Cluster Stability

For a set $\Omega$ of candidate subcluster counts (e.g., $\{2, 3, 4\}$), each cluster $C_k$ is re-clustered with $K' \in \Omega$ both in the original data and under perturbations. For each $C_k$ and $K'$, within-cluster stability is

$$\mathrm{Stab}_B(\mathcal{A}, C_k, K') = \frac{1}{D} \sum_{d=1}^{D} s\bigl(\mathcal{C}_{k,K'}, \mathcal{C}_{k,K'}^{(d)}\bigr)$$

Aggregate within-cluster stability $\mathrm{Stab}_W$ is the cluster-size-weighted average of these stabilities over all $k$ and $K'$.

Stadion Criterion

The Stadion index is the stability difference to be maximized:

$$\mathrm{Stadion}(\mathcal{A}, X, K, \Omega) = \mathrm{Stab}_B(\mathcal{A}, X, K) - \mathrm{Stab}_W(\mathcal{A}, X, K, \Omega)$$

The optimal $K$ is the maximizer $\hat{K} = \arg\max_{K \in \mathcal{K}_{\text{set}}} \mathrm{Stadion}(\mathcal{A}, X, K, \Omega)$.
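
A toy end-to-end sketch of the criterion on numeric data, with KMeans standing in for K-Prototypes; helper names, noise level, and $\Omega$ are illustrative choices, not prescribed by the source:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability(X, K, D, eps, rng, seed=0):
    """Mean ARI between the clustering of X and clusterings of D noisy copies."""
    ref = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
    scores = []
    for _ in range(D):
        Xd = X + rng.normal(0.0, eps, size=X.shape)
        scores.append(adjusted_rand_score(
            ref, KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(Xd)))
    return float(np.mean(scores))

def stadion(X, K, omega=(2, 3), D=5, eps=0.3, seed=0):
    """Stab_B minus the cluster-size-weighted Stab_W over subcluster counts."""
    rng = np.random.default_rng(seed)
    stab_b = stability(X, K, D, eps, rng, seed)
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
    num = den = 0.0
    for k in range(K):
        Ck = X[labels == k]
        for Kp in omega:
            if len(Ck) <= Kp:  # too few points to form K' subclusters
                continue
            num += len(Ck) * stability(Ck, Kp, D, eps, rng, seed)
            den += len(Ck)
    stab_w = num / den if den else 0.0
    return stab_b - stab_w

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
print({K: round(stadion(X, K), 2) for K in (2, 3, 4)})  # typically peaks at K = 2
```

At the true $K$, the whole partition reproduces under noise (high $\mathrm{Stab}_B$) while the internal sub-splits of each Gaussian blob do not (low $\mathrm{Stab}_W$), so the difference is largest there.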

2. Algorithmic Procedure

The procedure for Stability-Validated K-Prototypes is outlined as follows:

  • Input: data $X$; candidate cluster counts $\mathcal{K}_{\text{set}}$; subcluster set $\Omega$; number of perturbations $D$; noise-level grid $\{\epsilon_1, \dots, \epsilon_M\}$.
  • For each $K$ and each noise level $\epsilon_m$:
    • Generate $D$ perturbed realizations $X^{(d)}$.
    • Cluster each perturbed $X^{(d)}$ with K-Prototypes to obtain $\mathcal{C}_K^{(d)}$.
    • Compute $\mathrm{Stab}_B$ as the mean similarity (ARI) between $\mathcal{C}_K$ and the $\mathcal{C}_K^{(d)}$.
    • For every cluster $C_k$ and each $K' \in \Omega$:
      • Re-cluster $C_k$ and its perturbed versions $C_k^{(d)}$ with $K'$.
      • Compute $\mathrm{Stab}_B(\mathcal{A}, C_k, K')$ and aggregate to obtain $\mathrm{Stab}_W$.
    • Compute the Stadion index for this $K$ and noise level.
  • Aggregate the Stadion index over noise levels (max or mean).
  • Return the $K$ maximizing the aggregated Stadion index, along with the full stability paths for diagnostic visualization.

Pseudocode is provided in the source for a step-by-step workflow, including cluster restriction, weighting, and perturbation strategies.

3. Adaptation to Mixed-Type Data and Perturbation Schemes

K-Prototypes clustering is designed for mixed numerical and categorical datasets, using a combined distance

$$d(x, y) = \sum_{j \in \text{num}} (x_j - y_j)^2 + \gamma \sum_{j \in \text{cat}} \mathbf{1}_{\{x_j \neq y_j\}}$$

where $\gamma$ is a fixed weight for categorical mismatches.
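
The combined distance is straightforward to state in code; this minimal sketch assumes the numeric and categorical parts are passed separately, and the value of $\gamma$ is purely illustrative:

```python
def kproto_distance(x_num, y_num, x_cat, y_cat, gamma=0.5):
    """Squared Euclidean distance on the numeric part plus gamma times
    the number of categorical mismatches (simple matching dissimilarity)."""
    num_part = sum((a - b) ** 2 for a, b in zip(x_num, y_num))
    cat_part = sum(a != b for a, b in zip(x_cat, y_cat))
    return num_part + gamma * cat_part

# numeric parts differ by (0, 2); one of two categorical values mismatches
d = kproto_distance([1.0, 2.0], [1.0, 0.0], ["red", "S"], ["red", "M"], gamma=0.5)
print(d)  # 4.0 + 0.5 * 1 = 4.5
```

In practice $\gamma$ controls the trade-off between the two attribute types and is typically tuned to the numeric feature scale.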

Perturbation Strategy

  • Numerical features: add uniform $\pm\epsilon$ or Gaussian $\mathcal{N}(0, \sigma^2)$ noise, after scaling each feature to zero mean and unit variance.
  • Categorical features: either bootstrap within each column (resampling with replacement), or independently flip values with a small probability $p \approx \epsilon_{\text{cat}}$, to jitter cluster boundaries.
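
A minimal sketch of such a mixed perturbation, assuming the numeric columns are already standardized; the function name and defaults are illustrative:

```python
import numpy as np

def perturb_mixed(X_num, X_cat, eps=0.1, p_flip=0.05, rng=None):
    """Return one perturbed copy: additive Gaussian noise on the numeric
    columns, independent value flips with probability p_flip on the
    categorical columns (flipped values drawn from the observed levels)."""
    rng = rng if rng is not None else np.random.default_rng()
    Xn = X_num + rng.normal(0.0, eps, size=X_num.shape)
    Xc = X_cat.copy()
    for j in range(Xc.shape[1]):
        levels = np.unique(X_cat[:, j])          # observed category levels
        flip = rng.random(Xc.shape[0]) < p_flip  # rows to flip in this column
        Xc[flip, j] = rng.choice(levels, size=int(flip.sum()))
    return Xn, Xc

rng = np.random.default_rng(0)
X_num = rng.normal(size=(100, 2))
X_cat = rng.choice(["a", "b", "c"], size=(100, 1))
Xn, Xc = perturb_mixed(X_num, X_cat, eps=0.1, p_flip=0.05, rng=rng)
```

Because rows of the perturbed copy correspond one-to-one to the original observations, cluster labels transfer directly for the ARI comparison described below.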

Similarity is evaluated via ARI on the induced labelings. When comparing reference and perturbed clusterings, ARI is computed by transferring cluster labels from perturbed samples back to the original observations using sample correspondence.

Alternative Similarity Measures

A cluster co-occurrence matrix $M \in [0,1]^{N \times N}$, with

$$M_{ij} = \frac{1}{D} \sum_{d=1}^{D} \mathbf{1}_{\{\ell^{(d)}(i) = \ell^{(d)}(j)\}}$$

where $\ell^{(d)}(i)$ denotes the cluster label of observation $i$ in the $d$-th perturbed clustering, permits alternative definitions of stability based on within- and between-cluster co-occurrences.
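
The co-occurrence matrix is easy to compute from a stack of label vectors; a small self-contained sketch:

```python
import numpy as np

def cooccurrence_matrix(label_runs):
    """M[i, j] = fraction of clustering runs in which points i and j
    are assigned to the same cluster."""
    L = np.asarray(label_runs)                # shape (D, N): D runs, N points
    M = np.zeros((L.shape[1], L.shape[1]))
    for labels in L:
        M += (labels[:, None] == labels[None, :]).astype(float)
    return M / L.shape[0]

# two runs over four points: pair (0,1) always co-clustered,
# pairs (1,2) and (2,3) co-clustered in half of the runs
M = cooccurrence_matrix([[0, 0, 1, 1],
                         [0, 0, 0, 1]])
print(M[0, 1], M[1, 2], M[2, 3])  # 1.0 0.5 0.5
```

Entries near 1 inside clusters and near 0 across clusters indicate a stable partition.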

4. Computational and Practical Considerations

The computational cost is dominated by repeated K-Prototypes clusterings across perturbations, noise levels, and clusters:

$$O\bigl( (K N T) D M + K |\Omega| (N_k T) D M \bigr) \approx O(K |\Omega| N T D M)$$

with $T$ the number of iterations per clustering and $N_k$ the size of cluster $k$. In typical practice, $|\Omega|$ is 2–5, $D \approx 5$–10, and $M \approx 10$.

Empirical observations suggest that $D = 1$ or $2$ already gives low-variance estimates, but $D = 5$–10 is more robust. The noise-level grid $\epsilon \in [0, \epsilon_{\max}]$ should extend to roughly $\epsilon_{\max} \approx \sqrt{p_{\mathrm{num}}}$, so that at maximum noise all meaningful structure is dissolved and $K = 1$ emerges as optimal.

Initialization procedures (e.g., greedy, multiple restarts) should be consistent across runs to mitigate non-deterministic cluster assignments. Perturbations of the categorical component (bootstrapping or flipping) must be balanced: if category level proportions are highly imbalanced, bootstrapping is recommended.

5. Empirical Performance and Validation Approaches

Mourer et al. (2020) evaluated Stadion on 73 numerical datasets (Gaussian mixtures, shapes, UCI benchmarks), comparing it against more than 20 internal validation indices. Stadion-max with additive noise consistently ranked among the top two methods, outperforming conventional indices such as Silhouette, Dunn, Calinski-Harabasz (CH), Gap, and prior stability-based techniques.

For mixed-type evaluation, the prescribed protocol is:

  1. Assemble or simulate benchmark datasets with ground-truth clusters, including both synthetic (Gaussian numericals + Dirichlet-noise categorical prototypes) and real (e.g., UCI Adult, Heart Disease datasets).
  2. Apply Stability-Validated K-Prototypes as described above.
  3. Compare the selected $\hat{K}$ to the true $K^*$ (win counts), and compute $\mathrm{ARI}(\mathcal{C}_{\hat{K}}, \mathcal{C}_{K^*})$.
  4. Use baselines tailored to mixed data: adaptations of Silhouette with Gower distance, WB-index with Gower, categorical CH, EM-based BIC, mixed-data indices (e.g., E–CV index), or mixed-consensus clustering.
  5. Inspect stability paths plotted against noise to confirm the expected dissolution of subclusters and cluster merges as perturbation increases.
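
Step 3 relies on the fact that ARI is invariant to cluster relabeling: a recovered partition that matches the ground truth up to a permutation of labels scores exactly 1. A quick check with scikit-learn:

```python
from sklearn.metrics import adjusted_rand_score

true_labels   = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # ground truth, K* = 3
chosen_labels = [1, 1, 1, 0, 0, 0, 2, 2, 2]  # same partition, labels permuted
print(adjusted_rand_score(true_labels, chosen_labels))  # 1.0
```

This is why no label-matching step is needed when scoring the selected clustering against the ground truth.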

The expected empirical outcome is that Stadion-max remains a top-performing internal criterion for mixed data, contingent on an appropriate balance between numerical noise and categorical perturbation.

6. Connections, Limitations, and Perspectives

Stability-based validation circumvents the ill-defined objectives of unsupervised clustering by appealing to reproducibility under controlled perturbations. The Stadion criterion subsumes previous notions of stability but overcomes the inadequacy of classical stability for detecting underestimation of KK by penalizing within-cluster stability. This framework is model-agnostic, but its adaptation to K-Prototypes is central for practical handling of heterogeneous attributes.

A plausible implication is that the explicit modeling of both global and local stability can extend to other mixed-type clustering algorithms, provided suitable perturbation and similarity definitions. However, the computational overhead is substantial, and the performance is governed by the adequacy of perturbation schemes and similarity measures for both data types. The method is inherently internal and does not leverage possible external or semi-supervised information.

In summary, empirical results demonstrate top-tier performance for Stadion-max as an internal criterion, but its generalization to algorithms other than K-Prototypes or to very high-dimensional mixed data remains to be explored (Mourer et al., 2020).

References

Mourer, A., Forest, F., Lebbah, M., Azzag, H., & Lacaille, J. (2020). Selecting the Number of Clusters K with a Stability Trade-off: An Internal Validation Criterion. arXiv:2006.08530.
