Recovering the number of clusters in data sets with noise features using feature rescaling factors (1602.06989v1)

Published 22 Feb 2016 in stat.ML and cs.LG

Abstract: In this paper we introduce three methods for re-scaling data sets aiming at improving the likelihood of clustering validity indexes to return the true number of spherical Gaussian clusters with additional noise features. Our method obtains feature re-scaling factors taking into account the structure of a given data set and the intuitive idea that different features may have different degrees of relevance at different clusters. We experiment with the Silhouette (using squared Euclidean, Manhattan, and the $p^{th}$ power of the Minkowski distance), Dunn's, Calinski-Harabasz and Hartigan indexes on data sets with spherical Gaussian clusters with and without noise features. We conclude that our methods indeed increase the chances of estimating the true number of clusters in a data set.

Citations (312)

Summary

  • The paper proposes three novel feature rescaling methods, including weighted K-Means variants, designed to improve accuracy when estimating the number of clusters (K) in datasets containing noise features.
  • Experimental results demonstrate that these feature rescaling techniques substantially improve the accuracy and robustness of cluster number estimation, mitigating the impact of noise features even when the number of added noise features reaches 100% of the original feature count.
  • This research provides valuable practical tools for performing robust cluster analysis in complex, high-dimensional data where noise is prevalent and suggests future work on parameter automation and application to other clustering paradigms.

Overview of Feature Rescaling for Cluster Number Recovery

The paper "Recovering the number of clusters in data sets with noise features using feature rescaling factors" introduces methods that address a core challenge in cluster analysis — accurately estimating the number of clusters (K) in a dataset, particularly when noise features are present. This challenge is significant given that many clustering algorithms, such as K-Means, require the number of clusters to be provided a priori and may perform suboptimally when irrelevant ("noise") features are included.

Methods and Approach

The authors propose three methods that rescale features to increase the likelihood that clustering validity indexes identify the correct number of spherical Gaussian clusters. The methods weight features by their relevance within individual clusters, on the assumption that a feature's degree of relevance may differ from cluster to cluster.

  1. Minkowski Weighted K-Means (MWK-Means): This method extends K-Means with a weighted Minkowski distance, assigning each feature a weight inversely related to its dispersion within each cluster. The general Minkowski metric both accommodates potentially non-spherical cluster shapes and lets the weights be interpreted as feature rescaling factors.
  2. iMWK-Means with Explicit Rescaling: Here, the dataset and centroid values are explicitly rescaled, with the computed weights directly changing feature magnitudes. The aim is to transform the dataset so that its clustering structure becomes more discernible.
  3. iMWK-Means with Explicit Rescaling followed by K-Means: After MWK-Means suggests an initial feature weighting, the explicitly rescaled dataset is re-clustered with traditional K-Means. This combination pairs MWK-Means' strength at identifying feature relevance with the robustness of K-Means partitioning (a sketch of this pipeline follows the list).
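The core mechanics can be sketched compactly. The Python sketch below is illustrative rather than a reproduction of the authors' algorithm: it uses scikit-learn's KMeans in place of a Minkowski-metric clustering loop, an MWK-Means-style dispersion-based weight update, and an assumed averaging step to collapse cluster-specific weights into a single rescaling factor per feature.

```python
import numpy as np
from sklearn.cluster import KMeans

def minkowski_dispersion(X, labels, centers, p):
    """Per-cluster, per-feature dispersion: sum_i |x_iv - c_kv| ** p."""
    D = np.zeros_like(centers)
    for k in range(centers.shape[0]):
        D[k] = (np.abs(X[labels == k] - centers[k]) ** p).sum(axis=0)
    return D

def mwk_weights(D, p, eps=1e-12):
    """MWK-Means-style weights, inversely related to dispersion (p > 1):
    w_kv = 1 / sum_u (D_kv / D_ku) ** (1 / (p - 1))."""
    D = D + eps                                  # guard against zero dispersion
    W = np.zeros_like(D)
    for k in range(D.shape[0]):
        ratios = D[k][:, None] / D[k][None, :]   # ratios[v, u] = D_kv / D_ku
        W[k] = 1.0 / (ratios ** (1.0 / (p - 1))).sum(axis=1)
    return W

def rescale_then_kmeans(X, k, p=1.6, seed=0):
    """Sketch of method 3: derive feature weights from an initial clustering,
    rescale the data explicitly, then re-cluster with plain K-Means."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    D = minkowski_dispersion(X, km.labels_, km.cluster_centers_, p)
    w = mwk_weights(D, p).mean(axis=0)   # assumption: average cluster weights
    X_rescaled = X * w                   # noise features shrink toward zero
    km2 = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_rescaled)
    return km2, X_rescaled
```

Because noise features have high within-cluster dispersion everywhere, their weights (and hence their rescaled magnitudes) shrink, which is what makes the cluster structure easier for a validity index to detect.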

The paper evaluates these methods with several clustering validity indexes, namely the Silhouette, Dunn's, Calinski-Harabasz, and Hartigan indexes, in the presence of noise features. The experimental setup uses scenarios with increasing percentages of noise features to assess the robustness and efficacy of the proposed methods relative to baseline K-Means approaches.
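As a concrete illustration of this protocol, the sketch below sweeps candidate values of K and keeps the one that maximizes a validity index. It is a simplification under assumptions: scikit-learn's silhouette_score and calinski_harabasz_score stand in for the paper's index implementations (Dunn's index is not in scikit-learn, and Hartigan's index uses a threshold rule rather than a simple argmax).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

def estimate_k(X, k_range=range(2, 11), index=calinski_harabasz_score, seed=0):
    """Cluster X for each candidate K and return the K that maximizes the
    chosen validity index (both indexes used here peak at good partitions)."""
    best_k, best_score = None, -np.inf
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = index(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Usage: compare the estimate on raw vs. rescaled data (see previous sketch).
# k_hat = estimate_k(X_rescaled, index=silhouette_score)
```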

Results and Discussion

The proposed feature rescaling methods, particularly explicit rescaling followed by K-Means, significantly improve the accuracy of cluster number estimation and reduce relative errors across the noise scenarios. Notably, the distortion introduced by noise features is effectively mitigated, with the algorithms remaining resilient even when the number of added noise features reaches 100% of the original feature count.

The paper highlights that the hyperparameter p in the Minkowski distance critically affects performance, with values of p from 1.4 to 1.8 generally yielding optimal results, except for some cases where p ≥ 2 was preferable with the Silhouette index in the Manhattan-distance configuration.
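For reference, a minimal sketch of the quantity this exponent controls; the helper name is ours, and omitting the 1/p root mirrors the squared-Euclidean convention mentioned in the abstract.

```python
import numpy as np

def minkowski_p_power(x, y, p):
    """p-th power of the Minkowski distance: sum_v |x_v - y_v| ** p.
    At p = 2 this reduces to squared Euclidean distance; the paper reports
    p roughly in [1.4, 1.8] working best in most configurations."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p))
```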

Implications and Future Directions

The implications of this work are twofold. Theoretically, it advances understanding of the role that feature weighting and rescaling play in clustering data affected by noise. Practically, it offers a clear mechanism for improving clustering outcomes, which can be particularly beneficial for datasets where the number of clusters is unknown or where noise features make up a substantial share of the data.

Future research directions include automating the selection of the Minkowski parameter p and improving initialization strategies to reduce the computational overhead of multiple K-Means runs. Furthermore, exploring the applicability of the rescaling methodology beyond K-Means to other clustering paradigms, such as hierarchical or density-based methods, appears promising.

In conclusion, this paper contributes a structured approach to one of clustering's persistent challenges: robustly recovering the true number of clusters amidst noise. By leveraging intelligent feature rescaling, these methods provide powerful tools for researchers working with complex, high-dimensional data.