- The paper demonstrates that EM outperforms CEM and HAC in clustering quality, as measured by log marginal likelihood and classification accuracy.
- It shows that EM's soft assignment strategy more effectively escapes local optima compared to the deterministic approach of CEM.
- The study finds that Marginal initialization matches the cluster quality of costlier data-dependent methods at much lower computational cost, making it a practical choice for high-dimensional data analysis.
An Experimental Comparison of Several Clustering and Initialization Methods
The paper "An Experimental Comparison of Several Clustering and Initialization Methods" by Marina Meilă and David Heckerman presents a thorough analysis of various clustering algorithms and initialization techniques, focusing predominantly on high-dimensional data. The core objective of the paper is to compare the efficiency and effectiveness of different batch clustering algorithms—namely Expectation-Maximization (EM), Classification EM (CEM), and model-based hierarchical agglomerative clustering (HAC)—in learning naive-Bayes models from high-dimensional discrete-variable datasets.
Overview of Clustering Algorithms
This paper examines the EM algorithm's efficacy in clustering, especially in comparison with CEM and HAC. EM is found to be superior across several criteria, including log marginal likelihood, holdout likelihood, and classification accuracy, despite requiring more computation than CEM. HAC, meanwhile, proves both slower and less effective than either EM or CEM.
- Expectation-Maximization (EM): EM iteratively optimizes the likelihood by alternating between expectation and maximization steps. Unlike CEM, EM allows fractional cluster assignments, enabling it to escape local optima more effectively and resulting in a higher quality of clustering.
- Classification EM (CEM): A variant of EM that employs a “winner take all” strategy similar to K-means. CEM's deterministic nature means it converges faster than EM, but often at the expense of clustering quality, reflected in its tendency to underestimate the number of clusters.
- Model-based Hierarchical Agglomerative Clustering (HAC): HAC builds a hierarchical decomposition of the data set, agglomerating clusters based on a probabilistic model. While HAC does not require initialization, its computational inefficiency and less accurate clustering outcomes make it less preferable, particularly in higher dimensions.
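The contrast between EM's fractional assignments and CEM's winner-take-all step can be made concrete with a small sketch. The code below is an illustrative implementation for a mixture of naive-Bayes models over binary features, not the authors' code: the smoothing constants, random initialization, and `hard` flag (which switches EM into CEM) are our own assumptions.

```python
import numpy as np

def em_naive_bayes(X, k, n_iter=50, hard=False, seed=0):
    """One EM run (or CEM, if hard=True) for a mixture of naive-Bayes
    models over binary features. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)                   # mixture weights
    theta = rng.uniform(0.25, 0.75, (k, d))    # P(x_j = 1 | cluster)
    for _ in range(n_iter):
        # E step: unnormalized log P(cluster | x) for each case
        log_p = (np.log(pi)
                 + X @ np.log(theta).T
                 + (1 - X) @ np.log(1 - theta).T)
        log_p -= log_p.max(axis=1, keepdims=True)   # numeric stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)      # EM: soft responsibilities
        if hard:                               # CEM: winner take all
            r = np.eye(k)[r.argmax(axis=1)]
        # M step, with light smoothing to avoid log(0)
        nk = r.sum(axis=0) + 1e-9
        pi = nk / nk.sum()
        theta = (r.T @ X + 1.0) / (nk[:, None] + 2.0)
    return pi, theta, r
```

With `hard=False`, each case contributes fractionally to every cluster's parameter update, which is the "soft" assignment the paper credits with escaping local optima; with `hard=True`, each case is committed to a single cluster before the M step, which is faster but coarser.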
Analysis of Initialization Methods
The paper presents a comparative analysis of three initialization methods for EM and CEM:
- Random Initialization: Samples parameters from a non-informative prior; it consistently performs worse than the data-dependent methods.
- Noisy-Marginal (Marginal): Perturbs the marginal maximum-likelihood or maximum a posteriori parameter estimates to produce distinct starting points for each cluster. Its chief advantage is efficiency without loss of cluster quality.
- Hierarchical Agglomerative Clustering (HAC) Initialization: Although HAC requires significant computation time, the clusterings it seeds are only comparable to those from Marginal, which makes Marginal the more practical choice for initial parameter estimation.
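A minimal sketch of a Marginal-style initialization for the same binary-feature setting may help. As we read the paper, cluster parameters are drawn from a distribution centered at the (smoothed) data marginals; here a Gaussian perturbation stands in for that noise model, so the noise scale and clipping bounds are illustrative assumptions, not the paper's.

```python
import numpy as np

def noisy_marginal_init(X, k, noise=0.1, seed=0):
    """Noisy-Marginal initialization sketch for binary features:
    start every cluster at the smoothed data marginals, then add
    noise so the k starting points differ. Illustration only."""
    n, d = X.shape
    marg = (X.sum(axis=0) + 1.0) / (n + 2.0)   # smoothed P(x_j = 1)
    rng = np.random.default_rng(seed)
    # Perturb the shared marginal estimate independently per cluster
    theta0 = np.clip(marg + rng.normal(0.0, noise, (k, d)), 0.01, 0.99)
    pi0 = np.full(k, 1.0 / k)                  # uniform mixture weights
    return pi0, theta0
```

The appeal is visible in the code: the only pass over the data computes column marginals, so the cost is linear in the data size, versus the pairwise-merge cost of running HAC just to obtain a starting point.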
Experimental Insights
Experiments conducted on synthetic and real-world datasets (such as handwritten digits) highlighted the strengths and weaknesses of the clustering algorithms and initialization procedures. Notably, EM dominates despite running slower than CEM, which supports the hypothesis that clustering quality correlates with "soft" assignment of cases. The empirical evidence leads to the conclusion that EM combined with Marginal initialization strikes the best balance between computational cost and clustering accuracy.
Implications and Future Directions
The findings from Meilă and Heckerman's research provide valuable insights for applying clustering techniques in high-dimensional data domains, suggesting a preference for EM with Marginal initialization. Future research directions could include exploring these algorithms' performance in more diverse datasets, particularly those with higher dimensions or containing continuous variables. Additionally, comparing EM with advanced variants, such as those incorporating conjugate-gradient acceleration, may offer further improvements in clustering efficacy.
In conclusion, this paper offers a comprehensive framework for understanding and applying clustering and initialization techniques in complex data analysis scenarios, providing a foundation for numerous applications in machine learning and data mining.