
Statistical power for cluster analysis (2003.00381v3)

Published 1 Mar 2020 in stat.ML, cs.LG, and q-bio.QM

Abstract: Cluster algorithms are increasingly popular in biomedical research due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream software. While guidelines exist for algorithm selection and outcome evaluation, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we estimated power and accuracy for common analysis pipelines through simulation. We varied subgroup size, number, separation (effect size), and covariance structure. We then subjected generated datasets to dimensionality reduction (none, multidimensional scaling, or UMAP) and cluster algorithms (k-means, agglomerative hierarchical clustering with Ward or average linkage and Euclidean or cosine distance, HDBSCAN). Finally, we compared the statistical power of discrete (k-means), "fuzzy" (c-means), and finite mixture modelling approaches (which include latent profile and latent class analysis). We found that outcomes were driven by large effect sizes or the accumulation of many smaller effects across features, and were unaffected by differences in covariance structure. Sufficient statistical power was achieved with relatively small samples (N=20 per subgroup), provided cluster separation is large (Δ=4). Fuzzy clustering provided a more parsimonious and powerful alternative for identifying separable multivariate normal distributions, particularly those with slightly lower centroid separation (Δ=3). Overall, we recommend that researchers 1) only apply cluster analysis when large subgroup separation is expected, 2) aim for sample sizes of N=20 to N=30 per expected subgroup, 3) use multidimensional scaling to improve cluster separation, and 4) use fuzzy clustering or finite mixture modelling approaches that are more powerful and more parsimonious with partially overlapping multivariate normal distributions.

Citations (225)

Summary

  • The paper rigorously examines the statistical power and classification accuracy of cluster analysis methods using extensive simulation experiments.
  • Cluster separation is a primary driver of successful clustering outcomes, and dimensionality reduction techniques like MDS or UMAP can significantly impact this separation.
  • Achieving sufficient power often requires modest sample sizes (N=20-30 per subgroup) provided cluster separation is large, with fuzzy clustering offering advantages for partially overlapping groups.

Statistical Power for Cluster Analysis: A Detailed Examination

The paper "Statistical Power for Cluster Analysis" by Dalmaijer, Nord, and Astle provides a rigorous exploration into the statistical power inherent in cluster analysis, particularly within the field of biomedical research. This research is pivotal in understanding how well cluster analysis can identify discrete subgroups within data, a capability increasingly leveraged due to advancements in computational power and the widespread availability of clustering algorithms in software like Python and R.

The authors address a critical gap in the application of cluster analysis: the absence of firmly established guidelines for estimating a priori statistical power. Their work clarifies power estimates and classification accuracy for prevalent analysis pipelines through meticulous simulation experiments. Subgroup size, number, separation, and covariance structure were systematically varied, after which the generated datasets were subjected to dimensionality reduction and clustering with algorithms such as k-means, agglomerative hierarchical clustering, and HDBSCAN.
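One stage of such a pipeline is the dimensionality-reduction step. As an illustration of the MDS variant, here is a minimal classical (Torgerson) MDS in plain NumPy; this is our own sketch for exposition, not the authors' implementation, and the function name and parameters are ours:

```python
import numpy as np

def classical_mds(X, n_components=2):
    """Classical (Torgerson) MDS: embed points from their
    pairwise Euclidean distances via double centering."""
    # Squared pairwise distance matrix.
    D2 = ((X[:, None] - X[None]) ** 2).sum(axis=-1)
    n = len(X)
    # Double-center: B = -1/2 * J * D2 * J, with J the centering matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J
    # Top eigenpairs of B give the low-dimensional configuration.
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:n_components]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```

For genuinely Euclidean input distances, this embedding reproduces the original pairwise distances exactly (up to rotation), which is why classical MDS preserves, and can visually sharpen, cluster separation.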

Key Findings

  • Effect of Cluster Separation: The authors emphasize that large effect sizes or the cumulative effect of numerous smaller differences across features primarily drive clustering outcomes. Conversely, differences in covariance structure generally do not impact results. They demonstrate that sufficient statistical power can be achieved with relatively modest sample sizes (N=20 per subgroup), provided the separation between clusters is considerable (e.g., effect size Δ=4).
  • Dimensionality Reduction Techniques: Multidimensional scaling (MDS) and uniform manifold approximation and projection (UMAP) were evaluated for their impact on cluster separation. MDS tends to enhance cluster separation, whereas UMAP can reduce separation at lower effect sizes but significantly increase it when initial separation is already large.
  • Algorithm Comparison: Direct comparisons of discrete (e.g., k-means) and fuzzy (e.g., c-means) clustering show that fuzzy clustering methods can offer more parsimonious and powerful alternatives, particularly in scenarios with slightly lower centroid separation (e.g., Δ=3). Fuzzy clustering achieved adequate power at lower separation thresholds, highlighting its potential advantage in multivariate normal distributions.
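The power-through-simulation logic behind these findings can be sketched in a few lines: generate multivariate normal subgroups whose centroids sit Δ standard deviations apart, cluster each simulated dataset, and count the fraction of runs in which the subgroups are recovered well. The code below is an illustrative sketch with our own helper names and thresholds (a hand-rolled k-means and classification accuracy as the success criterion), not the paper's actual pipeline:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's k-means returning hard cluster labels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def simulate_power(n_per_group=20, n_features=2, delta=4.0,
                   n_sims=200, acc_threshold=0.9, seed=1):
    """Fraction of simulated two-subgroup datasets in which k-means
    recovers the true labels with accuracy >= acc_threshold."""
    rng = np.random.default_rng(seed)
    hits = 0
    for s in range(n_sims):
        mu1 = np.zeros(n_features)
        mu2 = mu1.copy()
        mu2[0] = delta  # centroids delta SDs apart on one feature
        X = np.vstack([rng.normal(mu1, 1.0, (n_per_group, n_features)),
                       rng.normal(mu2, 1.0, (n_per_group, n_features))])
        truth = np.repeat([0, 1], n_per_group)
        labels = kmeans(X, 2, seed=s)
        # Labels are arbitrary, so score the better of both assignments.
        acc = max((labels == truth).mean(), (labels != truth).mean())
        hits += acc >= acc_threshold
    return hits / n_sims
```

Running this with Δ=4 and N=20 per subgroup yields high power, while Δ=1 does not, mirroring the paper's central finding that separation, not sample size alone, drives clustering success.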

Practical and Theoretical Implications

The findings have several practical implications for researchers employing cluster analyses, particularly in biomedical contexts:

  1. Application Conditions: Researchers should only consider using cluster analysis when a large separation between subgroups is anticipated. Otherwise, the efficacy of identifying true subcluster structures diminishes.
  2. Sample Size Recommendations: For satisfactory statistical power and cluster detection, sample sizes of 20-30 observations per anticipated subgroup are recommended unless justified differently by specific data characteristics.
  3. Preferred Methods: The use of fuzzy clustering or mixture modeling approaches is recommended, particularly when dealing with datasets where subgroups may exhibit partial overlap.
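Recommendation 3 favours fuzzy clustering, in which each observation receives a membership weight for every cluster rather than a single hard label, so partial overlap between subgroups is represented directly. Below is a minimal fuzzy c-means sketch (the standard algorithm, but our own implementation for illustration; the fuzzifier m=2 is a conventional default, not a value taken from the paper):

```python
import numpy as np

def cmeans(X, c, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means. Returns (U, centers) where U is the
    n-by-c membership matrix (rows sum to 1)."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))  # random memberships
    for _ in range(iters):
        W = U ** m
        # Membership-weighted centroids.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-12
        # Standard FCM membership update: u_ij proportional to
        # d_ij^(-2/(m-1)), normalized across clusters.
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, centers
```

For well-separated groups, the hard assignment `U.argmax(axis=1)` matches k-means; for partially overlapping groups, the graded memberships retain information that a hard partition discards, which is one intuition for the power advantage reported at Δ=3.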

Future Directions

The paper paves the way for further exploration of non-linear dimensionality reduction impacts and the development of more robust power estimation techniques tailored to various clustering scenarios. The authors suggest that ongoing enhancements in clustering algorithms could improve their applicability to real-world data, which often presents as less ideally separated than theoretical exemplars.

Moreover, the open-sourcing of the simulation code and data by the authors offers a valuable resource for researchers seeking to replicate or extend this work, potentially employing other dimensionality reduction techniques or clustering algorithms not covered here.

Overall, this paper offers a comprehensive examination of the factors affecting cluster analysis power, providing essential guidelines for practitioners aiming to apply these methods effectively in biomedical and other research areas.