Conditions under which one cluster validity measure supersedes others

Determine the data characteristics and clustering scenarios under which specific cluster validity indices—including the multinomial-distance-based measure C^K_{MN} proposed in this paper, the Caliński–Harabasz index, the Dunn index, connectivity, k-nearest-neighbor classification error rate, the gap statistic, and the kernel-based index M_{clus}—provide superior assessment of partition quality or identification of the number of clusters when the true clustering structure of a given sample is unknown.

Background

The paper introduces a new distribution-free clustering accuracy measure C^K_{MN} based on the multinomial distribution applied to distances of cluster members from cluster representatives. It is designed to assess partition quality, estimate the number of clusters, and check for the existence of clustering (including the K=1 case).

Across extensive simulations and real-world case studies, the authors compare C^K_{MN} with several widely used indices (Caliński–Harabasz, Dunn, connectivity, k-nearest-neighbor error rate, gap statistic, and a kernel-based index). They observe that different indices can yield conflicting recommendations, particularly in challenging real-world datasets, and note the practical difficulty that ground-truth clusters are not known in such settings.
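As a rough illustration of how such recommendations can be put side by side, the sketch below scores two candidate partitions with two of the indices named above, Caliński–Harabasz and Dunn, using their standard textbook definitions. It is not the authors' code and does not implement C^K_{MN}; the toy points, the candidate partitions, and all function names are hypothetical and chosen only for this example.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def calinski_harabasz(clusters):
    """Between-cluster over within-cluster dispersion; higher is better.

    Assumes at least two clusters, more points than clusters,
    and nonzero within-cluster dispersion.
    """
    all_points = [p for c in clusters for p in c]
    n, k = len(all_points), len(clusters)
    overall = centroid(all_points)
    between = sum(len(c) * math.dist(centroid(c), overall) ** 2 for c in clusters)
    within = sum(math.dist(p, centroid(c)) ** 2 for c in clusters for p in c)
    return (between / (k - 1)) / (within / (n - k))


def dunn(clusters):
    """Smallest between-cluster distance over largest cluster diameter; higher is better.

    Assumes at least two clusters and at least one cluster with two or more points.
    """
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    max_diameter = max(
        math.dist(p, q) for c in clusters for p, q in combinations(c, 2)
    )
    return min_separation / max_diameter


# Hypothetical toy sample: three tight pairs of points along the x-axis.
left = [(0.0, 0.0), (0.0, 1.0)]
middle = [(4.0, 0.0), (4.0, 1.0)]
right = [(10.0, 0.0), (10.0, 1.0)]

# Two candidate partitions of the same sample (illustrative only).
candidates = {
    "K=2": [left + middle, right],
    "K=3": [left, middle, right],
}

# Score every candidate with both indices so the rankings can be compared.
scores = {
    name: (calinski_harabasz(clusters), dunn(clusters))
    for name, clusters in candidates.items()
}

for name, (ch, dn) in scores.items():
    print(f"{name}: Calinski-Harabasz={ch:.3f}, Dunn={dn:.3f}")
```

On a well-separated toy sample like this one the two indices would be expected to rank the candidates the same way; the disagreement the authors describe arises on harder data, where each index rewards a different notion of compactness and separation.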

In this context, the authors explicitly state that it is unknown in which situations a particular measure will outperform others, highlighting a need to characterize the conditions under which specific indices are preferable given the lack of prior knowledge about true clustering structures.

References

In what situation a particular measure would supersede the others is unknown, because nothing is known regarding the true clusters underlying the given sample.

Quality check of a sample partition using multinomial distribution (2404.07778 - Modak, 2024) in Summary, Section 3 (Case studies)