
Effect of Different Distance Measures on the Performance of K-Means Algorithm: An Experimental Study in Matlab (1405.7471v1)

Published 29 May 2014 in cs.LG

Abstract: The K-means algorithm is a very popular clustering algorithm, famous for its simplicity. The distance measure plays a very important role in the performance of this algorithm. Several distance measure techniques are available, but choosing the proper technique depends entirely on the type of data to be clustered. In this paper, an experimental study is conducted in Matlab to cluster the Iris and Wine data sets with different distance measures, observing the resulting variation in performance.

Citations (123)

Summary

An Analysis of Distance Measures in the K-Means Algorithm

The paper by Dibya Jyoti Bora and Dr. Anil Kumar Gupta provides an empirical evaluation of the impact of different distance measures on the performance of the K-Means algorithm, focusing specifically on clustering tasks within the Matlab environment. This paper is significant as it highlights the influence of distance computations on clustering outcomes, offering insights that can guide the selection of appropriate metrics based on dataset characteristics.

The K-Means algorithm, a widely used partitional clustering method, partitions a dataset into k clusters by minimizing the intra-cluster distance while maximizing the inter-cluster distance. Distance metrics are pivotal in assigning data points to the nearest centroids, and thus significantly influence the clustering results. The authors considered four distance measures: City Block (Manhattan), Euclidean, Cosine, and Correlation.
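To make the role of the metric concrete, here is a minimal sketch (not the authors' code) of K-Means with a pluggable distance measure, using scipy's cdist for the four metrics compared in the paper. All names are illustrative; note that the mean-based centroid update is exact only for the Euclidean case and is a common approximation for cosine and correlation distances.

```python
import numpy as np
from scipy.spatial.distance import cdist


def kmeans(X, k, metric="euclidean", n_iter=100, seed=0):
    """Minimal K-Means with a pluggable distance measure.

    metric can be 'euclidean', 'cityblock', 'cosine', or 'correlation',
    mirroring the four measures compared in the paper.
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        # under the chosen metric.
        d = cdist(X, centroids, metric=metric)
        labels = d.argmin(axis=1)
        # Update step: recompute each centroid as its cluster mean
        # (keep the old centroid if a cluster happens to be empty).
        # This update is exact for Euclidean distance and a heuristic
        # for the cosine and correlation metrics.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Total sum of point-to-centroid distances: the quantity the paper
    # reports as the "best total sum of distances".
    total = d[np.arange(len(X)), labels].sum()
    return labels, centroids, total
```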

The experimental setup involved two widely recognized datasets: the Iris dataset, whose samples are described by four attributes of iris plants, and the Wine dataset, consisting of 13 attributes derived from chemical analysis of wines. The number of clusters, k, was fixed at 3 for the experiments, matching the number of classes in these datasets.
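Assuming the kmeans sketch above, this setup can be reproduced in spirit with scikit-learn's bundled copies of the two datasets (a stand-in for the originals used in the paper):

```python
from sklearn.datasets import load_iris, load_wine

# k = 3 matches the number of classes in both datasets.
for name, loader in [("iris", load_iris), ("wine", load_wine)]:
    X = loader().data
    for metric in ["cityblock", "euclidean", "cosine", "correlation"]:
        labels, _, total = kmeans(X, k=3, metric=metric)
        print(f"{name:5s} {metric:12s} total distance = {total:.2f}")
```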

Results from the Iris dataset revealed divergent performance across the distance metrics. For example, the best total sum of distances, which serves as the objective function minimized by K-Means, varied across measures, with the Euclidean distance achieving the lowest value of 7897.88, indicating a compact clustering result. Meanwhile, the cosine distance incurred the highest computation time, suggesting a trade-off between clustering precision and computational efficiency. Similar trends were observed on the Wine dataset, where different measures led to different clustering efficiencies and running times, emphasizing the need for careful metric selection.

Importantly, the silhouette values, which measure how similar an object is to its own cluster compared with other clusters, suggested that the correlation distance measure provided superior interpretability of the clustered data. This outcome underscores that while the city block distance was computationally efficient, correlation metrics may offer better qualitative insight into cluster separability.
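Continuing the sketch above, mean silhouette values can be computed with scikit-learn's silhouette_score; evaluating each clustering under its own metric is one reasonable choice, though the paper itself relies on Matlab's silhouette plots:

```python
from sklearn.metrics import silhouette_score

X = load_iris().data
for metric in ["cityblock", "euclidean", "cosine", "correlation"]:
    labels, _, _ = kmeans(X, k=3, metric=metric)
    # Silhouette ranges from -1 to 1; higher values indicate
    # better-separated, more coherent clusters.
    score = silhouette_score(X, labels, metric=metric)
    print(f"{metric:12s} mean silhouette = {score:.3f}")
```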

The research implies that while traditional metrics such as Euclidean distance may yield efficient clustering in some contexts, alternative measures can enhance interpretability, and the right choice is dataset-dependent. The dimensionality and inherent structure of a dataset are critical considerations when selecting a distance measure for K-Means. The paper also acknowledges the computational constraints posed by high-dimensional data, which can be exacerbated by inappropriate metric selection, and calls for future exploration of additional distance measures and of scalability to larger datasets.

The authors suggest extending their inquiry to other partitional clustering algorithms such as K-Medoids, CLARA, and CLARANS. These proposals hint at a broader applicability of the findings and at potential advances in clustering large-scale, high-dimensional data. This systematic analysis helps researchers make informed decisions about metric selection, improving both clustering quality and computational viability in exploratory data analysis tasks.
