Distance and Similarity Measures Effect on the Performance of K-Nearest Neighbor Classifier -- A Review (1708.04321v3)

Published 14 Aug 2017 in cs.LG and cs.AI

Abstract: The K-nearest neighbor (KNN) classifier is one of the simplest and most common classifiers, yet its performance competes with the most complex classifiers in the literature. The core of this classifier depends mainly on measuring the distance or similarity between the tested examples and the training examples. This raises a major question about which distance measures to be used for the KNN classifier among a large number of distance and similarity measures available? This review attempts to answer this question through evaluating the performance (measured by accuracy, precision and recall) of the KNN using a large number of distance measures, tested on a number of real-world datasets, with and without adding different levels of noise. The experimental results show that the performance of KNN classifier depends significantly on the distance used, and the results showed large gaps between the performances of different distances. We found that a recently proposed non-convex distance performed the best when applied on most datasets comparing to the other tested distances. In addition, the performance of the KNN with this top performing distance degraded only about $20\%$ while the noise level reaches $90\%$, this is true for most of the distances used as well. This means that the KNN classifier using any of the top $10$ distances tolerate noise to a certain degree. Moreover, the results show that some distances are less affected by the added noise comparing to other distances.

Citations (349)

Summary

  • The paper demonstrates that certain distance measures, especially the Hassanat distance, consistently yield robust KNN performance even under high noise levels.
  • The study systematically compares 54 metrics across 28 datasets, evaluating classifier accuracy, precision, and recall in both noise-free and noisy environments.
  • The findings offer practical guidance by highlighting the effectiveness of L1-based metrics and advocating for adaptive selection strategies in KNN classification.

Analysis of the Impact of Distance Measures on KNN Classifier Performance

The examined paper provides a comprehensive evaluation of K-nearest neighbor (KNN) classifier performance with respect to various distance measures. The paper highlights a longstanding, yet crucial question within the field of pattern classification: which distance measure should be employed to maximize the KNN classifier's efficacy? Through a meticulous comparison of 54 distance metrics across 28 distinct datasets, this paper offers a nuanced understanding of the profound impact that choice of distance measure can have on classification outcomes.
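To make the object of study concrete, the mechanism being evaluated is a KNN classifier whose distance function is a pluggable parameter. The following minimal sketch (function and variable names are our own, not the paper's) shows how a single distance function drives the entire classification decision:

```python
from collections import Counter

def knn_predict(train_X, train_y, query, k, dist):
    """Classify `query` by majority vote among its k nearest training points,
    where 'nearest' is defined entirely by the supplied distance function."""
    neighbors = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], query))[:k]
    votes = Counter(train_y[i] for i in neighbors)
    return votes.most_common(1)[0][0]

# Euclidean distance as a baseline metric; the paper swaps in 54 alternatives here.
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

X = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
y = ["low", "low", "high", "high"]
print(knn_predict(X, y, (0.2, 0.1), k=3, dist=euclidean))  # → low
```

Because `dist` is the only free parameter besides `k`, the paper's question reduces to: which of the 54 candidate functions passed in this slot yields the best accuracy, precision, and recall?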

Evaluation Approach

The authors conducted experiments on both noise-free and noisy datasets, systematically injecting noise at levels ranging from 10% to 90%. They measured classifier performance using accuracy, precision, and recall, then compared results across distance measures to establish which ones performed best under both clean and corrupted data conditions.
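One common way to inject noise at a controlled level is to flip a fixed fraction of training labels to a different class; the sketch below illustrates this style of corruption (the paper's exact noise procedure may differ, and the helper name is our own):

```python
import random

def corrupt_labels(y, noise_level, labels, rng):
    """Flip a fraction `noise_level` of labels to a different random class."""
    y = list(y)
    n_noisy = int(round(noise_level * len(y)))
    for i in rng.sample(range(len(y)), n_noisy):
        y[i] = rng.choice([c for c in labels if c != y[i]])
    return y

rng = random.Random(0)  # fixed seed for a reproducible demonstration
y = ["a"] * 10
noisy = corrupt_labels(y, 0.3, ["a", "b"], rng)
print(sum(a != b for a, b in zip(noisy, y)))  # → 3 labels flipped at 30% noise
```

Repeating the evaluation at each noise level (10%, 20%, …, 90%) and recording accuracy, precision, and recall per distance measure reproduces the shape of the paper's experimental grid.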

Observations on Distance Measures

The findings indicate that no single distance measure universally dominates across all datasets, consistent with the no-free-lunch theorem in optimization. However, certain distance measures, such as the non-convex Hassanat distance, consistently achieved high performance across datasets. This measure also proved robust to noise, retaining much of its classification accuracy even as noise levels approached 90%. Notably, the Hassanat distance outperformed the alternatives both on clean data and on heavily corrupted datasets; because its per-dimension contribution is bounded, no single outlying feature can dominate the overall distance.
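The Hassanat distance sums a bounded per-dimension term; the formula sketched below follows the commonly cited definition from the authors' earlier work (readers should confirm against the original paper before relying on it):

```python
def hassanat(a, b):
    """Hassanat distance: each dimension contributes a value in [0, 1),
    which bounds the influence any single (possibly outlying) feature can have."""
    total = 0.0
    for x, y in zip(a, b):
        lo, hi = min(x, y), max(x, y)
        if lo >= 0:
            total += 1.0 - (1.0 + lo) / (1.0 + hi)
        else:
            # Shift both values by |lo| so numerator and denominator stay positive
            total += 1.0 - (1.0 + lo + abs(lo)) / (1.0 + hi + abs(lo))
    return total

print(hassanat((1.0, 2.0), (1.0, 2.0)))           # → 0.0 for identical points
print(round(hassanat((0.0,), (1000.0,)), 4))      # → 0.999, capped below 1 per dimension
```

The second example shows the boundedness property: even an extreme feature difference of 1000 contributes less than 1 to the total, which is the mechanism credited for the distance's noise tolerance.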

Additional observations revealed that the Manhattan, Canberra, and Lorentzian distances also often yielded competitive classifier performance, albeit slightly less consistently than Hassanat across noise levels. The results suggest that distances from the $L_1$ family tend to perform better in noisy settings, an insight that aligns with previously recognized advantages of $L_1$ measures in high-dimensional spaces or when datasets contain significant noise.
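These three $L_1$-family measures share the trait of summing per-dimension absolute differences, either raw, normalized, or log-damped. A minimal sketch of their standard definitions:

```python
import math

def manhattan(a, b):
    # Raw sum of absolute differences (the L1 norm of a - b)
    return sum(abs(x - y) for x, y in zip(a, b))

def canberra(a, b):
    # Each term normalized by the magnitudes of its operands; 0/0 contributes 0
    return sum(abs(x - y) / (abs(x) + abs(y)) if (x or y) else 0.0
               for x, y in zip(a, b))

def lorentzian(a, b):
    # Log damping compresses large per-dimension differences
    return sum(math.log(1.0 + abs(x - y)) for x, y in zip(a, b))

p, q = (1.0, 2.0, 3.0), (2.0, 2.0, 5.0)
print(manhattan(p, q))             # → 3.0
print(round(canberra(p, q), 4))    # → 0.5833  (1/3 + 0 + 2/8)
print(round(lorentzian(p, q), 4))  # → 1.7918  (ln 2 + ln 1 + ln 3)
```

Canberra's normalization and Lorentzian's logarithm both temper large coordinate-wise differences, which is plausibly why these measures degrade more gracefully under noise than unbounded alternatives such as Euclidean distance.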

Implications and Future Perspectives

The paper's implications extend to both theoretical and practical realms. Theoretically, it underscores the intricate dependence between distance measures and learning algorithms' efficacy. Practically, it gives practitioners substantial guidance on selecting suitable distance measures for KNN tasks, specifically advocating for the Hassanat distance in scenarios where robustness against noise is requisite. The extensive experimental setup also provides a cogent foundation for further research into developing adaptive mechanisms within learning algorithms that dynamically adjust to optimal distance measures based on real-time data characteristics.
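One simple realization of the adaptive-selection idea mentioned above is to choose, per dataset, the distance with the best cross-validated accuracy. The sketch below uses leave-one-out validation with a plain KNN (all names are illustrative, not from the paper):

```python
from collections import Counter

def loo_accuracy(X, y, k, dist):
    """Leave-one-out accuracy of a KNN classifier under a given distance."""
    hits = 0
    for i in range(len(X)):
        rest = [(xj, yj) for j, (xj, yj) in enumerate(zip(X, y)) if j != i]
        nbrs = sorted(rest, key=lambda p: dist(p[0], X[i]))[:k]
        pred = Counter(label for _, label in nbrs).most_common(1)[0][0]
        hits += pred == y[i]
    return hits / len(X)

def select_distance(X, y, k, candidates):
    """Return the name of the candidate distance with the best LOO accuracy."""
    return max(candidates, key=lambda name: loo_accuracy(X, y, k, candidates[name]))

euclidean = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))

X = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (1.1, 0.9)]
y = ["a", "a", "b", "b"]
best = select_distance(X, y, k=1,
                       candidates={"euclidean": euclidean, "manhattan": manhattan})
```

Scaling this idea to 54 candidate distances is straightforward but computationally heavy, which is why the dynamic, data-driven selection mechanisms the paper envisions remain an open research direction.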

The depth of empirical investigation in this work opens avenues for future exploration. First, incorporating distance measures not covered here could reveal further insights. Second, experiments with KNN variants or alternative machine learning models could test how well the findings generalize. Finally, investigating the interaction of distance measures with feature or instance selection methods might improve classifier robustness or reduce computational cost, further enhancing the applicability of KNN in big data contexts.

In conclusion, this scholarly effort elevates the discourse on KNN classifiers by meticulously scrutinizing the role of distance measures, ultimately providing both a reference architecture for evaluating classifier efficiency and practical takeaways for improving machine learning systems' robustness and accuracy.