
A New Local Distance-Based Outlier Detection Approach for Scattered Real-World Data (0903.3257v1)

Published 18 Mar 2009 in cs.LG and cs.IR

Abstract: Detecting outliers which are grossly different from or inconsistent with the remaining dataset is a major challenge in real-world KDD applications. Existing outlier detection methods are ineffective on scattered real-world datasets due to implicit data patterns and parameter setting issues. We define a novel "Local Distance-based Outlier Factor" (LDOF) to measure the outlier-ness of objects in scattered datasets which addresses these issues. LDOF uses the relative location of an object to its neighbours to determine the degree to which the object deviates from its neighbourhood. Properties of LDOF are theoretically analysed including LDOF's lower bound and its false-detection probability, as well as parameter settings. In order to facilitate parameter settings in real-world applications, we employ a top-n technique in our outlier detection approach, where only the objects with the highest LDOF values are regarded as outliers. Compared to conventional approaches (such as top-n KNN and top-n LOF), our method top-n LDOF is more effective at detecting outliers in scattered data. It is also easier to set parameters, since its performance is relatively stable over a large range of parameter values, as illustrated by experimental results on both real-world and synthetic datasets.

Citations (389)

Summary

  • The paper introduces the Local Distance-based Outlier Factor (LDOF) that quantifies a point’s deviation using k-nearest neighbor distances to enhance outlier detection in scattered data.
  • It provides a theoretical framework analyzing LDOF's lower bound and demonstrates an exponential reduction in false detection probabilities with increasing neighborhood size.
  • Experimental results confirm LDOF's superior precision on both synthetic and real-world datasets compared to conventional k-NN and LOF methods.

A New Local Distance-Based Approach for Outlier Detection in Scattered Data

The paper by Zhang et al. presents a novel approach for outlier detection in data mining, addressing common challenges associated with detecting anomalies in scattered real-world datasets. Traditional methods often falter when confronted with such datasets due to ambiguous data patterns and the difficulty of setting parameters appropriately. This work introduces the Local Distance-based Outlier Factor (LDOF) to quantify the degree to which an object can be considered an outlier when compared to its neighbors.

Summary of Contributions

The paper identifies two main challenges with existing outlier detection methods: 1) the scattered distribution of real-world data, which resembles loosely bound mini-clusters rather than distinct clusters, leading to high false-detection rates; and 2) the practical difficulty in setting algorithm parameters without predefined datasets. Zhang et al. circumvent these issues by devising the LDOF, which evaluates the relative location of data points in relation to their neighbors. The theoretical insights presented include the analysis of LDOF's properties, encompassing its lower bound, false-detection probability, and recommendations for suitable parameter settings.

Methodology

  • Local Distance-based Outlier Factor (LDOF): This factor is computed from a point's local k-nearest neighbours (k-NN) as the ratio of the point's average distance to those neighbours to the average pairwise distance among the neighbours themselves. This metric outperforms conventional top-n methods such as those based on k-distance (KNN) and the Local Outlier Factor (LOF).
  • Theoretical Insights: The authors establish a lower bound for LDOF, providing guidance on its expected value under the assumption of a continuous data distribution. They further show that the false-detection probability decreases exponentially with increasing k, which guides the choice of neighbourhood size.
  • Top-n Framework: LDOF is applied in a top-n setting, in which only the n objects with the highest LDOF values are flagged as outliers, making the process interactive and manageable for domain experts.
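The computation described above can be sketched in a few lines. The snippet below is an illustrative brute-force implementation, not the authors' code: it assumes LDOF(x) is the mean distance from x to its k nearest neighbours divided by the mean pairwise distance among those neighbours, and flags the top-n scorers as outliers.

```python
import numpy as np

def ldof_scores(X, k):
    """LDOF(x) = (avg distance from x to its k-NN)
               / (avg pairwise distance among those k neighbours).
    Brute-force O(n^2) distances; suitable only for small datasets."""
    n = len(X)
    # Full pairwise Euclidean distance matrix.
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    scores = np.empty(n)
    for i in range(n):
        # k nearest neighbours, skipping the point itself (distance 0).
        nn = np.argsort(dist[i])[1:k + 1]
        d_knn = dist[i, nn].mean()            # avg distance to neighbours
        inner = dist[np.ix_(nn, nn)]          # distances within the neighbourhood
        D_knn = inner.sum() / (k * (k - 1))   # avg off-diagonal pairwise distance
        scores[i] = d_knn / D_knn
    return scores

def top_n_ldof(X, k, n_out):
    """Indices of the n_out points with the highest LDOF values."""
    return np.argsort(ldof_scores(X, k))[::-1][:n_out]
```

A point well inside a mini-cluster has d_knn comparable to D_knn (LDOF near 1), while a point sitting outside its neighbourhood has d_knn much larger, which is what makes the ratio robust on scattered data.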

Implications and Experimental Results

The experimental validation demonstrates LDOF's superior performance over k-NN and LOF across different datasets, particularly those with scattered distributions. Key benchmarks include:

  • Synthetic 2-D Data: LDOF achieves 100% precision over a broad range of k values. In contrast, KNN and LOF struggle, as mini-clusters adversely influence their outlier assessments.
  • Real-world Datasets (WDBC and Shuttle): On these datasets, LDOF sustains higher detection accuracy as dimensional attributes vary and the number of real outliers increases. Statistical significance tests further confirm that LDOF's precision surpasses that of the traditional methods.

Future Directions

By offering a robust framework for detecting outliers in scattered data, the authors pave the way for future research to enhance LDOF's applicability and accuracy in diverse real-world datasets. Potential developments could involve extending the method to handle more complex data structures or integrating LDOF into broader data processing pipelines for automated anomaly detection.

The research elucidated in Zhang et al.'s paper constitutes a precise and effective approach to outlier detection, providing an insightful contribution to the field of data mining and knowledge discovery.