
ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions (2201.00382v3)

Published 2 Jan 2022 in cs.LG, cs.DB, stat.AP, and stat.ML

Abstract: Outlier detection refers to the identification of data points that deviate from a general data distribution. Existing unsupervised approaches often suffer from high computational cost, complex hyperparameter tuning, and limited interpretability, especially when working with large, high-dimensional datasets. To address these issues, we present a simple yet effective algorithm called ECOD (Empirical-Cumulative-distribution-based Outlier Detection), which is inspired by the fact that outliers are often the "rare events" that appear in the tails of a distribution. In a nutshell, ECOD first estimates the underlying distribution of the input data in a nonparametric fashion by computing the empirical cumulative distribution per dimension of the data. ECOD then uses these empirical distributions to estimate tail probabilities per dimension for each data point. Finally, ECOD computes an outlier score of each data point by aggregating estimated tail probabilities across dimensions. Our contributions are as follows: (1) we propose a novel outlier detection method called ECOD, which is both parameter-free and easy to interpret; (2) we perform extensive experiments on 30 benchmark datasets, where we find that ECOD outperforms 11 state-of-the-art baselines in terms of accuracy, efficiency, and scalability; and (3) we release an easy-to-use and scalable (with distributed support) Python implementation for accessibility and reproducibility.

Citations (220)

Summary

  • The paper introduces ECOD, a novel method that uses empirical cumulative distribution functions for unsupervised outlier detection, achieving superior accuracy and scalability.
  • It employs a parameter-free, interpretable, and linear complexity strategy to compute tail probabilities across dimensions for robust outlier scoring.
  • Experimental results on 30 benchmarks show ECOD outperforms 11 state-of-the-art methods with higher ROC and average precision metrics.

Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions

This paper presents a novel approach to unsupervised outlier detection utilizing empirical cumulative distribution functions (ECDFs). Outlier detection is critical in numerous fields such as fraud detection, network intrusion, and social media analysis, where anomalous data can significantly skew analysis results. Traditional unsupervised methods, although prevalent, are often limited due to high computational costs, intricate hyperparameter tuning, and poor interpretability when applied to large, high-dimensional datasets.

The proposed method, ECOD (Empirical-Cumulative-distribution-based Outlier Detection), leverages ECDFs to detect outliers, treating them as rare events that fall in the tails of the distribution. The algorithm estimates the data distribution nonparametrically by computing the empirical cumulative distribution function of each dimension, uses these to estimate left- and right-tail probabilities for each data point in each dimension, and aggregates the tail probabilities across dimensions into a single outlier score per point.
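The per-dimension scheme can be sketched in a few lines of NumPy. This is an illustrative simplification, not the authors' released code: it estimates each point's left-tail probability from the empirical CDF and its right-tail probability from the survival function, sums negative log tail probabilities across dimensions, and takes the larger of the two aggregates (the published method additionally includes a skewness-corrected variant, omitted here for brevity).

```python
import numpy as np

def ecod_scores(X):
    """Simplified ECOD-style outlier scores (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    # Rank of each value within its column (0 = smallest).
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    left = (ranks + 1) / n    # ECDF estimate: P(value <= x), never zero
    right = (n - ranks) / n   # survival estimate: P(value >= x), never zero
    # Aggregate negative log tail probabilities across dimensions.
    o_left = -np.log(left).sum(axis=1)
    o_right = -np.log(right).sum(axis=1)
    # Score each point by its more extreme tail aggregate.
    return np.maximum(o_left, o_right)
```

On a toy dataset of points along a diagonal, the central point receives the lowest score and the two extremes the highest, matching the intuition that tail points are the likely outliers.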

ECOD is parameter-free and interpretable. It requires no hyperparameter tuning, which simplifies its application across datasets. Its interpretability follows from the per-dimension decomposition of the score: practitioners can see which dimensions contribute most to a point's outlier score and focus data-quality efforts on those dimensions.

In rigorous experiments across 30 benchmark datasets, ECOD outperformed 11 state-of-the-art outlier detection algorithms in accuracy, scalability, and efficiency. Specifically, ECOD consistently achieved higher area under the receiver operating characteristic (ROC) curve and average precision scores compared to other methods. For example, in terms of ROC, ECOD scored 0.825 on average, surpassing the next best performer by 2%. This indicates robust performance even when benchmarked against algorithms like Isolation Forest, renowned for large-scale applications.

ECOD's time complexity is linear, O(nd), where n is the number of data points and d the number of dimensions. This scaling makes it well suited to large, high-dimensional datasets: the authors report processing one million data points with 10,000 dimensions on a standard laptop within two hours.

Despite its strong performance, ECOD's accuracy may degrade when outliers are indistinguishable from normal points in every individual dimension, or when they lie near the center of the distribution rather than in any marginal tail. This limitation suggests that future research could incorporate feature interdependencies to improve detection in such datasets.
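The central-outlier failure mode is easy to reproduce with a hypothetical example (my construction, not from the paper): points clustered near the four corners (+/-3, +/-3), with one anomalous point at the origin. The origin sits near the middle of each marginal ECDF, so a rank-based tail score cannot flag it.

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 inliers near the four corners (+/-3, +/-3); one anomaly at the origin.
signs = rng.choice([-1.0, 1.0], size=(200, 2))
X = np.vstack([signs * 3 + rng.normal(scale=0.1, size=(200, 2)),
               [[0.0, 0.0]]])          # index 200: the central anomaly

# ECOD-style score: smaller marginal tail probability per dimension,
# negative logs summed across dimensions.
n = X.shape[0]
ranks = np.argsort(np.argsort(X, axis=0), axis=0)
tail = np.minimum((ranks + 1) / n, (n - ranks) / n)
scores = (-np.log(tail)).sum(axis=1)
# The anomaly's marginal ranks are near n/2, so its score is among the lowest.
```

Roughly half the points fall below the origin in each dimension, so its smaller tail probability is close to 0.5 per dimension, the least extreme value possible, and it scores below the typical inlier.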

ECOD's future potential lies in refinements that address these limitations and broaden its applicability, particularly by integrating it with methods that capture feature dependencies. The paper's contributions add a simple, fast, and interpretable tool to the unsupervised outlier detection toolkit, offering both theoretical and practical advances for detecting anomalies in large-scale data.