Dimensionality-Aware Outlier Detection: Theoretical and Experimental Analysis (2401.05453v2)
Abstract: We present a nonparametric method for outlier detection that takes full account of local variations in intrinsic dimensionality within the dataset. Using the theory of Local Intrinsic Dimensionality (LID), our 'dimensionality-aware' outlier detection method, DAO, is derived as an estimator of an asymptotic local expected density ratio involving the query point and a close neighbor drawn at random. The dimensionality-aware behavior of DAO is due to its use of local estimation of LID values in a theoretically justified way. Through comprehensive experimentation on more than 800 synthetic and real datasets, we show that DAO significantly outperforms three popular and important benchmark outlier detection methods: Local Outlier Factor (LOF), Simplified LOF, and kNN.
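To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It computes a per-point LID estimate using the classical Hill-type maximum-likelihood estimator on k-nearest-neighbor distances (assumed here for illustration; the abstract does not say which LID estimator DAO uses) and the kNN baseline score, taken here to be the distance to the k-th nearest neighbor. The function names `knn_distances`, `lid_mle`, and `knn_score` are hypothetical, and the DAO score itself is deliberately not reproduced.

```python
import numpy as np

def knn_distances(X, k):
    """Sorted Euclidean distances from each row of X to its k nearest
    neighbors, excluding the point itself. Brute force, O(n^2) memory."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(D, np.inf)          # never pick a point as its own neighbor
    return np.sort(D, axis=1)[:, :k]

def lid_mle(knn_d):
    """Hill-type MLE of local intrinsic dimensionality (assumed estimator):
    LID(x) ~ -1 / mean_i( ln(r_i / r_k) ), with r_k the k-th neighbor distance.
    Assumes no duplicate points (all r_i > 0) and r_1 < r_k."""
    r_k = knn_d[:, -1:]
    return -1.0 / np.log(knn_d / r_k).mean(axis=1)

def knn_score(knn_d):
    """kNN outlier score: distance to the k-th nearest neighbor."""
    return knn_d[:, -1]
```

On data lying on a 2-dimensional plane embedded in a higher-dimensional space, `lid_mle` should return values near 2 for most points regardless of the ambient dimension, which is the kind of local information a dimensionality-aware score can draw on.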