Estimating the Contamination Factor's Distribution in Unsupervised Anomaly Detection (2210.10487v2)
Abstract: Anomaly detection methods identify examples that do not follow the expected behaviour, typically in an unsupervised fashion, by assigning real-valued anomaly scores to the examples based on various heuristics. These scores must be transformed into actual predictions by thresholding, so that the proportion of examples marked as anomalies equals the expected proportion of anomalies, called the contamination factor. Unfortunately, there are no good methods for estimating the contamination factor itself. We address this need from a Bayesian perspective, introducing a method for estimating the posterior distribution of the contamination factor of a given unlabeled dataset. We leverage the outputs of several anomaly detectors as a representation that already captures the basic notion of anomalousness, and we estimate the contamination using a specific mixture formulation. Empirically, on 22 datasets, we show that the estimated distribution is well-calibrated and that setting the threshold using the posterior mean improves the anomaly detectors' performance over several alternative methods. All code is publicly available for full reproducibility.
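The thresholding step described in the abstract can be illustrated with a minimal sketch. It assumes the contamination factor is already available as a set of posterior samples; the function name `threshold_scores`, the synthetic scores, and the Beta-distributed stand-in for the posterior are all hypothetical and are not taken from the paper's code. The paper's mixture formulation for obtaining the posterior is not reproduced here.

```python
import numpy as np


def threshold_scores(scores, contamination):
    """Flag roughly the top `contamination` fraction of scores as anomalies.

    Hypothetical helper: higher scores are assumed to mean "more anomalous".
    """
    scores = np.asarray(scores, dtype=float)
    # Score value below which a (1 - contamination) fraction of examples fall.
    threshold = np.quantile(scores, 1.0 - contamination)
    return scores > threshold, threshold


rng = np.random.default_rng(0)

# Synthetic anomaly scores: 950 "normal" examples and 50 "anomalous" ones.
scores = np.concatenate([rng.normal(0.0, 1.0, 950), rng.normal(6.0, 1.0, 50)])

# Stand-in for posterior samples of the contamination factor (assumption:
# a Beta(5, 95) draw is used purely as a placeholder for a real posterior).
gamma_samples = rng.beta(5, 95, size=1000)

# Point estimate used for thresholding: the posterior mean.
gamma_hat = gamma_samples.mean()

labels, threshold = threshold_scores(scores, gamma_hat)
print(f"posterior mean contamination: {gamma_hat:.3f}")
print(f"flagged proportion:           {labels.mean():.3f}")
```

Using a quantile of the scores keeps the flagged proportion close to the chosen contamination value regardless of the score scale, which is why the same step applies to any detector that outputs real-valued scores.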