Unsupervised Outlier Detection using Random Subspace and Subsampling Ensembles of Dirichlet Process Mixtures (2401.00773v3)
Abstract: Probabilistic mixture models are recognized as effective tools for unsupervised outlier detection owing to their interpretability and global characteristics. Among these, Dirichlet process mixture models stand out as a strong alternative to conventional finite mixture models for both clustering and outlier detection tasks. Unlike finite mixture models, Dirichlet process mixtures are infinite mixture models that automatically determine the number of mixture components based on the data. Despite these advantages, the adoption of Dirichlet process mixture models for unsupervised outlier detection has been limited by their computational inefficiency and by the sensitivity of the resulting outlier detectors to the outliers present in the data they are built from. Additionally, Dirichlet process Gaussian mixtures struggle to model non-Gaussian data with discrete or binary features. To address these challenges, we propose a novel outlier detection method that uses ensembles of Dirichlet process Gaussian mixtures. This unsupervised algorithm employs random subspace and subsampling ensembles to keep computation efficient and to improve the robustness of the outlier detector. The ensemble approach also makes the proposed method better suited to detecting outliers in non-Gaussian data. Furthermore, our method uses variational inference for Dirichlet process mixtures, which keeps computation fast. Empirical analyses using benchmark datasets demonstrate that our method outperforms existing approaches in unsupervised outlier detection.
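The ensemble idea described in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes scikit-learn's `BayesianGaussianMixture` with a truncated Dirichlet process prior as the variationally fitted component model. The function name `dpgmm_ensemble_scores`, the subspace-size rule, the subsample size, and the scoring rule (standardized negative log-likelihood averaged over members) are all hypothetical choices made for illustration; the abstract does not specify how the paper scores or combines members.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture


def dpgmm_ensemble_scores(X, n_members=10, subsample_size=64,
                          max_components=5, seed=0):
    """Outlier scores from an ensemble of DP Gaussian mixtures.

    Each member is fit on a random subsample of rows restricted to a random
    subset of columns; every row is then scored in that member's subspace.
    Higher scores mean "more outlying".
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.zeros(n)
    for _ in range(n_members):
        # Random subspace: keep between ceil(d/2) and d features
        # (an assumed rule, not taken from the paper).
        k = int(rng.integers(low=max(1, (d + 1) // 2), high=d + 1))
        cols = rng.choice(d, size=k, replace=False)
        # Subsampling: fit on a small random subset of rows, without replacement.
        rows = rng.choice(n, size=min(subsample_size, n), replace=False)
        model = BayesianGaussianMixture(
            n_components=max_components,  # truncation level of the DP
            weight_concentration_prior_type="dirichlet_process",
            covariance_type="full",
            max_iter=200,
            random_state=int(rng.integers(1 << 31)),
        )
        model.fit(X[np.ix_(rows, cols)])
        # score_samples gives per-point log-likelihood; negate it so that
        # low-density points receive high scores.
        nll = -model.score_samples(X[:, cols])
        # Standardize per member so subspaces of different dimension are
        # comparable before averaging (an assumed combination rule).
        scores += (nll - nll.mean()) / nll.std()
    return scores / n_members
```

Fitting each member on a small subsample keeps the cost of each variational fit low and reduces the chance that outliers in the training data shape the fitted mixture; the random subspaces give each member a different view of the features.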