SDC-HSDD-NDSA: Structure Detecting Cluster by Hierarchical Secondary Directed Differential with Normalized Density and Self-Adaption (2307.00677v5)
Abstract: Density-based clustering is the most popular clustering algorithm since it can identify clusters of arbitrary shape as long as they are separated by low-density regions. However, a high-density region that is not separated by low-density ones might also contain different structures belonging to multiple clusters. As far as we know, all previous density-based clustering algorithms fail to detect such structures. In this paper, we provide a novel density-based clustering scheme to address this problem. It is the first clustering algorithm that can detect meticulous structures in a high-density region that is not separated by low-density ones, and it thus extends the range of applications of clustering. The algorithm, called Structure Detecting Cluster by Hierarchical Secondary Directed Differential with Normalized Density and Self-Adaption and dubbed SDC-HSDD-NDSA, employs the secondary directed differential, hierarchy, normalized density, and a self-adaption coefficient. Experiments on synthetic and real datasets verify the effectiveness, robustness, and granularity independence of the algorithm, and the scheme is compared to the unsupervised schemes in the Python package Scikit-learn. Results demonstrate that our algorithm outperforms previous ones in many situations, most markedly when clusters have regular internal structures. For example, averaged over the eight noiseless synthetic datasets with structures, previous algorithms obtain ARI and NMI scores below 0.6 and 0.7, while the presented algorithm obtains scores higher than 0.9 and 0.95, respectively.
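To make the ARI criterion mentioned in the abstract concrete, the following is a minimal pure-Python sketch of the adjusted Rand index. It is not the authors' evaluation code; the function name `adjusted_rand_index` and the toy labelings are assumptions introduced here for illustration only. The toy case mimics the situation the abstract describes: two structures lying in one high-density region that a clustering merges into a single cluster.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two flat labelings of the same points."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treat as perfect agreement

    # Contingency counts: number of points sharing each (true, predicted) pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of point pairs placed together in both, in truth, in prediction.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)  # chance-level agreement
    maximum = 0.5 * (sum_true + sum_pred)        # best achievable agreement
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (sum_pairs - expected) / (maximum - expected)


# Hypothetical toy labels: two ground-truth structures of three points each.
truth = [0, 0, 0, 1, 1, 1]
merged = [0, 0, 0, 0, 0, 0]     # both structures reported as one cluster
separated = [1, 1, 1, 0, 0, 0]  # structures recovered, label names swapped

score_merged = adjusted_rand_index(truth, merged)
score_separated = adjusted_rand_index(truth, separated)
```

Under these assumptions, merging the two structures yields a chance-level score, while recovering them yields the maximum score regardless of how the cluster labels are named, which is why ARI is sensitive to exactly the failure mode the abstract attributes to earlier density-based methods.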