Hashing-Based Distributed Clustering for Massive High-Dimensional Data (2306.17417v1)
Abstract: Clustering analysis is of substantial significance for data mining. The properties of big data raise demand for more efficient and economical distributed clustering methods. However, existing distributed clustering methods mainly focus on the size of the data while ignoring problems caused by its dimensionality. To address this, we propose a new distributed algorithm, referred to as Hashing-Based Distributed Clustering (HBDC). Motivated by the outstanding performance of hashing methods for nearest-neighbor search, this algorithm applies the learning-to-hash technique to the clustering problem, which offers substantial advantages for data storage, transmission, and computation. Following a global-sub-site paradigm, HBDC consists of distributed training of a hashing network at the sub-sites and spectral clustering of the hash codes at the global site. The sub-sites use the learnable network as a hash function to convert massive high-dimensional (HD) original data into a small number of hash codes, which they send to the global site for final clustering. In addition, a sample-selection method and lightweight network structures are designed to accelerate the convergence of the hash network. We also analyze the transmission cost of HBDC and derive an upper bound on it. Our experiments on synthetic and real datasets illustrate the superiority of HBDC compared with existing state-of-the-art algorithms.
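The pipeline the abstract describes can be sketched in miniature: each sub-site compresses its high-dimensional points into short binary codes with a hash function and ships only the codes; the global site then runs spectral clustering on the (far fewer) unique codes. The sketch below is an illustrative stand-in, not the paper's method: it substitutes random-hyperplane (SimHash-style) projections for the learned hashing network, and uses a plain Hamming-similarity graph with unnormalized spectral clustering.

```python
import numpy as np

def hash_codes(X, n_bits, rng):
    # Sub-site step: compress HD points into short binary codes.
    # Random hyperplanes stand in for the paper's learned hash network.
    W = rng.standard_normal((X.shape[1], n_bits))
    return (X @ W > 0).astype(np.uint8)

def spectral_cluster(codes, k):
    # Global-site step: spectral clustering over the *unique* hash
    # codes, connected by a Hamming-similarity graph.
    H = np.unique(codes, axis=0)                      # deduplicated codes
    d = (H[:, None, :] != H[None, :, :]).sum(-1)      # pairwise Hamming
    A = np.exp(-d.astype(float))                      # similarity matrix
    L = np.diag(A.sum(1)) - A                         # graph Laplacian
    emb = np.linalg.eigh(L)[1][:, :k]                 # k smallest eigenvectors
    cent = [emb[0]]                                   # farthest-point init
    for _ in range(1, k):
        dists = np.min([((emb - c) ** 2).sum(1) for c in cent], 0)
        cent.append(emb[np.argmax(dists)])
    cent = np.array(cent)
    for _ in range(20):                               # plain k-means rounds
        lab = np.argmin(((emb[:, None] - cent[None]) ** 2).sum(-1), 1)
        for j in range(k):
            if (lab == j).any():
                cent[j] = emb[lab == j].mean(0)
    return H, lab

# Toy run: two well-separated 50-D Gaussian blobs, 100 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+3, 0.2, (100, 50)),
               rng.normal(-3, 0.2, (100, 50))])
codes = hash_codes(X, n_bits=16, rng=rng)             # 16 bits vs 50 floats
H, lab = spectral_cluster(codes, k=2)
lut = {h.tobytes(): l for h, l in zip(H, lab)}
point_lab = np.array([lut[c.tobytes()] for c in codes])
```

Note how the communication saving arises: the global site never sees the 50-dimensional points, only the deduplicated 16-bit codes, so both transmission volume and the size of the spectral-clustering problem shrink with the number of distinct codes rather than the number of points.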