Imbalanced Data Clustering using Equilibrium K-Means
Abstract: Centroid-based clustering algorithms, such as hard K-means (HKM) and fuzzy K-means (FKM), suffer from a learning bias toward large clusters. Their centroids tend to crowd into large clusters, compromising performance when the true underlying data groups vary in size (i.e., imbalanced data). To address this, we propose a new clustering objective function based on the Boltzmann operator, which introduces a novel centroid repulsion mechanism in which the data points surrounding a centroid repel the other centroids. Larger clusters repel more strongly, effectively mitigating the bias toward large clusters. The resulting algorithm, called equilibrium K-means (EKM), is simple, alternating between two steps; resource-saving, with the same time and space complexity as FKM; and scalable to large datasets via batch learning. We extensively evaluate EKM on synthetic and real-world datasets. The results show that EKM performs competitively on balanced data and significantly outperforms benchmark algorithms on imbalanced data. Deep clustering experiments demonstrate that EKM is a better alternative to HKM and FKM on imbalanced data, as it yields more discriminative representations. Additionally, we reformulate HKM, FKM, and EKM in a general form of gradient descent and demonstrate how this general form facilitates a uniform study of K-means algorithms.
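To make the Boltzmann-operator idea concrete, here is a minimal JAX sketch of a smooth-min clustering objective, written in the gradient-descent form the abstract mentions. The Boltzmann operator with a negative parameter, boltz_{-α}(d_1, …, d_K) = Σ_k d_k e^{-α d_k} / Σ_j e^{-α d_j}, acts as a smooth minimum over the squared distances from a point to the centroids; autodiff through the softmax weights then contributes the weight-redistribution (repulsion) terms automatically. The function names (`ekm_objective`), the toy imbalanced data, and the settings for `alpha` and the learning rate are illustrative assumptions; the paper's exact two-step alternating updates and parameter choices are not reproduced here.

```python
import jax
import jax.numpy as jnp

def ekm_objective(centroids, X, alpha):
    # d[n, k]: squared Euclidean distance from point n to centroid k.
    d = jnp.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=-1)
    # Boltzmann operator with negative parameter as a smooth minimum:
    # boltz_{-alpha}(d_n) = sum_k d_nk * exp(-alpha * d_nk) / sum_j exp(-alpha * d_nj)
    w = jax.nn.softmax(-alpha * d, axis=1)
    return jnp.mean(jnp.sum(w * d, axis=1))

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
# Toy imbalanced data (assumed example): a large blob of 900 points
# and a smaller, tighter blob of 100 points.
X = jnp.concatenate([
    jax.random.normal(k1, (900, 2)),
    0.3 * jax.random.normal(k2, (100, 2)) + jnp.array([4.0, 0.0]),
])

centroids = jnp.array([[1.0, 0.0], [2.0, 0.0]])  # deliberately poor initialization
alpha, lr = 2.0, 0.1                             # illustrative settings
grad_fn = jax.jit(jax.grad(ekm_objective))       # gradient w.r.t. centroids

for _ in range(500):
    centroids = centroids - lr * grad_fn(centroids, X, alpha)

print(centroids)  # the centroids should separate toward the two blob centers
```

Under this reading, a centroid sitting inside the large blob receives near-zero softmax weight from far-away points, so the second centroid is free to migrate to the small cluster rather than being crowded into the large one, which is the qualitative behavior the abstract attributes to EKM.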