
Near-Optimal Algorithms for Constrained k-Center Clustering with Instance-level Background Knowledge

Published 23 Jan 2024 in cs.LG and cs.AI (arXiv:2401.12533v4)

Abstract: Center-based clustering has attracted significant research interest from both theory and practice. In many practical applications, input data often contain background knowledge that can be used to improve clustering results. In this work, we build on the widely adopted $k$-center clustering and model its input background knowledge as must-link (ML) and cannot-link (CL) constraint sets. However, most clustering problems including $k$-center are inherently $\mathcal{NP}$-hard, while the more complex constrained variants are known to suffer even more severe approximation and computational barriers that significantly limit their applicability. By employing a suite of techniques including reverse dominating sets, linear programming (LP) integral polyhedra, and LP duality, we arrive at the first efficient approximation algorithm for constrained $k$-center with the best possible ratio of 2. We also construct competitive baseline algorithms and empirically evaluate our approximation algorithm against them on a variety of real datasets. The results validate our theoretical findings and demonstrate the great advantages of our algorithm in terms of clustering cost, clustering quality, and running time.
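For background, the unconstrained $k$-center problem already admits a simple greedy 2-approximation (farthest-first traversal, due to Gonzalez). The sketch below illustrates that classic baseline only; it does not reproduce the paper's constrained algorithm, which additionally enforces ML/CL constraint sets via LP-based techniques. The function name and point representation are illustrative choices, not from the paper.

```python
import math

def k_center_greedy(points, k):
    """Farthest-first traversal for unconstrained k-center:
    pick an arbitrary first center, then repeatedly add the point
    farthest from all centers chosen so far. Returns the centers
    and the clustering cost (max distance to the nearest center).
    """
    centers = [points[0]]
    # d[j] = distance from points[j] to its nearest chosen center
    d = [math.dist(p, centers[0]) for p in points]
    for _ in range(k - 1):
        # next center: the point currently farthest from all centers
        i = max(range(len(points)), key=lambda j: d[j])
        centers.append(points[i])
        for j, p in enumerate(points):
            d[j] = min(d[j], math.dist(p, points[i]))
    return centers, max(d)
```

On well-separated data this recovers one center per natural group, and its cost is guaranteed to be at most twice the optimum; the paper shows the same factor-2 guarantee is achievable even when ML/CL constraints are present.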

