
Faster Algorithms for Fair Max-Min Diversification in $\mathbb{R}^d$ (2404.04713v3)

Published 6 Apr 2024 in cs.DB and cs.DS

Abstract: The task of extracting a diverse subset from a dataset, often referred to as maximum diversification, plays a pivotal role in various real-world applications that have far-reaching consequences. In this work, we delve into the realm of fairness-aware data subset selection, specifically focusing on the problem of selecting a diverse set of size $k$ from a large collection of $n$ data points (FairDiv). The FairDiv problem is well-studied in the data management and theory community. In this work, we develop the first constant approximation algorithm for FairDiv that runs in near-linear time using only linear space. In contrast, all previously known constant approximation algorithms run in super-linear time (with respect to $n$ or $k$) and use super-linear space. Our approach achieves this efficiency by employing a novel combination of the Multiplicative Weight Update method and advanced geometric data structures to implicitly and approximately solve a linear program. Furthermore, we improve the efficiency of our techniques by constructing a coreset. Using our coreset, we also propose the first efficient streaming algorithm for the FairDiv problem whose efficiency does not depend on the distribution of data points. Empirical evaluation on million-sized datasets demonstrates that our algorithm achieves the best diversity within a minute. All prior techniques are either highly inefficient or do not generate a good solution.
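To make the objective concrete, the sketch below spells out one common reading of the problem on a toy input. The abstract does not state the fairness constraint, so the sketch assumes that each point belongs to a group, that each group has a minimum quota, and that $k$ is the sum of the quotas. Diversity is taken to be the minimum pairwise Euclidean distance in the chosen set. The function names (`diversity`, `is_fair`, `brute_force_fairdiv`) are hypothetical, and the exhaustive search is an illustration only: it runs in exponential time and is not the near-linear algorithm the paper proposes.

```python
from collections import Counter
from itertools import combinations
from math import dist, inf


def diversity(points, subset):
    """Minimum pairwise Euclidean distance among the chosen indices."""
    return min(
        (dist(points[i], points[j]) for i, j in combinations(subset, 2)),
        default=inf,  # a set with fewer than two points has no pair
    )


def is_fair(groups, subset, quotas):
    """True if every group contributes at least its quota (assumed constraint)."""
    counts = Counter(groups[i] for i in subset)
    return all(counts[g] >= q for g, q in quotas.items())


def brute_force_fairdiv(points, groups, quotas):
    """Try every subset of size k = sum of quotas; keep the most diverse fair one."""
    k = sum(quotas.values())
    best, best_div = None, -1.0
    for subset in combinations(range(len(points)), k):
        if is_fair(groups, subset, quotas):
            d = diversity(points, subset)
            if d > best_div:
                best, best_div = subset, d
    return best, best_div


# Toy input: four points on a line, two groups, one point required from each.
points = [(0, 0), (1, 0), (10, 0), (11, 0)]
groups = ["a", "a", "b", "b"]
quotas = {"a": 1, "b": 1}
print(brute_force_fairdiv(points, groups, quotas))  # ((0, 3), 11.0)
```

The search enumerates all $\binom{n}{k}$ subsets, which is exactly the cost that approximation algorithms for this problem are designed to avoid.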
