Faster Algorithms for Fair Max-Min Diversification in $\mathbb{R}^d$ (2404.04713v3)
Abstract: Extracting a diverse subset from a dataset, often called maximum diversification, is central to many real-world applications with far-reaching consequences. We study fairness-aware data subset selection, specifically the problem of selecting a diverse set of size $k$ from a large collection of $n$ data points (FairDiv). FairDiv is well studied in the data management and theory communities. We develop the first constant-approximation algorithm for FairDiv that runs in near-linear time using only linear space. In contrast, all previously known constant-approximation algorithms run in super-linear time (with respect to $n$ or $k$) and use super-linear space. Our approach achieves this efficiency by employing a novel combination of the Multiplicative Weight Update method and advanced geometric data structures to implicitly and approximately solve a linear program. We further improve the efficiency of our techniques by constructing a coreset. Using this coreset, we also propose the first efficient streaming algorithm for FairDiv whose efficiency does not depend on the distribution of the data points. Empirical evaluation on million-sized datasets shows that our algorithm achieves the best diversity within a minute, whereas all prior techniques are either highly inefficient or fail to produce a good solution.
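To make the objective concrete, the following is a minimal sketch of the quantity FairDiv maximizes. It is not the near-linear-time algorithm described in the abstract. It assumes the usual formulation in which every point carries a group label and the selection must contain a prescribed number of points from each group; the names `diversity`, `brute_force_fair_div`, `groups`, and `quotas` are hypothetical and chosen only for illustration. The search is exhaustive and therefore exponential, so it is usable only on tiny inputs.

```python
from itertools import combinations, product
from math import dist, inf


def diversity(points):
    """Max-min diversity: the smallest pairwise Euclidean distance in the set."""
    return min((dist(p, q) for p, q in combinations(points, 2)), default=inf)


def brute_force_fair_div(points, groups, quotas):
    """Exhaustively find the most diverse subset with quotas[g] points per group g.

    Assumption: fairness means an exact per-group quota. Exponential time.
    """
    # Bucket the points by their group label.
    by_group = {g: [p for p, h in zip(points, groups) if h == g] for g in quotas}
    # Every way of picking quotas[g] points from group g, for each group.
    per_group_choices = [combinations(by_group[g], quotas[g]) for g in quotas]

    best, best_value = None, -inf
    for choice in product(*per_group_choices):
        subset = [p for part in choice for p in part]
        value = diversity(subset)
        if value > best_value:
            best, best_value = subset, value
    return best, best_value


# Toy instance: two groups on a line, one point required from each group.
points = [(0, 0), (1, 0), (10, 0), (11, 0)]
groups = ["a", "a", "b", "b"]
best, value = brute_force_fair_div(points, groups, {"a": 1, "b": 1})
print(best, value)  # [(0, 0), (11, 0)] 11.0
```

The exhaustive search examines every quota-respecting subset, which is exactly the cost that approximation algorithms for FairDiv are designed to avoid.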