Fair Wasserstein Coresets (2311.05436v4)
Abstract: Data distillation and coresets have emerged as popular approaches for generating a smaller, representative set of samples so that downstream learning tasks can handle large-scale datasets. At the same time, machine learning is being increasingly applied to decision-making processes at a societal level, making it imperative for modelers to address inherent biases towards subgroups present in the data. While current approaches focus on creating fair synthetic representative samples by optimizing local properties relative to the original samples, their impact on downstream learning processes has yet to be explored. In this work, we present fair Wasserstein coresets (FWC), a novel coreset approach which generates fair synthetic representative samples along with sample-level weights to be used in downstream learning tasks. FWC uses an efficient majorization-minimization (MM) algorithm to minimize the Wasserstein distance between the original dataset and the weighted synthetic samples while enforcing demographic parity. We show that an unconstrained version of FWC is equivalent to Lloyd's algorithm for k-medians and k-means clustering. Experiments conducted on both synthetic and real datasets show that FWC: (i) achieves a competitive fairness-utility tradeoff in downstream models compared to existing approaches, (ii) improves downstream fairness when added to the existing training data, and (iii) can be used to reduce biases in predictions from LLMs (GPT-3.5 and GPT-4).
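The equivalence with Lloyd's algorithm mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and it omits the demographic parity constraint entirely. It assumes a squared Euclidean transport cost (the k-means case) and freely chosen coreset weights; the function name `lloyd_coreset` and its parameters are hypothetical. Under those assumptions, the best transport plan sends each original point to its nearest synthetic sample, so alternating between that assignment and a centroid update is exactly Lloyd's iteration, and the returned cost is the squared 2-Wasserstein distance between the empirical data distribution and the weighted synthetic samples.

```python
import numpy as np


def lloyd_coreset(X, m, n_iter=50, seed=0):
    """Unconstrained coreset sketch: Lloyd's algorithm with weights.

    Returns m synthetic samples, their weights (summing to 1), and the
    mean squared distance from each point to its assigned sample.
    """
    rng = np.random.default_rng(seed)
    # Initialise the synthetic samples at m distinct data points.
    centers = X[rng.choice(len(X), size=m, replace=False)].astype(float)

    def assign(centers):
        # Squared Euclidean distance from every point to every sample.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return d2, d2.argmin(axis=1)

    for _ in range(n_iter):
        # Assignment step: each point is transported to its nearest sample.
        _, labels = assign(centers)
        # Update step: the mean minimises the squared Euclidean cost.
        for j in range(m):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's sample in place
                centers[j] = members.mean(axis=0)

    # Final assignment, so weights and cost match the returned samples.
    d2, labels = assign(centers)
    weights = np.bincount(labels, minlength=m) / len(X)
    cost = d2[np.arange(len(X)), labels].mean()
    return centers, weights, cost


# Usage on synthetic data (assumed here purely for illustration).
X = np.random.default_rng(1).normal(size=(200, 2))
samples, weights, cost = lloyd_coreset(X, m=5)
print(samples.shape, weights.sum(), cost)
```

Replacing the mean in the update step with a coordinate-wise median would give the k-medians analogue under an L1 cost. The fairness-constrained version described in the abstract restricts the weights and plan further, which this sketch does not attempt.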