
OTClean: Data Cleaning for Conditional Independence Violations using Optimal Transport

Published 4 Mar 2024 in cs.LG, cs.AI, and cs.DB | (2403.02372v1)

Abstract: Ensuring Conditional Independence (CI) constraints is pivotal for the development of fair and trustworthy machine learning models. In this paper, we introduce OTClean, a framework that harnesses optimal transport theory for data repair under CI constraints. Optimal transport theory provides a rigorous framework for measuring the discrepancy between probability distributions, thereby ensuring control over data utility. We formulate the data repair problem concerning CIs as a Quadratically Constrained Linear Program (QCLP) and propose an alternating method for its solution. However, this approach faces scalability issues due to the computational cost associated with computing optimal transport distances, such as the Wasserstein distance. To overcome these scalability challenges, we reframe our problem as a regularized optimization problem, enabling us to develop an iterative algorithm inspired by Sinkhorn's matrix scaling algorithm, which efficiently addresses high-dimensional and large-scale data. Through extensive experiments, we demonstrate the efficacy and efficiency of our proposed methods, showcasing their practical utility in real-world data cleaning and preprocessing tasks. Furthermore, we provide comparisons with traditional approaches, highlighting the superiority of our techniques in terms of preserving data utility while ensuring adherence to the desired CI constraints.
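
The abstract names Sinkhorn's matrix scaling algorithm as the inspiration for the scalable method. As a rough illustration of that building block only, and not of the paper's repair procedure (which additionally enforces the CI constraint), entropy-regularized optimal transport between two discrete distributions could be sketched in NumPy as below. The function name `sinkhorn`, its parameters (`reg`, `n_iters`, `tol`), and the toy inputs are assumptions made for this example; none of them come from the paper.

```python
import numpy as np


def sinkhorn(a, b, cost, reg=0.5, n_iters=1000, tol=1e-9):
    """Generic entropy-regularized OT via Sinkhorn scaling (illustrative only).

    a, b : source and target histograms (nonnegative, each summing to 1)
    cost : matrix with cost[i, j] = cost of moving mass from bin i to bin j
    reg  : entropic regularization strength (hypothetical default)
    Returns the transport plan and its transport cost.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cost = np.asarray(cost, dtype=float)

    # Gibbs kernel induced by the cost matrix.
    K = np.exp(-cost / reg)

    # Scaling vectors; the plan has the form diag(u) @ K @ diag(v).
    u = np.ones_like(a)
    v = np.ones_like(b)

    for _ in range(n_iters):
        u_prev = u
        v = b / (K.T @ u)  # rescale columns to match the target marginal
        u = a / (K @ v)    # rescale rows to match the source marginal
        if np.max(np.abs(u - u_prev)) < tol:
            break

    plan = u[:, None] * K * v[None, :]
    return plan, float(np.sum(plan * cost))


# Toy example (assumed data): two bins, unit cost for moving between them.
a = np.array([0.5, 0.5])
b = np.array([0.25, 0.75])
cost = np.array([[0.0, 1.0],
                 [1.0, 0.0]])

plan, transport_cost = sinkhorn(a, b, cost)
print(plan)
print(transport_cost)
```

The returned plan has row sums equal to `a` and column sums approximately equal to `b`. Smaller values of `reg` bring the result closer to the unregularized Wasserstein cost but typically need more iterations and can underflow in the kernel, which is the usual trade-off with this family of methods.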
