Fairness Without Harm: An Influence-Guided Active Sampling Approach (2402.12789v3)
Abstract: The pursuit of fairness in machine learning, i.e., ensuring that models do not exhibit biases toward protected demographic groups, typically involves a compromise: along the Pareto frontier induced by fixed resources (e.g., data), reducing fairness violations often comes at the cost of lower model accuracy. In this work, we aim to train models that mitigate group fairness disparities without harming model accuracy. Intuitively, acquiring more data is a natural and promising way to achieve this goal, since it can push the fairness-accuracy tradeoff toward a better Pareto frontier. However, existing data acquisition methods, such as fair active learning approaches, typically require annotating sensitive attributes, and these annotations should be protected due to privacy and safety concerns. In this paper, we propose a tractable active data sampling algorithm that does not rely on group annotations for the training data, requiring them only on a small validation set. Specifically, the algorithm first scores each candidate example by its influence on the fairness and accuracy of the model, both evaluated on the validation set, and then selects a fixed number of examples for training. We theoretically analyze how acquiring more data can improve fairness without causing harm, and validate the feasibility of our sampling approach in the context of risk disparity. We also derive upper bounds on the generalization error and the risk disparity, along with their connections. Extensive experiments on real-world data demonstrate the effectiveness of the proposed algorithm. Our code is available at https://github.com/UCSC-REAL/FairnessWithoutHarm.
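To make the scoring-and-selection step concrete, below is a minimal NumPy sketch of influence-guided sampling for a logistic-regression model. This is an illustration under stated assumptions, not the authors' released implementation (see the repository above for that): the helper names (`select_batch`, `per_example_grads`), the risk-disparity fairness measure, and the "keep accuracy, then maximize fairness gain" selection rule are all assumptions made here for exposition. The key property it mirrors is that group labels `g_val` are used only on the small validation set, never on the candidate pool.

```python
# Minimal sketch of influence-guided active sampling (illustrative, not the
# paper's exact algorithm). Model: logistic regression with weights w.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_example_grads(w, X, y):
    """Per-example gradients of the logistic loss w.r.t. w; shape (n, d)."""
    p = sigmoid(X @ w)
    return (p - y)[:, None] * X

def mean_logloss(w, X, y, eps=1e-12):
    p = np.clip(sigmoid(X @ w), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def select_batch(w, X_pool, y_pool, X_val, y_val, g_val, k):
    """Pick k pool examples whose first-order influence shrinks the validation
    risk disparity while not increasing the validation loss.

    A one-step gradient update on example z changes a validation quantity Q by
    roughly -eta * grad_Q(w) . grad_loss(z, w), so a positive dot product
    means adding z is expected to DECREASE Q.
    Assumes both groups (0 and 1) appear in the validation set."""
    g_pool = per_example_grads(w, X_pool, y_pool)            # (n_pool, d)

    # Utility direction: gradient of the mean validation loss.
    g_util = per_example_grads(w, X_val, y_val).mean(axis=0)

    # Fairness direction: gradient of |risk(group 0) - risk(group 1)|.
    m0, m1 = (g_val == 0), (g_val == 1)
    gap = mean_logloss(w, X_val[m0], y_val[m0]) - mean_logloss(w, X_val[m1], y_val[m1])
    g0 = per_example_grads(w, X_val[m0], y_val[m0]).mean(axis=0)
    g1 = per_example_grads(w, X_val[m1], y_val[m1]).mean(axis=0)
    g_fair = np.sign(gap) * (g0 - g1)

    util_score = g_pool @ g_util   # > 0: expected to lower validation loss
    fair_score = g_pool @ g_fair   # > 0: expected to shrink the disparity

    # "Without harm": restrict to examples not expected to hurt accuracy,
    # then take the ones most helpful for fairness.
    ok = np.where(util_score >= 0)[0]
    if len(ok) < k:                # fall back to the least-harmful examples
        ok = np.argsort(-util_score)[:k]
    return ok[np.argsort(-fair_score[ok])[:k]]

if __name__ == "__main__":
    # Tiny synthetic demo: group labels exist only for the validation split.
    rng = np.random.default_rng(0)
    d = 5
    w = rng.normal(size=d)
    X_pool, y_pool = rng.normal(size=(200, d)), rng.integers(0, 2, 200)
    X_val, y_val = rng.normal(size=(50, d)), rng.integers(0, 2, 50)
    g_val = rng.integers(0, 2, 50)
    print(select_batch(w, X_pool, y_pool, X_val, y_val, g_val, k=10))
```

In this sketch the batch would be added to the training set, the model retrained, and the scoring repeated, the usual active-learning loop; the paper's actual objective and selection rule may differ in detail.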
Authors: Jinlong Pang, Jialu Wang, Zhaowei Zhu, Yuanshun Yao, Chen Qian, Yang Liu