Testing Relative Fairness in Human Decisions With Machine Learning (2112.11279v2)
Abstract: Fairness in decision-making has been a long-standing issue in our society. Compared to algorithmic fairness, fairness in human decisions is even more important, since humans make the final decisions in many processes and machine learning models inherit bias from the human decisions they were trained on. However, standards for fairness in human decisions are highly subjective and contextual, which makes it difficult to test "absolute" fairness in human decisions. To bypass this issue, this work aims to test relative fairness in human decisions. That is, instead of defining what "absolutely fair" decisions are, we check the relative fairness of one decision set against another. An example outcome can be: Decision Set A favors females over males more than Decision Set B does. Such relative fairness has the following benefits: (1) it avoids the ambiguous and contradictory definitions of "absolute" fairness; (2) it reveals the relative preference and bias between different human decisions; (3) if a reference set of decisions is provided, the relative fairness of other decision sets against this reference set reflects whether those decision sets are fair by the standard of that reference set. We define relative fairness with statistical tests (a null hypothesis test and an effect size test) on the decision differences across each sensitive group. Furthermore, we show that a machine learning model trained on human decisions inherits their bias/preference and can therefore be utilized to estimate the relative fairness between two decision sets made on different data.
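To make the testing recipe in the abstract concrete, here is a minimal sketch. It assumes numeric decisions on the same items from both sets and a binary sensitive attribute, and it uses Welch's t-test as the null hypothesis test and Cohen's d as the effect size; these are standard choices but only an assumption about the paper's exact procedure, and the function name `relative_fairness` and its parameters are hypothetical.

```python
# Minimal sketch of a relative fairness test between two decision sets.
# Assumptions (not confirmed by the abstract alone): decisions are numeric,
# both sets score the same items, and the sensitive attribute is binary.
import numpy as np
from scipy import stats

def relative_fairness(decisions_a, decisions_b, sensitive):
    """Test whether Decision Set A favors one sensitive group more than Set B.

    Returns the Welch's t-test p-value (null hypothesis: the per-item
    decision differences A - B are distributed the same across the two
    groups) and Cohen's d as the effect size of that difference.
    """
    diff = np.asarray(decisions_a, dtype=float) - np.asarray(decisions_b, dtype=float)
    sensitive = np.asarray(sensitive)
    groups = np.unique(sensitive)
    if len(groups) != 2:
        raise ValueError("this sketch handles a binary sensitive attribute")
    d0 = diff[sensitive == groups[0]]
    d1 = diff[sensitive == groups[1]]

    # Null hypothesis test: Welch's t-test, which does not assume equal variances.
    _, p_value = stats.ttest_ind(d0, d1, equal_var=False)

    # Effect size: Cohen's d with a pooled standard deviation.
    pooled_sd = np.sqrt(((len(d0) - 1) * d0.var(ddof=1) +
                         (len(d1) - 1) * d1.var(ddof=1)) /
                        (len(d0) + len(d1) - 2))
    cohens_d = (d0.mean() - d1.mean()) / pooled_sd
    return p_value, cohens_d
```

Under this sketch, a small p-value together with a non-negligible |d| (e.g., at least 0.2 under common rules of thumb) would suggest that Set A shifts decisions for one group relative to Set B. When the two decision sets cover different data, the abstract's proposal is to train a model on one set of human decisions and apply it to the other set's data, so that the two decision sets can be compared item by item as above.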
- Zhe Yu
- Xiaoyin Xi