Systematic Bias in Sample Inference and its Effect on Machine Learning (2307.01384v1)

Published 3 Jul 2023 in cs.LG and stat.ME

Abstract: A commonly observed pattern in machine learning models is an underprediction of the target feature, with the model's predicted target rate for members of a given category typically being lower than the actual target rate for members of that category in the training set. This underprediction is usually larger for members of minority groups; while income level is underpredicted for both men and women in the 'adult' dataset, for example, the degree of underprediction is significantly higher for women (a minority in that dataset). We propose that this pattern of underprediction for minorities arises as a predictable consequence of statistical inference on small samples. When presented with a new individual for classification, an ML model performs inference not on the entire training set, but on a subset that is in some way similar to the new individual, with the sizes of these subsets typically following a power law distribution so that most are small (and with these subsets being necessarily smaller for the minority group). We show that such inference on small samples is subject to systematic and directional statistical bias, and that this bias produces the observed patterns of underprediction seen in ML models. Analysing a standard sklearn decision tree model's predictions on a set of over 70 subsets of the 'adult' and COMPAS datasets, we found that a bias prediction measure based on small-sample inference had significant positive correlations (0.56 and 0.85) with the observed underprediction rate for these subsets.
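The directional bias the abstract describes can be illustrated with a short exact calculation (a minimal sketch, not the paper's bias-prediction measure): a classifier that predicts the majority class of the finite sample it consults will, whenever the true target rate p is below 0.5, predict the target at a rate lower than p for every sample size n ≥ 2, even though the raw sample proportion is an unbiased estimate of p. The target rate of 0.24 used below is an assumption, chosen as roughly the '>50K' share in the 'adult' dataset.

```python
from math import comb

def predicted_rate(n: int, p: float) -> float:
    """Probability that majority vote over a sample of size n, drawn from a
    population with true target rate p, predicts the target class.
    Computed exactly as P(X > n/2) for X ~ Binomial(n, p): a strict
    majority of target-class members is required (ties go against the
    target class)."""
    return sum(comb(n, k) * (p ** k) * ((1 - p) ** (n - k))
               for k in range(n // 2 + 1, n + 1))

true_rate = 0.24  # assumption: roughly the '>50K' rate in 'adult'
for n in (1, 3, 9, 31):
    # For p < 0.5, the predicted rate sits below the true rate for all
    # n >= 2 and shrinks further as n grows: a systematic underprediction.
    print(f"n={n:3d}  predicted rate = {predicted_rate(n, true_rate):.4f}")
```

The direction of the bias follows the decision threshold: target rates below 0.5 are underpredicted, mirroring the underprediction pattern the paper reports, while the unthresholded sample proportion itself remains unbiased.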

References (20)
  1. Machine Bias. ProPublica.
  2. Fairness in machine learning. NIPS tutorial, 1: 2.
  3. Adult dataset. UCI Repository of Machine Learning Datasets.
  4. Bias in Machine Learning Software: Why? How? What to Do? In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2021, 429–440. New York, NY, USA: Association for Computing Machinery. ISBN 9781450385626.
  5. The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning.
  6. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 797–806.
  7. Underestimation Bias and Underfitting in Machine Learning. In Heintz, F.; Milano, M.; and O’Sullivan, B., eds., Trustworthy AI - Integrating Learning, Optimization and Reasoning, 20–31. Cham: Springer International Publishing. ISBN 978-3-030-73959-1.
  8. de Finetti, B. 1937. La Prévision: Ses Lois Logiques, Ses Sources Subjectives. Annales de l’Institut Henri Poincaré, 17: 1–68.
  9. A Comparative Study of Fairness-Enhancing Interventions in Machine Learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* ’19, 329–338. New York, NY, USA: Association for Computing Machinery. ISBN 9781450361255.
  10. 50 Years of Test (Un)Fairness: Lessons for Machine Learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* ’19, 49–58. New York, NY, USA: Association for Computing Machinery. ISBN 9781450361255.
  11. Scaling up the accuracy of Naive-Bayes classifiers: A decision-tree hybrid. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data mining, Portland, 202–207.
  12. A Survey on Bias and Fairness in Machine Learning. ACM Comput. Surv., 54(6).
  13. Bias in data-driven artificial intelligence systems—An introductory survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10(3): e1356.
  14. Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries. Frontiers in Big Data, 2.
  15. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12: 2825–2830.
  16. A survey on datasets for fairness-aware machine learning. arXiv preprint arXiv:2110.00530.
  17. A framework for understanding sources of harm throughout the machine learning life cycle. In Equity and Access in Algorithms, Mechanisms, and Optimization, 1–9.
  18. A framework for understanding unintended consequences of machine learning. arXiv preprint arXiv:1901.10002, 2.
  19. Turner Lee, N. 2018. Detecting racial bias in algorithms and machine learning. Journal of Information, Communication and Ethics in Society, 16(3): 252–260.
  20. Zabell, S. 1989. The Rule of Succession. Erkenntnis, 31.
