Addressing Discretization-Induced Bias in Demographic Prediction (arXiv:2405.16762v1)

Published 27 May 2024 in cs.CY and cs.LG

Abstract: Racial and other demographic imputation is necessary for many applications, especially in auditing disparities and outreach targeting in political campaigns. The canonical approach is to construct continuous predictions -- e.g., based on name and geography -- and then to $\textit{discretize}$ the predictions by selecting the most likely class (argmax). We study how this practice produces $\textit{discretization bias}$. In particular, we show that argmax labeling, as used by a prominent commercial voter file vendor to impute race/ethnicity, results in a substantial under-count of African-American voters, e.g., by 28.2 percentage points in North Carolina. This bias can have substantial implications for downstream tasks that use such labels. We then introduce a $\textit{joint optimization}$ approach -- and a tractable $\textit{data-driven thresholding}$ heuristic -- that can eliminate this bias, with negligible individual-level accuracy loss. Finally, we theoretically analyze discretization bias, showing that calibrated continuous models are insufficient to eliminate it and that an approach such as ours is necessary. Broadly, we warn researchers and practitioners against discretizing continuous demographic predictions without considering downstream consequences.
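To see why argmax discretization biases aggregate counts even when the underlying model is perfectly calibrated, consider a two-class example: if every voter has P(Black) = 0.4, argmax labels none of them Black, yet 40% truly are. The sketch below is illustrative only, not the paper's code; the score distribution and the `target_share` estimate are assumptions. It reproduces the under-count and shows the kind of data-driven thresholding the abstract describes, where the cutoff is chosen so that labeled shares match an external population estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulate continuous posteriors P(Black | name, geography).
# Beta(2, 5) skews scores toward the majority class, so many true
# minority voters sit below the 0.5 argmax cutoff.
p_black = rng.beta(2, 5, size=n)
# Drawing labels from the posteriors makes the model calibrated by
# construction -- yet argmax still under-counts, as the paper argues.
true_black = rng.random(n) < p_black

# Canonical argmax labeling: Black iff P(Black) > 0.5.
argmax_black = p_black > 0.5
print(f"true share:    {true_black.mean():.3f}")    # ~0.286
print(f"argmax share:  {argmax_black.mean():.3f}")  # ~0.109, a large under-count

# Data-driven thresholding: pick the cutoff as the score quantile that
# makes the labeled share match a target population share (here taken
# from the simulation itself; in practice, an external estimate).
target_share = true_black.mean()
tau = np.quantile(p_black, 1.0 - target_share)
thresh_black = p_black > tau
print(f"threshold tau: {tau:.3f}")
print(f"labeled share: {thresh_black.mean():.3f}")  # matches target by construction
```

The threshold fix trades a small amount of individual-level accuracy (some borderline cases flip label) for unbiased aggregate counts, which is the trade-off the paper quantifies and finds negligible.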
