
The Silicon Ceiling: Auditing GPT's Race and Gender Biases in Hiring (2405.04412v3)

Published 7 May 2024 in cs.CY and cs.CL

Abstract: LLMs are increasingly being introduced in workplace settings, with the goals of improving efficiency and fairness. However, concerns have arisen regarding these models' potential to reflect or exacerbate social biases and stereotypes. This study explores the potential impact of LLMs on hiring practices. To do so, we conduct an AI audit of race and gender biases in one commonly-used LLM, OpenAI's GPT-3.5, taking inspiration from the history of traditional offline resume audits. We conduct two studies using names with varied race and gender connotations: resume assessment (Study 1) and resume generation (Study 2). In Study 1, we ask GPT to score resumes with 32 different names (4 names for each combination of the 2 gender and 4 racial groups) and two anonymous options across 10 occupations and 3 evaluation tasks (overall rating, willingness to interview, and hireability). We find that the model reflects some biases based on stereotypes. In Study 2, we prompt GPT to create resumes (10 for each name) for fictitious job candidates. When generating resumes, GPT reveals underlying biases; women's resumes had occupations with less experience, while Asian and Hispanic resumes had immigrant markers, such as non-native English and non-U.S. education and work experiences. Our findings contribute to a growing body of literature on LLM biases, particularly in workplace contexts.

Exploring Bias in AI Hiring Practices Using GPT-3.5

Introduction to the Study

With AI technologies like LLMs making their way into various professional arenas, their use in hiring processes has attracted considerable attention. Traditionally used for tasks like content generation and customer service, these models, especially OpenAI's GPT-3.5, are now also being tested for roles in recruitment, raising important questions about fairness and bias.

The paper examines the extent to which AI, specifically GPT-3.5, might exhibit biases that could influence hiring decisions. This investigation is timely given the increasing integration of AI tools into hiring and the legislative push to demonstrate their fairness.

Research Questions and Study Design

Two key questions guided this research:

  1. Resume Assessment: Does GPT show bias in rating resumes that differ only in the race and gender connotations of the names?
  2. Resume Generation: When tasked with creating resumes, does GPT reveal underlying biases related to race and gender?

To address these questions, the researchers conducted two main studies. In Study 1: Resume Assessment, GPT-3.5 was tasked with rating resumes for various jobs under names indicative of different genders and races. The focus was on how GPT scored each hypothetical applicant on overall rating, willingness to interview, and hireability. In Study 2: Resume Generation, GPT was used to generate resumes from scratch from names alone, allowing the researchers to explore whether intrinsic biases shape the content the model creates.
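To make the audit procedure concrete, the sketch below shows how a Study 1-style scoring loop could be run against the OpenAI chat API. It is an illustrative reconstruction, not the authors' code: the names, occupations, resume text, and prompt wording are placeholders, and only the overall structure (one fixed resume, a varied name, three evaluation tasks) follows the study design.

```python
# Illustrative sketch of a Study 1-style resume-scoring audit.
# Names, occupations, resume text, and prompts are placeholders, not the paper's materials.
from itertools import product
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

names = ["Name A", "Name B"]            # the paper used 32 names plus 2 anonymous options
occupations = ["software engineer"]     # the paper used 10 occupations
tasks = {
    "overall":     "Rate this resume from 1 to 10 overall.",
    "interview":   "From 1 to 10, how willing would you be to interview this candidate?",
    "hireability": "From 1 to 10, how hireable is this candidate?",
}
resume_text = "..."  # one fixed resume reused across every name, so only the name varies

def score(name: str, occupation: str, task_prompt: str) -> str:
    """Ask GPT-3.5 for a rating of the same resume under a given name and task."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (f"{task_prompt}\nOccupation: {occupation}\n"
                        f"Candidate name: {name}\n\nResume:\n{resume_text}"),
        }],
    )
    return response.choices[0].message.content

for name, occupation, (task, prompt) in product(names, occupations, tasks.items()):
    print(name, occupation, task, score(name, occupation, prompt))
```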

Findings from the Studies

Study 1: Assessing Bias in Resume Ratings

The results of this study indicated subtle but consistent preferences in GPT's scoring:

  • Resumes with names suggesting White ethnic backgrounds tended to receive higher ratings than those with names associated with other racial groups.
  • Male candidates, particularly in male-dominated fields, received higher ratings than female candidates.

This suggests that even without explicit racial or gender markers in the text, biases can still permeate through AI assessments based on culturally loaded signals like names.
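For readers who want to see how such group-level gaps can be surfaced from the raw audit output, a minimal analysis sketch follows; the CSV file and column names are assumptions for illustration, not artifacts released with the paper.

```python
# Minimal sketch: aggregate audit ratings by race, gender, and evaluation task.
# "audit_ratings.csv" and its columns (race, gender, task, rating) are assumed for illustration.
import pandas as pd

ratings = pd.read_csv("audit_ratings.csv")  # one row per (name, occupation, task) query

summary = (
    ratings.groupby(["race", "gender", "task"])["rating"]
           .agg(["mean", "std", "count"])
           .reset_index()
)

# Sorting within each task makes group-level rating gaps easy to scan.
print(summary.sort_values(["task", "mean"], ascending=[True, False]))
```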

Study 2: Bias in Generated Resume Content

More pronounced biases were detected in the resume content generated by GPT:

  • Women's resumes often showed less job experience and lower seniority than men's.
  • Resumes for Asian and Hispanic candidates more frequently included indications of immigrant status, such as non-native English skills or foreign work and educational experience, despite the prompt specifying the U.S. as the context.
  • Certain stereotypical job roles and industries were associated with specific races and genders. For example, computing roles were disproportionately suggested for Asian men, whereas clerical and retail roles were more common for women.
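A rough sketch of how a Study 2-style generation check could be run with crude, automated feature coding is shown below; the prompt, marker keywords, and regular expression are illustrative stand-ins, not the paper's coding scheme.

```python
# Rough sketch of a Study 2-style generation audit with crude, automated feature coding.
# The prompt, marker keywords, and regex are illustrative assumptions, not the paper's scheme.
import re
from openai import OpenAI

client = OpenAI()

MARKERS = ["TOEFL", "visa", "English proficiency", "ESL"]  # crude proxies for immigrant markers

def generate_resume(name: str) -> str:
    """Ask GPT-3.5 to invent a resume for a U.S.-based candidate, given only a name."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Write a resume for a job candidate in the U.S. named {name}."}],
    )
    return response.choices[0].message.content

def crude_features(resume: str) -> dict:
    """Extract toy features: the largest 'N years' figure and marker keyword hits."""
    years = [int(n) for n in re.findall(r"(\d+)\+?\s*years", resume)]
    return {
        "max_years_experience": max(years, default=0),
        "marker_hits": sum(m.lower() in resume.lower() for m in MARKERS),
    }

for name in ["Name A", "Name B"]:  # the paper generated 10 resumes per name
    print(name, crude_features(generate_resume(name)))
```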

Implications of the Findings

The presence of biases in both resume assessment and generation by GPT-3.5 raises significant concerns about the fairness of AI-powered hiring tools. The results suggest a "silicon ceiling," in which systemic biases could limit job opportunities for certain groups, mirroring social inequalities in automated digital environments. This has practical implications for businesses and policymakers, who must account for these biases when deploying and regulating AI hiring technologies.

Concluding Thoughts

While AI offers the potential to streamline and enhance hiring processes, it's clear that without careful consideration, these technologies can also perpetuate and even amplify existing disparities. Ongoing audit studies, like the one discussed here, are crucial in identifying and mitigating these biases. As AI continues to evolve, it will be imperative to balance technological advancement with ethical considerations to ensure equitable outcomes across all demographic groups. Future studies could expand on this work by exploring a wider range of identity markers and incorporating real-world hiring scenarios to more thoroughly understand and address AI bias in employment.

Authors
  1. Lena Armstrong
  2. Abbey Liu
  3. Stephen MacNeil
  4. Danaë Metaxa