The Silicon Ceiling: Auditing GPT's Race and Gender Biases in Hiring
Abstract: LLMs are increasingly being introduced in workplace settings, with the goals of improving efficiency and fairness. However, concerns have arisen regarding these models' potential to reflect or exacerbate social biases and stereotypes. This study explores the potential impact of LLMs on hiring practices. To do so, we conduct an AI audit of race and gender biases in one commonly used LLM, OpenAI's GPT-3.5, taking inspiration from the history of traditional offline resume audits. We conduct two studies using names with varied race and gender connotations: resume assessment (Study 1) and resume generation (Study 2). In Study 1, we ask GPT to score resumes bearing 32 different names (4 names for each combination of the 2 gender and 4 racial groups), plus two anonymous options, across 10 occupations and 3 evaluation tasks (overall rating, willingness to interview, and hireability). We find that the model reflects some biases based on stereotypes. In Study 2, we prompt GPT to create resumes (10 for each name) for fictitious job candidates. When generating resumes, GPT reveals underlying biases: women's resumes featured occupations with less experience, while Asian and Hispanic resumes contained immigrant markers, such as non-native English and non-U.S. education and work experience. Our findings contribute to a growing body of literature on LLM biases, particularly in workplace contexts.
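The Study 1 design described above can be read as a full factorial grid of audit conditions. The sketch below enumerates that grid; the name strings, occupation titles, and task wordings are placeholders, not the study's actual materials:

```python
from itertools import product

# Hypothetical reconstruction of the Study 1 condition grid:
# 4 names per (gender x race) cell, plus two anonymous baselines,
# crossed with 10 occupations and 3 evaluation tasks.
genders = ["woman", "man"]
races = ["White", "Black", "Asian", "Hispanic"]

# 2 genders x 4 races x 4 names per cell = 32 named conditions.
names = [f"name_{g}_{r}_{i}"
         for g, r in product(genders, races)
         for i in range(4)]
names += ["anonymous_1", "anonymous_2"]  # the two anonymous options

occupations = [f"occupation_{k}" for k in range(10)]  # placeholder titles
tasks = ["overall rating", "willingness to interview", "hireability"]

# One model query per (name, occupation, task) combination.
conditions = list(product(names, occupations, tasks))
print(len(conditions))  # 34 names x 10 occupations x 3 tasks = 1020
```

Each tuple in `conditions` would then parameterize one scoring prompt sent to the model, with resume content held constant across name conditions so that score differences are attributable to the name alone.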