"Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters (2310.09219v5)

Published 13 Oct 2023 in cs.CL and cs.AI

Abstract: LLMs have recently emerged as an effective tool to assist individuals in writing various types of content, including professional documents such as recommendation letters. Though bringing convenience, this application also introduces unprecedented fairness concerns. Model-generated reference letters might be directly used by users in professional scenarios. If underlying biases exist in these model-constructed letters, using them without scrutinization could lead to direct societal harms, such as sabotaging application success rates for female applicants. In light of this pressing issue, it is imminent and necessary to comprehensively study fairness issues and associated harms in this real-world use case. In this paper, we critically examine gender biases in LLM-generated reference letters. Drawing inspiration from social science findings, we design evaluation methods to manifest biases through 2 dimensions: (1) biases in language style and (2) biases in lexical content. We further investigate the extent of bias propagation by analyzing the hallucination bias of models, a term that we define to be bias exacerbation in model-hallucinated contents. Through benchmarking evaluation on 2 popular LLMs- ChatGPT and Alpaca, we reveal significant gender biases in LLM-generated recommendation letters. Our findings not only warn against using LLMs for this application without scrutinization, but also illuminate the importance of thoroughly studying hidden biases and harms in LLM-generated professional documents.

Gender Biases in LLM-Generated Reference Letters: A Critical Analysis

The paper "Kelly is a Warm Person, Joseph is a Role Model: Gender Biases in LLM-Generated Reference Letters" presents an in-depth investigation into the gender biases manifest in reference letters produced by LLMs, specifically exemplified by models such as ChatGPT and Alpaca. This paper is critically important, as it addresses the significant and often overlooked issue of fairness and bias in automated text generation, with profound implications for real-world professional scenarios, including hiring and admissions processes.

Methodological Approach

The researchers identified two primary scenarios for evaluating bias in LLM-generated reference letters: Context-Less Generation (CLG) and Context-Based Generation (CBG). In CLG, the model generates a letter from minimal input, such as a name and gender, which surfaces its inherent biases. CBG uses a richer prompt that incorporates personal and professional biographical details, simulating the realistic use case in which users supply comprehensive information about the candidate.
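
To make the distinction concrete, below is a minimal sketch of how the two prompting scenarios could be set up; the template wording and the `query_llm` helper are illustrative assumptions, not the paper's exact prompts.

```python
# Illustrative prompt builders for the two evaluation scenarios.
# Wording is hypothetical; the paper's actual templates may differ.

def clg_prompt(name: str, gender: str) -> str:
    """Context-Less Generation: only a name and a gender cue are supplied."""
    return f"Generate a reference letter for {name}, a {gender} student."

def cbg_prompt(name: str, gender: str, biography: str) -> str:
    """Context-Based Generation: personal/professional details are included."""
    return (
        f"Generate a reference letter for {name}, a {gender} applicant, "
        f"based on the following biography:\n{biography}"
    )

# Usage with a hypothetical query_llm() wrapper around ChatGPT or Alpaca:
# letter = query_llm(cbg_prompt("Kelly", "female", "Kelly has 5 years of ..."))
```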

The paper draws upon social science frameworks to assess biases along two dimensions: language style and lexical content. It also introduces "hallucination bias," a novel concept defined as bias exacerbation in generated content that is not entailed by the factual input, revealing how LLMs can amplify existing stereotypes during text generation.
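
One way to operationalize this check, following the entailment framing above, is to treat the input biography as an NLI premise and each generated sentence as a hypothesis, flagging sentences that are not entailed as hallucinated content to be analyzed for bias. The snippet below is a rough sketch under that assumption, using an off-the-shelf MNLI checkpoint and threshold rather than the paper's exact pipeline.

```python
# Rough sketch: flag generated sentences not entailed by the input biography,
# i.e. candidate hallucinated content. Checkpoint and threshold are
# illustrative choices, not the paper's exact setup.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(NLI_NAME)
model = AutoModelForSequenceClassification.from_pretrained(NLI_NAME)

def is_entailed(premise: str, hypothesis: str, threshold: float = 0.5) -> bool:
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    # Label order varies by checkpoint; read it from the config rather than hard-coding.
    entail_idx = {v.lower(): k for k, v in model.config.id2label.items()}["entailment"]
    return probs[entail_idx].item() >= threshold

biography = "Kelly has five years of software engineering experience."
generated_sentences = [
    "Kelly has extensive software engineering experience.",
    "Kelly is a warm and caring team player.",
]
hallucinated = [s for s in generated_sentences if not is_entailed(biography, s)]
```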

Key Findings

The paper's findings unequivocally reveal persistent gender biases across both CLG and CBG scenarios:

  1. Lexical Content Biases: Odds ratios over word choices show a significant skew toward gender stereotypes: male names are associated with terms like "leader" or "genius," whereas female names align more frequently with words like "interpersonal" or "warm" (a simplified computation sketch follows this list). These patterns echo societal stereotypes documented in psycholinguistic studies.
  2. Language Style Biases: LLMs produce reference letters in which male candidates are described with more formal, positive, and agentic language than their female counterparts. For example, language describing men aligns more closely with traits valued in professional settings, such as assertiveness and leadership.
  3. Hallucination Bias: Analysis of the non-entailed, hallucinated content indicates that it often exacerbates existing gender biases: unsubstantiated details added by the model skew how candidates are portrayed along gendered lines, further entrenching stereotypes.

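As a concrete reference for item 1 above, the odds ratio for a word can be estimated from its counts in letters generated for male versus female names. The function below is a simplified sketch; the paper's actual lexicons, tokenization, and significance testing are richer than this.

```python
# Simplified odds-ratio sketch for lexical bias: how much more likely a word
# is to appear in letters generated for male names than for female names.
from collections import Counter

def odds_ratio(word: str, male_letters: list[str], female_letters: list[str]) -> float:
    male_counts = Counter(w for letter in male_letters for w in letter.lower().split())
    female_counts = Counter(w for letter in female_letters for w in letter.lower().split())

    a = male_counts[word]                # occurrences of the word in male letters
    b = sum(male_counts.values()) - a    # all other tokens in male letters
    c = female_counts[word]              # occurrences of the word in female letters
    d = sum(female_counts.values()) - c  # all other tokens in female letters

    # Add-0.5 smoothing keeps the ratio finite for rare or unseen words.
    return ((a + 0.5) / (b + 0.5)) / ((c + 0.5) / (d + 0.5))

# OR > 1 suggests the word skews toward letters for male names, OR < 1 toward
# female names, e.g. odds_ratio("leader", male_letters, female_letters).
```
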
Implications and Future Directions

The implications of these biases in LLM-generated texts are far-reaching. The automated generation of biased reference letters can influence critical decisions in employment and academia, potentially perpetuating gender inequality without conscious intervention. Recognizing and mitigating these biases is crucial in ensuring fair and equitable AI tools.

Future work in this field must focus on developing frameworks to correct biases in LLM outputs. This includes enhancing dataset diversity, refining model training techniques to diminish bias emergence, and potentially incorporating bias-check mechanisms during generation. Additionally, extending this research paradigm to other demographic intersections, such as race or ethnicity, would provide a broader understanding of representation biases in LLMs.

The paper effectively highlights the need for critical engagement with AI technologies tasked with generating professional documentation. While LLMs hold enormous potential for automating writing, the ethical and societal consequences of their inherent biases cannot be ignored. Consequently, the paper calls for more rigorous academic and policy-driven discourse to navigate the role of AI in society responsibly.

Authors
  1. Yixin Wan
  2. George Pu
  3. Jiao Sun
  4. Aparna Garimella
  5. Kai-Wei Chang
  6. Nanyun Peng