CALM: A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias (2308.12539v3)

Published 24 Aug 2023 in cs.CL, cs.AI, and cs.LG

Abstract: As language models (LMs) become increasingly powerful and widely used, it is important to quantify them for sociodemographic bias with potential for harm. Prior measures of bias are sensitive to perturbations in the templates designed to compare performance across social groups, due to factors such as low diversity or limited number of templates. Also, most previous work considers only one NLP task. We introduce Comprehensive Assessment of Language Model bias (CALM) for robust measurement of two types of universally relevant sociodemographic bias: gender and race. CALM integrates sixteen datasets for question-answering, sentiment analysis, and natural language inference. Examples from each dataset are filtered to produce 224 templates with high diversity (e.g., in length and vocabulary). We assemble 50 highly frequent person names for each of seven distinct demographic groups to generate 78,400 prompts covering the three NLP tasks. Our empirical evaluation shows that CALM bias scores are more robust and far less sensitive than previous bias measurements to perturbations in the templates, such as synonym substitution, or to random subset selection of templates. We apply CALM to 20 LLMs and find that, for 2 LLM series, larger models tend to be more biased than smaller ones. The T0 series is the least biased of the model families among the 20 LLMs investigated here. The code is available at https://github.com/vipulgupta1011/CALM.

Insights into Bias Evaluation in LLMs Using the CALM Dataset

The paper addresses the intricate challenge of assessing biases in language models (LMs) through the construction and evaluation of the Comprehensive Assessment of Language Model bias (CALM) dataset. It highlights several notable findings and provides a critical analysis of the CALM dataset's efficacy in gauging biases across two demographic dimensions, gender and race.

The creation of the CALM dataset is marked by a strategic selection of a target word list that emphasizes representation from seven social groups within the United States. While this initial scope is limited, the name lists draw on names of various national origins, offering a foundational step toward broader geographic and cultural coverage. The authors provide scripts for replicating and evaluating LM biases across these groups, though they acknowledge that the templates are in English only, suggesting potential for adaptation into other languages with careful linguistic and cultural consideration.
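
To make the template-filling construction concrete, here is a minimal sketch of how prompts can be generated by substituting per-group name lists into task templates, in the spirit of the 224 templates and 50 names per group described above. The group labels, template strings, and function names are illustrative placeholders and are not taken from the CALM repository.

```python
# Illustrative CALM-style prompt generation: fill task templates with person
# names drawn from per-group name lists. All names, templates, and group
# labels here are hypothetical placeholders.
from itertools import product

# Hypothetical name lists (CALM uses 50 frequent names per demographic group).
NAME_LISTS = {
    "group_A_female": ["Alice", "Nancy"],
    "group_A_male": ["Adam", "Frank"],
    "group_B_female": ["Priya", "Mei"],
    # ... remaining groups omitted for brevity
}

# Hypothetical sentiment-analysis templates with a [NAME] slot.
TEMPLATES = [
    "[NAME] said the movie was a complete waste of time. Sentiment:",
    "[NAME] could not stop smiling after reading the letter. Sentiment:",
]

def generate_prompts(templates, name_lists):
    """Yield (group, template_id, prompt) for every template-name combination."""
    for (t_id, template), (group, names) in product(enumerate(templates), name_lists.items()):
        for name in names:
            yield group, t_id, template.replace("[NAME]", name)

prompts = list(generate_prompts(TEMPLATES, NAME_LISTS))
print(len(prompts))   # templates x groups x names per group
print(prompts[0])
```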

Central to the paper is the evaluation of several LMs on sentiment analysis tasks using the CALM dataset, as demonstrated in the gender-wise performance results table. For instance, results indicate minimal differences in sentiment analysis accuracy for models like Falcon-7B and Llama-2 across male, female, and gender-neutral categories. This finding suggests that increased data diversity within the CALM dataset may contribute to attenuating observed biases in model outputs.
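
As a rough illustration of how group-level results like these can be turned into a single comparison, the sketch below computes per-group accuracy and a simple spread-based disparity. This is not the exact bias score defined in the paper; it is only a minimal stand-in to show the shape of the computation.

```python
# Per-group accuracy and a simple disparity measure over model predictions.
# The metric here (max minus min group accuracy) is an illustrative proxy,
# not the bias score defined in the CALM paper.
from collections import defaultdict

def group_accuracy(records):
    """records: iterable of (group, is_correct) pairs -> accuracy per group."""
    totals, hits = defaultdict(int), defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        hits[group] += int(is_correct)
    return {g: hits[g] / totals[g] for g in totals}

def disparity(acc_by_group):
    """Spread between the best- and worst-served groups."""
    return max(acc_by_group.values()) - min(acc_by_group.values())

# Hypothetical sentiment-analysis results for one model.
records = [("male", True), ("male", False), ("female", True),
           ("female", True), ("neutral", True), ("neutral", False)]
acc = group_accuracy(records)
print(acc)             # e.g. {'male': 0.5, 'female': 1.0, 'neutral': 0.5}
print(disparity(acc))  # 0.5
```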

Despite the dataset's potential to uncover biases, the paper acknowledges the intricacies involved in evaluating text generation models. A prominent limitation cited is the variability in baseline performance and bias severity across different models and tasks, which hampers comprehensive bias quantification. Additionally, the presence of overlapping names in gender and race categories introduces potential interdependencies in bias scores, indicating a need for innovative methodologies to separate these influences effectively.

Prompts play a critical role in model performance; accordingly, the authors employ a 5-shot prompting technique, leveraging prompt structures from Liang et al. (2022) and Brown et al. (2020). They note, however, that the prompts used during training are unknown for many LMs, and they advocate prompt standardization to facilitate better cross-model comparability.
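
For readers unfamiliar with the setup, a 5-shot prompt simply prepends five labeled exemplars to the query before asking the model for a prediction. The sketch below shows one way to assemble such a prompt; the exemplar texts and the "Review/Sentiment" format are placeholders and do not reproduce the exact prompt structure used in the paper or in Liang et al. (2022) and Brown et al. (2020).

```python
# Assemble a 5-shot sentiment prompt by prepending labeled exemplars.
# Exemplars and formatting are illustrative placeholders.
EXEMPLARS = [
    ("The service was slow and the food was cold.", "negative"),
    ("An absolutely delightful experience from start to finish.", "positive"),
    ("I would not recommend this place to anyone.", "negative"),
    ("The staff went out of their way to help us.", "positive"),
    ("Mediocre at best, though the location is convenient.", "negative"),
]

def build_few_shot_prompt(query, exemplars=EXEMPLARS):
    """Return a prompt with one 'Review/Sentiment' block per exemplar, then the query."""
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in exemplars]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

print(build_few_shot_prompt("Alice said the movie was a complete waste of time."))
```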

A speculative outlook on future research emphasizes frameworks capable of integrating multiple tasks to derive comprehensive bias assessments. In addition, methods to fully disentangle bias categories and the establishment of standardized prompts remain essential frontiers for enhancing the robustness and fairness of LM evaluations.

In terms of theoretical and practical implications, the CALM dataset serves as a pivotal framework for refining bias assessment and mitigation strategies for LMs. It represents an important step in tracking evolving biases as LLMs broaden their scope and capabilities. The research underscores a rigorous approach to comprehensive bias metrics and paves the way for more nuanced understandings of bias in LMs, with significant implications for AI's role in addressing sociocultural disparities globally.

References (73)
  1. Jaimeen Ahn and Alice Oh. 2021. Mitigating Language-Dependent Ethnic Bias in BERT. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 533–549. https://doi.org/10.18653/v1/2021.emnlp-main.42
  2. Using Natural Sentence Prompts for Understanding Biases in Language Models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Seattle, United States, 2824–2830. https://doi.org/10.18653/v1/2022.naacl-main.203
  3. SODAPOP: Open-Ended Discovery of Social Biases in Social Commonsense Reasoning Models. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Dubrovnik, Croatia, 1573–1596. https://aclanthology.org/2023.eacl-main.116
  4. Unmasking Contextual Stereotypes: Measuring and Mitigating BERT’s Gender Bias. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing. Association for Computational Linguistics, Barcelona, Spain (Online), 1–16. https://aclanthology.org/2020.gebnlp-1.1
  5. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (Virtual Event, Canada) (FAccT ’21). Association for Computing Machinery, New York, NY, USA, 610–623. https://doi.org/10.1145/3442188.3445922
  6. Jayadev Bhaskaran and Isha Bhallamudi. 2019. Good secretaries, bad truck drivers? Occupational gender stereotypes in sentiment analysis. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing. Association for Computational Linguistics, Florence, Italy, 62–68. https://doi.org/10.18653/v1/W19-3809
  7. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow. EleutherAI. https://doi.org/10.5281/zenodo.5297715
  8. Language (technology) is power: a critical survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 5454–5476. https://doi.org/10.18653/v1/2020.acl-main.485
  9. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29. Curran Associates, Inc., Red Hook, NY, USA. https://proceedings.neurips.cc/paper_files/paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf
  10. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, 632–642. https://doi.org/10.18653/v1/D15-1075
  11. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
  12. Semantics derived automatically from language corpora contain human-like biases. Science 356, 6334 (2017), 183–186.
  13. On Measuring Gender Bias in Translation of Gender-neutral Pronouns. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing. Association for Computational Linguistics, Florence, Italy, 173–181. https://doi.org/10.18653/v1/W19-3824
  14. Quantifying social biases in NLP: A generalization and empirical comparison of extrinsic fairness metrics. Transactions of the Association for Computational Linguistics 9 (2021), 1249–1267.
  15. On Measuring and Mitigating Biased Inferences of Word Embeddings. Proceedings of the AAAI Conference on Artificial Intelligence 34, 05 (Apr. 2020), 7659–7666. https://doi.org/10.1609/aaai.v34i05.6267
  16. Sunipa Dev and Jeff M. Phillips. 2019. Attenuating Bias in Word Vectors. CoRR abs/1901.07656 (2019). arXiv:1901.07656 http://arxiv.org/abs/1901.07656
  17. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
  18. Yanai Elazar and Yoav Goldberg. 2018. Adversarial Removal of Demographic Attributes from Text Data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 11–21. https://doi.org/10.18653/v1/D18-1002
  19. Jacob Feldman. 2015. There Are 922 Unisex Names in America — Is Yours One of Them? https://fivethirtyeight.com/features/there-are-922-unisex-names-in-america-is-yours-one-of-them/
  20. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned.
  21. Twitter sentiment classification using distant supervision. CS224N project report, Stanford 1, 12 (2009), 2009.
  22. Intrinsic Bias Metrics Do Not Correlate with Application Bias. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 1926–1940. https://doi.org/10.18653/v1/2021.acl-long.150
  23. SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context in Visual Question Answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). CVPR, New Orleans, USA, 5078–5088.
  24. Survey on Sociodemographic Bias in Natural Language Processing. arXiv preprint arXiv:2306.08158 (2023).
  25. “Fifty Shades of Bias”: Normative Ratings of Gender Bias in GPT Generated English Text. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 1862–1876. https://doi.org/10.18653/v1/2023.emnlp-main.115
  26. Helen. 2018. Very Large Language Models and How to Evaluate Them. https://huggingface.co/blog/zero-shot-eval-on-the-hub
  27. Reducing Sentiment Bias in Language Models via Counterfactual Evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 65–83. https://doi.org/10.18653/v1/2020.findings-emnlp.7
  28. Christina Jewett. 2023. Doctors Wrestle With A.I. in Patient Care, Citing Lax Oversight. https://www.nytimes.com/2023/10/30/health/doctors-ai-technology-health-care.html
  29. Jigsaw. 2018. Jigsaw Toxic Comment Classification Challenge. https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
  30. Svetlana Kiritchenko and Saif Mohammad. 2018. Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics. Association for Computational Linguistics, New Orleans, Louisiana, 43–53. https://doi.org/10.18653/v1/S18-2005
  31. Measuring Bias in Contextualized Word Representations. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing. Association for Computational Linguistics, Florence, Italy, 166–172. https://doi.org/10.18653/v1/W19-3823
  32. Zero-Shot Relation Extraction via Reading Comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). Association for Computational Linguistics, Vancouver, Canada, 333–342. https://doi.org/10.18653/v1/K17-1034
  33. Comparing Biases and the Impact of Multilingual Training across Multiple Languages. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 10260–10280. https://doi.org/10.18653/v1/2023.emnlp-main.634
  34. Out of Context: Investigating the Bias and Fairness Concerns of “Artificial Intelligence as a Service”. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, 1–17.
  35. UNQOVERing Stereotyping Biases via Underspecified Questions. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 3475–3489. https://doi.org/10.18653/v1/2020.findings-emnlp.311
  36. Holistic Evaluation of Language Models. Transactions on Machine Learning Research (2023). https://openreview.net/forum?id=iO4LZibEqW
  37. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). European Language Resources Association (ELRA), Reykjavik, Iceland, 216–223. http://www.lrec-conf.org/proceedings/lrec2014/pdf/363_Paper.pdf
  38. Hatexplain: A benchmark dataset for explainable hate speech detection. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. AAAI, online, 14867–14875.
  39. Crowdsourcing Question-Answer Meaning Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics, New Orleans, Louisiana, 560–568. https://doi.org/10.18653/v1/N18-2089
  40. MCScript: A Novel Dataset for Assessing Machine Comprehension Using Script Knowledge. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan. https://aclanthology.org/L18-1564
  41. BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022. Association for Computational Linguistics, Dublin, Ireland, 2086–2105. https://doi.org/10.18653/v1/2022.findings-acl.165
  42. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116 (2023). arXiv:2306.01116 https://arxiv.org/abs/2306.01116
  43. Red Teaming Language Models with Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 3419–3448. https://doi.org/10.18653/v1/2022.emnlp-main.225
  44. Perturbation Sensitivity Analysis to Detect Unintended Model Biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 5740–5745. https://doi.org/10.18653/v1/D19-1578
  45. MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, David Yarowsky, Timothy Baldwin, Anna Korhonen, Karen Livescu, and Steven Bethard (Eds.). Association for Computational Linguistics, Seattle, Washington, USA, 193–203. https://aclanthology.org/D13-1020
  46. Gender Bias in Coreference Resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Marilyn Walker, Heng Ji, and Amanda Stent (Eds.). Association for Computational Linguistics, New Orleans, Louisiana, 8–14. https://doi.org/10.18653/v1/N18-2002
  47. DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, 1683–1693. https://doi.org/10.18653/v1/P18-1156
  48. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. Commun. ACM 64, 9 (aug 2021), 99–106. https://doi.org/10.1145/3474381
  49. Multitask Prompted Training Enables Zero-Shot Task Generalization. arXiv:2110.08207 [cs.LG]
  50. Social IQa: Commonsense Reasoning about Social Interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 4463–4473. https://doi.org/10.18653/v1/D19-1454
  51. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100 (2022).
  52. A proposal for identifying and managing bias in artificial intelligence. Draft NIST Special Publication 1270 (2021).
  53. The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Toronto, Canada, 1373–1386. https://aclanthology.org/2023.acl-short.118
  54. Quantifying Social Biases Using Templates is Unreliable. In Workshop on Trustworthy and Socially Responsible Machine Learning, NeurIPS 2022. https://openreview.net/forum?id=rIhzjia7SLa
  55. “I’m sorry to hear that”: Finding New Biases in Language Models with a Holistic Descriptor Dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 9180–9211. https://aclanthology.org/2022.emnlp-main.625
  56. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA, 1631–1642. https://aclanthology.org/D13-1170
  57. Evaluating Gender Bias in Machine Translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 1679–1684. https://doi.org/10.18653/v1/P19-1164
  58. Yi Chern Tan and L. Elisa Celis. 2019. Assessing Social and Intersectional Biases in Contextualized Word Representations. In Advances in Neural Information Processing Systems, Vol. 32. Curran Associates, Inc., Red Hook, NY, USA. https://proceedings.neurips.cc/paper/2019/hash/201d546992726352471cfea6b0df0a48-Abstract.html
  59. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
  60. A study of implicit bias in pretrained language models against people with disabilities. In Proceedings of the 29th International Conference on Computational Linguistics. International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 1324–1332.
  61. Automated Ableism: An Exploration of Explicit Disability Biases in Sentiment and Toxicity Analysis Models. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023). Association for Computational Linguistics, Toronto, Canada, 26–34.
  62. Pranav Narayanan Venkit and Shomir Wilson. 2021. Identification of bias against people with disabilities in sentiment analysis and toxicity detection models. arXiv preprint arXiv:2111.13259 (2021).
  63. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems 32, Article 294 (2019), 15 pages.
  64. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Tal Linzen, Grzegorz Chrupała, and Afra Alishahi (Eds.). Association for Computational Linguistics, Brussels, Belgium, 353–355. https://doi.org/10.18653/v1/W18-5446
  65. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, Brussels, Belgium, 353–355. https://doi.org/10.18653/v1/W18-5446
  66. Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns. Transactions of the Association for Computational Linguistics 6 (2018), 605–617. https://doi.org/10.1162/tacl_a_00240
  67. Measuring and Reducing Gendered Correlations in Pre-trained Models. http://arxiv.org/abs/2010.06032 arXiv:2010.06032 [cs].
  68. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1502.05698
  69. TWEETQA: A Social Media Focused Question Answering Dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 5020–5031. https://doi.org/10.18653/v1/P19-1496
  70. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022).
  71. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations. ICLR, Online. https://openreview.net/forum?id=SkeHuCVFDr
  72. Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics, New Orleans, Louisiana, 15–20. https://doi.org/10.18653/v1/N18-2003
  73. Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity. arXiv preprint arXiv:2301.12867 (2023).
Authors (5)
  1. Vipul Gupta
  2. Pranav Narayanan Venkit
  3. Hugo Laurençon
  4. Shomir Wilson
  5. Rebecca J. Passonneau