
GPT is Not an Annotator: The Necessity of Human Annotation in Fairness Benchmark Construction (2405.15760v1)

Published 24 May 2024 in cs.CL and cs.CY

Abstract: Social biases in LLMs are usually measured via bias benchmark datasets. Current benchmarks have limitations in scope, grounding, quality, and human effort required. Previous work has shown success with a community-sourced, rather than crowd-sourced, approach to benchmark development. However, this work still required considerable effort from annotators with relevant lived experience. This paper explores whether an LLM (specifically, GPT-3.5-Turbo) can assist with the task of developing a bias benchmark dataset from responses to an open-ended community survey. We also extend the previous work to a new community and set of biases: the Jewish community and antisemitism. Our analysis shows that GPT-3.5-Turbo has poor performance on this annotation task and produces unacceptable quality issues in its output. Thus, we conclude that GPT-3.5-Turbo is not an appropriate substitute for human annotation in sensitive tasks related to social biases, and that its use actually negates many of the benefits of community-sourcing bias benchmarks.

The paper "GPT is Not an Annotator: The Necessity of Human Annotation in Fairness Benchmark Construction" by Virginia K. Felkner, Jennifer A. Thompson, and Jonathan May, critically assesses the effectiveness of LLMs like GPT-3.5-Turbo in assisting with the creation of bias benchmarks. This work specifically focuses on antisemitism within the Jewish community, resulting in the introduction of the WinoSemitism dataset. It draws its methodologies from earlier efforts such as WinoQueer, which involved community-sourced bias benchmarks for LGBTQ+ populations. The paper stresses that despite advancements in LLMs, human expertise remains indispensable for developing high-quality benchmarks for bias measurement.

Key Contributions and Methods

The contributions of this paper are multifaceted:

  1. Introduction of the WinoSemitism Benchmark: The authors curated a community-sourced dataset aimed at evaluating antisemitic biases in LLMs. This dataset was developed through meticulous community involvement to establish stereotypes and biases pertinent to the Jewish community.
  2. Development of GPT-Variant Datasets: Named GPT-WinoQueer and GPT-WinoSemitism, these datasets employ predicates extracted by GPT-3.5-Turbo instead of human annotation.
  3. Quantitative and Qualitative Analysis: A comparison between the human-annotated and GPT-annotated datasets demonstrated significant quality issues when using GPT-3.5-Turbo for predicate extraction.

Processes and Evaluation

Jewish Community Survey

The authors conducted a survey targeting Jewish respondents to elicit stereotypes and biases. The survey, conducted over ten weeks, received 203 responses predominantly from white, Ashkenazi, English-speaking, and U.S.-based participants. The survey involved demographic questions, multiple-choice questions about stereotypes, and open-ended questions for additional stereotypes.

Construction of WinoSemitism Benchmark

The WinoSemitism benchmark was constructed using:

  • Sentence Templates: Structures to frame the sentences.
  • Identity Descriptors and Counterfactual Descriptors: Labels like "Jewish people" and non-Jewish identifiers such as "Christian".
  • Common Jewish Names and Attested Harm Predicates: names commonly associated with the Jewish community, and harm predicates drawn from survey responses and adjusted for syntactic correctness.

These elements formed pairs of stereotypical and counter-stereotypical sentences, yielding 58,816 sentence pairs, such as "All Jews are greedy" and "All Christians are greedy".
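
To make the construction concrete, below is a minimal Python sketch of how such template filling could generate stereotypical/counter-stereotypical pairs. The template string, descriptors, and predicate are illustrative placeholders, not the paper's actual resources.

```python
from itertools import product

# Illustrative resources only; the real benchmark uses community-sourced
# predicates and a much larger set of templates, descriptors, and names.
templates = ["{subject} are {predicate}."]
identity_descriptors = ["Jewish people"]
counterfactual_descriptors = ["Christian people"]
predicates = ["greedy"]  # attested harm predicates come from survey responses

def build_pairs(templates, identities, counterfactuals, predicates):
    """Cross templates, descriptors, and predicates to form
    (stereotypical, counter-stereotypical) sentence pairs."""
    pairs = []
    for tmpl, ident, counter, pred in product(
            templates, identities, counterfactuals, predicates):
        stereo = tmpl.format(subject=ident, predicate=pred)
        counter_stereo = tmpl.format(subject=counter, predicate=pred)
        pairs.append((stereo, counter_stereo))
    return pairs

for stereo, counter in build_pairs(templates, identity_descriptors,
                                   counterfactual_descriptors, predicates):
    print(stereo, "|", counter)
# Jewish people are greedy. | Christian people are greedy.
```

The pair count grows multiplicatively with the number of templates, descriptors, names, and predicates, which is how a modest set of community-attested predicates expands into tens of thousands of sentence pairs.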

GPT Extraction and Benchmark Construction

To test whether human annotation effort could be reduced, GPT-3.5-Turbo was employed to extract harm predicates from the survey data. These predicates were inserted into the same sentence templates used for the human-annotated datasets, forming the GPT-WinoQueer and GPT-WinoSemitism datasets. The model yielded predicates that differed substantially from the human-extracted ones in quality and accuracy.
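
As a rough illustration of this setup (not the authors' actual prompt or pipeline), predicate extraction with GPT-3.5-Turbo could be scripted along these lines; the prompt wording, output parsing, and the extract_predicates helper are assumptions made for the sketch.

```python
from openai import OpenAI  # assumes the openai Python package (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_predicates(survey_response: str) -> list[str]:
    """Ask GPT-3.5-Turbo to pull short harm predicates out of one
    free-text survey response. Prompt wording is illustrative."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system",
             "content": ("Extract the stereotypes mentioned in the survey "
                         "response as short predicates, one per line.")},
            {"role": "user", "content": survey_response},
        ],
    )
    text = completion.choices[0].message.content
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]

# Example usage with a made-up survey response:
# extract_predicates("People assume we are all greedy and control the media.")
```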

Results

WinoSemitism Baseline

The WinoSemitism benchmark was evaluated on 20 LLMs from seven model families (BERT, RoBERTa, ALBERT, BART, GPT-2, BLOOM, and OPT), and the baseline results revealed notable antisemitic bias. For example, average bias scores were:

  • BERT: 69.53
  • RoBERTa: 66.51
  • GPT-2: 70.11
  • BLOOM: 70.31

Since a score of 50 would indicate no preference between stereotypical and counter-stereotypical sentences, these results suggest the models are considerably biased: they are more than twice as likely to apply antisemitic stereotypes to Jewish subjects as to non-Jewish ones.
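
The bias score follows a WinoQueer/CrowS-Pairs-style protocol: it is the percentage of sentence pairs for which the model assigns higher likelihood to the stereotypical sentence. A simplified sketch for an autoregressive model such as GPT-2 is shown below; the paper's exact scoring (e.g., pseudo-log-likelihood for masked models like BERT) may differ in detail.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Simplified scoring sketch, not the authors' evaluation code.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_log_likelihood(sentence: str) -> float:
    """Total log-probability of the sentence under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token.
    return -out.loss.item() * (ids.size(1) - 1)

def bias_score(pairs) -> float:
    """Percentage of pairs where the stereotypical sentence is more likely.
    50 means no preference; higher means more biased."""
    wins = sum(sentence_log_likelihood(stereo) > sentence_log_likelihood(counter)
               for stereo, counter in pairs)
    return 100.0 * wins / len(pairs)

# bias_score([("All Jews are greedy.", "All Christians are greedy.")])
```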

Predicate Extraction Performance

The GPT-extracted predicates suffered from multiple quality issues:

  1. Low Exact Match Rates: Exact matches between human- and GPT-extracted predicates were rare, e.g., only 18.14% for WinoSemitism (a sketch of how such comparisons might be quantified follows this list).
  2. Semantic and Syntactic Errors: A substantial fraction of GPT-extracted predicates required syntactic correction or misrepresented the meaning of the original survey responses.
  3. Hallucinations: Notable instances where GPT generated stereotypes not present in the survey data.
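
As a sketch of how such comparisons might be quantified (an assumed setup, not the authors' evaluation code), exact match and embedding-based similarity between human- and GPT-extracted predicates can be computed as follows. Sentence-BERT appears in the paper's references; the specific model name and example predicates below are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative aligned predicate lists (human vs. GPT extraction).
human_preds = ["are greedy", "control the media"]
gpt_preds = ["are greedy", "run Hollywood"]

def exact_match_rate(human, gpt) -> float:
    """Percent of aligned predicate pairs that match exactly (case-insensitive)."""
    matches = sum(h.strip().lower() == g.strip().lower()
                  for h, g in zip(human, gpt))
    return 100.0 * matches / len(human)

model = SentenceTransformer("all-MiniLM-L6-v2")
emb_h = model.encode(human_preds, convert_to_tensor=True)
emb_g = model.encode(gpt_preds, convert_to_tensor=True)
# Diagonal of the cosine-similarity matrix: similarity of aligned pairs.
cosine = util.cos_sim(emb_h, emb_g).diagonal()

print(f"exact match rate: {exact_match_rate(human_preds, gpt_preds):.1f}%")
print("pairwise cosine similarities:", cosine.tolist())
```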

Implications and Future Directions

Practical Implications

The findings indicate that GPT-3.5-Turbo cannot effectively replace human annotators, as it fails to capture nuanced, context-specific biases accurately. The model often misrepresents or invents biases, which is particularly harmful in the high-stakes setting of bias benchmark construction. Consequently, using LLMs for bias annotation could inadvertently produce misleading and inaccurate benchmarks, negating the purpose of community-sourcing them.

Theoretical Implications and Speculations for AI Development

From a theoretical perspective, the paper highlights the critical role of human annotators in tasks requiring deep contextual understanding and sensitivity. It suggests that while LLMs have advanced in many areas, they still fall short on nuanced, contextually rich tasks. Future AI development could focus on improving the contextual understanding of these models to better handle such sensitive tasks. However, the importance of ethical considerations, particularly in AI applications related to social biases, cannot be overstated.

Conclusion

This research paper underlines the essential role of human annotation in creating high-quality, community-grounded fairness benchmarks for LLMs. Despite the potential of LLMs like GPT-3.5-Turbo, their use in extracting stereotypes for benchmark construction introduces significant inaccuracies and biases. Therefore, expert human involvement remains crucial, especially in tasks heavily reliant on contextual and sensitive understanding. This work sheds light on the limitations of current AI tools in domains requiring deep human insight and reaffirms the necessity of expert human intervention in constructing fairness benchmarks.

References (22)
  1. Language (technology) is power: A critical survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, Online. Association for Computational Linguistics.
  2. Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1004–1015, Online. Association for Computational Linguistics.
  3. Theory-grounded measurement of U.S. social stereotypes in English language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1276–1295, Seattle, United States. Association for Computational Linguistics.
  4. “Subverting the Jewtocracy”: Online antisemitism detection using multimodal deep learning. In Proceedings of the 13th ACM Web Science Conference 2021, WebSci ’21, page 148–157, New York, NY, USA. Association for Computing Machinery.
  5. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  6. Bryan Dosono and Bryan Semaan. 2019. Moderation practices as emotional labor in sustaining online communities: The case of aapi identity work on reddit. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, page 1–13, New York, NY, USA. Association for Computing Machinery.
  7. WinoQueer: A community-in-the-loop benchmark for anti-LGBTQ+ bias in large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9126–9140, Toronto, Canada. Association for Computational Linguistics.
  8. Kirsten Fermaglich. 2018. A Rosenberg by Any Other Name: A History of Jewish Name Changing in America, volume 9. NYU Press.
  9. Detecting anti-Jewish messages on social media: Building an annotated corpus that can serve as a preliminary gold standard. In ICWSM Workshops.
  10. ALBERT: A Lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.
  11. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
  12. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
  13. I Lisa McCann and Laurie Anne Pearlman. 1990. Vicarious traumatization: A framework for understanding the psychological effects of working with victims. Journal of traumatic stress, 3:131–149.
  14. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5356–5371, Online. Association for Computational Linguistics.
  15. CrowS-pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967, Online. Association for Computational Linguistics.
  16. Language models are unsupervised multitask learners.
  17. Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
  18. Sociotechnical harms of algorithmic systems: Scoping a taxonomy for harm reduction. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’23, page 723–741, New York, NY, USA. Association for Computing Machinery.
  19. The psychological well-being of content moderators: The emotional labor of commercial moderation and avenues for improving support. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI ’21, New York, NY, USA. Association for Computing Machinery.
  20. Taxonomy of risks posed by language models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22, page 214–229, New York, NY, USA. Association for Computing Machinery.
  21. BigScience Workshop. 2022. BLOOM: A 176b-parameter open-access multilingual language model.
  22. OPT: Open pre-trained transformer language models.
Authors (3)
  1. Virginia K. Felkner (3 papers)
  2. Jennifer A. Thompson (1 paper)
  3. Jonathan May (76 papers)
Citations (6)