
GPT is Not an Annotator: The Necessity of Human Annotation in Fairness Benchmark Construction (2405.15760v1)

Published 24 May 2024 in cs.CL and cs.CY

Abstract: Social biases in LLMs are usually measured via bias benchmark datasets. Current benchmarks have limitations in scope, grounding, quality, and human effort required. Previous work has shown success with a community-sourced, rather than crowd-sourced, approach to benchmark development. However, this work still required considerable effort from annotators with relevant lived experience. This paper explores whether an LLM (specifically, GPT-3.5-Turbo) can assist with the task of developing a bias benchmark dataset from responses to an open-ended community survey. We also extend the previous work to a new community and set of biases: the Jewish community and antisemitism. Our analysis shows that GPT-3.5-Turbo has poor performance on this annotation task and produces unacceptable quality issues in its output. Thus, we conclude that GPT-3.5-Turbo is not an appropriate substitute for human annotation in sensitive tasks related to social biases, and that its use actually negates many of the benefits of community-sourcing bias benchmarks.

The paper "GPT is Not an Annotator: The Necessity of Human Annotation in Fairness Benchmark Construction" by Virginia K. Felkner, Jennifer A. Thompson, and Jonathan May, critically assesses the effectiveness of LLMs like GPT-3.5-Turbo in assisting with the creation of bias benchmarks. This work specifically focuses on antisemitism within the Jewish community, resulting in the introduction of the WinoSemitism dataset. It draws its methodologies from earlier efforts such as WinoQueer, which involved community-sourced bias benchmarks for LGBTQ+ populations. The paper stresses that despite advancements in LLMs, human expertise remains indispensable for developing high-quality benchmarks for bias measurement.

Key Contributions and Methods

The contributions of this paper are multifaceted:

  1. Introduction of the WinoSemitism Benchmark: The authors curated a community-sourced dataset aimed at evaluating antisemitic biases in LLMs. This dataset was developed through meticulous community involvement to establish stereotypes and biases pertinent to the Jewish community.
  2. Development of GPT-Variant Datasets: Named GPT-WinoQueer and GPT-WinoSemitism, these datasets employ predicates extracted by GPT-3.5-Turbo instead of human annotation.
  3. Quantitative and Qualitative Analysis: A comparison between the human-annotated and GPT-annotated datasets demonstrated significant quality issues when using GPT-3.5-Turbo for predicate extraction.

Processes and Evaluation

Jewish Community Survey

The authors conducted a survey targeting Jewish respondents to elicit stereotypes and biases. The survey, conducted over ten weeks, received 203 responses predominantly from white, Ashkenazi, English-speaking, and U.S.-based participants. The survey involved demographic questions, multiple-choice questions about stereotypes, and open-ended questions for additional stereotypes.

Construction of WinoSemitism Benchmark

The WinoSemitism benchmark was constructed using:

  • Sentence Templates: Structures to frame the sentences.
  • Identity Descriptors and Counterfactual Descriptors: Labels like "Jewish people" and non-Jewish identifiers such as "Christian".
  • Common Jewish Names and Attested Harm Predicates: names commonly associated with the Jewish community, and harm predicates drawn from survey responses and adjusted for syntactic correctness.

These elements formed pairs of stereotypical and counter-stereotypical sentences, yielding 58,816 sentence pairs, such as "All Jews are greedy" and "All Christians are greedy".
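
To make the construction concrete, below is a minimal Python sketch of how such template filling could generate stereotypical/counter-stereotypical pairs. The template string, descriptors, and predicate are illustrative placeholders, not the paper's actual resources.

```python
from itertools import product

# Illustrative resources only; the real benchmark uses community-sourced
# predicates and a much larger set of templates, descriptors, and names.
templates = ["{subject} are {predicate}."]
identity_descriptors = ["Jewish people"]
counterfactual_descriptors = ["Christian people"]
predicates = ["greedy"]  # attested harm predicates come from survey responses

def build_pairs(templates, identities, counterfactuals, predicates):
    """Cross templates, descriptors, and predicates to form
    (stereotypical, counter-stereotypical) sentence pairs."""
    pairs = []
    for tmpl, ident, counter, pred in product(
            templates, identities, counterfactuals, predicates):
        stereo = tmpl.format(subject=ident, predicate=pred)
        counter_stereo = tmpl.format(subject=counter, predicate=pred)
        pairs.append((stereo, counter_stereo))
    return pairs

for stereo, counter in build_pairs(templates, identity_descriptors,
                                   counterfactual_descriptors, predicates):
    print(stereo, "|", counter)
# Jewish people are greedy. | Christian people are greedy.
```

The pair count grows multiplicatively with the number of templates, descriptors, names, and predicates, which is how a modest set of community-attested predicates expands into tens of thousands of sentence pairs.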

GPT Extraction and Benchmark Construction

To test whether human annotation effort could be reduced, GPT-3.5-Turbo was employed to extract harm predicates from the survey data. These predicates were inserted into the same sentence templates used for the human-annotated datasets, forming the GPT-WinoQueer and GPT-WinoSemitism datasets. The model yielded predicates that differed substantially from the human-extracted ones in quality and accuracy.
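
As a rough illustration of this setup (not the authors' actual prompt or pipeline), predicate extraction with GPT-3.5-Turbo could be scripted along these lines; the prompt wording, output parsing, and the extract_predicates helper are assumptions made for the sketch.

```python
from openai import OpenAI  # assumes the openai Python package (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_predicates(survey_response: str) -> list[str]:
    """Ask GPT-3.5-Turbo to pull short harm predicates out of one
    free-text survey response. Prompt wording is illustrative."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system",
             "content": ("Extract the stereotypes mentioned in the survey "
                         "response as short predicates, one per line.")},
            {"role": "user", "content": survey_response},
        ],
    )
    text = completion.choices[0].message.content
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]

# Example usage with a made-up survey response:
# extract_predicates("People assume we are all greedy and control the media.")
```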

Results

WinoSemitism Baseline

The WinoSemitism benchmark was evaluated on 20 LLMs from seven model families (BERT, RoBERTa, ALBERT, BART, GPT-2, BLOOM, and OPT), and the baseline results revealed notable antisemitic bias. For example, average bias scores were:

  • BERT: 69.53
  • RoBERTa: 66.51
  • GPT-2: 70.11
  • BLOOM: 70.31

Since a score of 50 would indicate no preference between stereotypical and counter-stereotypical sentences, these results suggest the models are considerably biased: they are more than twice as likely to apply antisemitic stereotypes to Jewish subjects as to non-Jewish ones.
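
The bias score follows a WinoQueer/CrowS-Pairs-style protocol: it is the percentage of sentence pairs for which the model assigns higher likelihood to the stereotypical sentence. A simplified sketch for an autoregressive model such as GPT-2 is shown below; the paper's exact scoring (e.g., pseudo-log-likelihood for masked models like BERT) may differ in detail.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Simplified scoring sketch, not the authors' evaluation code.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_log_likelihood(sentence: str) -> float:
    """Total log-probability of the sentence under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token.
    return -out.loss.item() * (ids.size(1) - 1)

def bias_score(pairs) -> float:
    """Percentage of pairs where the stereotypical sentence is more likely.
    50 means no preference; higher means more biased."""
    wins = sum(sentence_log_likelihood(stereo) > sentence_log_likelihood(counter)
               for stereo, counter in pairs)
    return 100.0 * wins / len(pairs)

# bias_score([("All Jews are greedy.", "All Christians are greedy.")])
```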

Predicate Extraction Performance

The GPT-extracted predicates suffered from multiple quality issues:

  1. Low Exact Match Rates: Exact matches between human- and GPT-extracted predicates were rare, e.g., only 18.14% for WinoSemitism (a sketch of how such comparisons might be quantified follows this list).
  2. Semantic and Syntactic Errors: A substantial fraction of GPT-extracted predicates required syntactic correction or misrepresented the meaning of the original survey responses.
  3. Hallucinations: Notable instances where GPT generated stereotypes not present in the survey data.
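
As a sketch of how such comparisons might be quantified (an assumed setup, not the authors' evaluation code), exact match and embedding-based similarity between human- and GPT-extracted predicates can be computed as follows. Sentence-BERT appears in the paper's references; the specific model name and example predicates below are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative aligned predicate lists (human vs. GPT extraction).
human_preds = ["are greedy", "control the media"]
gpt_preds = ["are greedy", "run Hollywood"]

def exact_match_rate(human, gpt) -> float:
    """Percent of aligned predicate pairs that match exactly (case-insensitive)."""
    matches = sum(h.strip().lower() == g.strip().lower()
                  for h, g in zip(human, gpt))
    return 100.0 * matches / len(human)

model = SentenceTransformer("all-MiniLM-L6-v2")
emb_h = model.encode(human_preds, convert_to_tensor=True)
emb_g = model.encode(gpt_preds, convert_to_tensor=True)
# Diagonal of the cosine-similarity matrix: similarity of aligned pairs.
cosine = util.cos_sim(emb_h, emb_g).diagonal()

print(f"exact match rate: {exact_match_rate(human_preds, gpt_preds):.1f}%")
print("pairwise cosine similarities:", cosine.tolist())
```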

Implications and Future Directions

Practical Implications

The findings indicate that GPT-3.5-Turbo cannot effectively replace human annotators, as it fails to capture nuanced, context-specific biases accurately. The model often misrepresents or invents biases, which is particularly harmful in the high-stakes setting of bias benchmark construction. Consequently, using LLMs for bias annotation could inadvertently produce misleading and inaccurate benchmarks, negating the purpose of community-sourcing them.

Theoretical Implications and Speculations for AI Development

From a theoretical perspective, the paper highlights the critical role of human annotators in tasks requiring deep contextual understanding and sensitivity. It suggests that while LLMs have advanced in many areas, they still fall short on nuanced, contextually rich tasks. Future AI development could focus on improving the contextual understanding of these models to better handle such sensitive tasks. However, the importance of ethical considerations, particularly in AI applications related to social biases, cannot be overstated.

Conclusion

This research paper underlines the essential role of human annotation in creating high-quality, community-grounded fairness benchmarks for LLMs. Despite the potential of LLMs like GPT-3.5-Turbo, their use in extracting stereotypes for benchmark construction introduces significant inaccuracies and biases. Therefore, expert human involvement remains crucial, especially in tasks heavily reliant on contextual and sensitive understanding. This work sheds light on the limitations of current AI tools in domains requiring deep human insight and reaffirms the necessity of expert human intervention in constructing fairness benchmarks.

References (22)
  1. Language (technology) is power: A critical survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, Online. Association for Computational Linguistics.
  2. Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1004–1015, Online. Association for Computational Linguistics.
  3. Theory-grounded measurement of U.S. social stereotypes in English language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1276–1295, Seattle, United States. Association for Computational Linguistics.
  4. “Subverting the Jewtocracy”: Online antisemitism detection using multimodal deep learning. In Proceedings of the 13th ACM Web Science Conference 2021, WebSci ’21, page 148–157, New York, NY, USA. Association for Computing Machinery.
  5. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  6. Bryan Dosono and Bryan Semaan. 2019. Moderation practices as emotional labor in sustaining online communities: The case of aapi identity work on reddit. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, page 1–13, New York, NY, USA. Association for Computing Machinery.
  7. WinoQueer: A community-in-the-loop benchmark for anti-LGBTQ+ bias in large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9126–9140, Toronto, Canada. Association for Computational Linguistics.
  8. Kirsten Fermaglich. 2018. A Rosenberg by Any Other Name: A History of Jewish Name Changing in America, volume 9. NYU Press.
  9. Detecting anti-Jewish messages on social media: Building an annotated corpus that can serve as a preliminary gold standard. In ICWSM Workshops.
  10. ALBERT: A Lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.
  11. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
  12. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
  13. I Lisa McCann and Laurie Anne Pearlman. 1990. Vicarious traumatization: A framework for understanding the psychological effects of working with victims. Journal of traumatic stress, 3:131–149.
  14. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5356–5371, Online. Association for Computational Linguistics.
  15. CrowS-pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967, Online. Association for Computational Linguistics.
  16. Language models are unsupervised multitask learners.
  17. Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
  18. Sociotechnical harms of algorithmic systems: Scoping a taxonomy for harm reduction. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’23, page 723–741, New York, NY, USA. Association for Computing Machinery.
  19. The psychological well-being of content moderators: The emotional labor of commercial moderation and avenues for improving support. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI ’21, New York, NY, USA. Association for Computing Machinery.
  20. Taxonomy of risks posed by language models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22, page 214–229, New York, NY, USA. Association for Computing Machinery.
  21. BigScience Workshop. 2022. BLOOM: A 176b-parameter open-access multilingual language model.
  22. OPT: Open pre-trained transformer language models.
Authors (3)
  1. Virginia K. Felkner (3 papers)
  2. Jennifer A. Thompson (1 paper)
  3. Jonathan May (76 papers)
Citations (6)