
Taxonomy-based CheckList for Large Language Model Evaluation

Published 15 Dec 2023 in cs.CL | (2402.10899v1)

Abstract: As LLMs are used in many downstream tasks, their internal stereotypical representations may affect the fairness of their outputs. In this work, we introduce human knowledge into natural language interventions and study the behavior of pre-trained LMs in the context of gender bias. Inspired by CheckList behavioral testing, we present a checklist-style task that aims to probe and quantify LMs' unethical behaviors through question answering (QA). We design three comparison studies to evaluate LMs from four aspects: consistency, biased tendency, model preference, and gender preference switch. We probe one transformer-based QA model trained on the SQuAD-v2 dataset and one autoregressive LLM. Our results indicate that the transformer-based QA model's biased tendency correlates positively with its consistency, whereas the LLM shows the opposite relation. Our proposed task provides the first dataset that incorporates human knowledge for LLM bias evaluation.

References (39)
  1. A distributional study of negated adjectives and antonyms. CEUR Workshop Proceedings.
  2. Logic-guided data augmentation and regularization for consistent question answering. arXiv preprint arXiv:2004.10157.
  3. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems, 29.
  4. Identifying and reducing gender bias in word-level language models. arXiv preprint arXiv:1904.03035.
  5. Center, O. R. 2019. National Center for O*NET Development. Data retrieved from World Development Indicators, https://www.onetcenter.org/taxonomy.html#oca.
  6. Measuring gender bias in word embeddings across domains and discovering new gender bias word categories. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, 25–32.
  7. Transformers as soft reasoners over language. arXiv preprint arXiv:2002.05867.
  8. On measuring and mitigating biased inferences of word embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 7659–7666.
  9. OSCaR: Orthogonal subspace correction and rectification of biases in word embeddings. arXiv preprint arXiv:2007.00049.
  10. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  11. The turking test: Can language models understand instructions? arXiv preprint arXiv:2010.11982.
  12. Moral stories: Situated reasoning about norms, intents, actions, and their consequences. arXiv preprint arXiv:2012.15738.
  13. Evaluating models’ local decision boundaries via contrast sets. arXiv preprint arXiv:2004.02709.
  14. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16): E3635–E3644.
  15. Cognitive mechanisms for transitive inference performance in rhesus monkeys: measuring the influence of associative strength and inferred order. Journal of Experimental Psychology: Animal Behavior Processes, 38(4): 331.
  16. BECEL: Benchmark for Consistency Evaluation of Language Models. In Proceedings of the 29th International Conference on Computational Linguistics, 3680–3696.
  17. UNQOVERing stereotyping biases via underspecified questions. arXiv preprint arXiv:2010.02428.
  18. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR, abs/1907.11692.
  19. Natural instructions: Benchmarking generalization to new tasks from natural language instructions. arXiv preprint arXiv:2104.08773, 839–849.
  20. N., S. M. 2013. Behavioral Consistency.
  21. OpenAI. 2023. GPT3.5-turbo.
  22. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 1532–1543.
  23. Deep Contextualized Word Representations. In Walker, M.; Ji, H.; and Stent, A., eds., Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2227–2237. New Orleans, Louisiana: Association for Computational Linguistics.
  24. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  25. Null it out: Guarding protected attributes by iterative nullspace projection. arXiv preprint arXiv:2004.07667.
  26. Linguistic models for analyzing and detecting biased language. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1650–1659.
  27. Beyond accuracy: Behavioral testing of NLP models with CheckList. arXiv preprint arXiv:2005.04118.
  28. Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301.
  29. Thinking like a skeptic: Defeasible inference in natural language. In Findings of the Association for Computational Linguistics: EMNLP 2020, 4661–4675.
  30. Few-shot text generation with pattern-exploiting training. arXiv preprint arXiv:2012.11926.
  31. Towards controllable biases in language generation. arXiv preprint arXiv:2005.00268.
  32. Evaluating gender bias in machine translation. arXiv preprint arXiv:1906.00591.
  33. Assessing social and intersectional biases in contextualized word representations. Advances in neural information processing systems, 32.
  34. Universal adversarial triggers for attacking and analyzing NLP. arXiv preprint arXiv:1908.07125.
  35. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35: 24824–24837.
  36. Learning from task descriptions. arXiv preprint arXiv:2011.08115.
  37. Zero-shot learning by generating task-specific adapters. arXiv preprint arXiv:2101.00420.
  38. Ethical-advice taker: Do language models understand natural language interventions? arXiv preprint arXiv:2106.01465.
  39. Gender bias in coreference resolution: Evaluation and debiasing methods. arXiv preprint arXiv:1804.06876.

Summary

  • The paper introduces a checklist-based evaluation leveraging the O*NET-SOC taxonomy to measure gender bias in LLMs.
  • It applies Chain-of-Thought prompting to decompose complex queries and enhance probing for model consistency and bias.
  • Experimental results reveal RoBERTa-large's stable bias patterns versus GPT-3.5-turbo’s partial bias mitigation with added context.

Taxonomy-Based CheckList for LLM Evaluation

The paper "Taxonomy-based CheckList for LLM Evaluation" (2402.10899) introduces a methodology for evaluating gender bias in LLMs using a checklist-style task that incorporates human knowledge from the O*NET-SOC taxonomy. This approach aims to probe and quantify unethical behaviors in LLMs through question-answering, focusing on consistency, biased tendency, model preference, and gender preference switch. The study compares a transformer-based QA model (RoBERTa-large) and an autoregressive LLM (GPT-3.5-turbo-instruct), revealing contrasting relationships between consistency and bias.

Methodology

Chain-of-Thought CheckList

The paper builds upon the CheckList behavioral testing framework, which uses a matrix to list capabilities and test types, enabling comparison across different models [ribeiro2020beyond]. To address the limitations of "shallow" questions in probing biased behavior, the authors incorporate Chain-of-Thought (CoT) prompting, which breaks down complex tasks into logical steps [wei2022chain]. Instead of relying solely on CoT, the paper introduces human knowledge as part of the query to probe LLMs for deeper understanding and behaviors.
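As a reference point, the CheckList framework organizes probes as a matrix of capabilities crossed with test types, each cell holding templated test cases. A minimal sketch of that structure (illustrative only, not the paper's code — the capability and template names below are assumptions):

```python
# CheckList-style test matrix sketch: (capability, test type) -> templates.
# Capability names and templates here are illustrative assumptions.
checklist = {
    ("gender fairness", "invariance"): [
        "{name} is a nurse.",  # the prediction should not flip with the name
    ],
    ("gender fairness", "minimum functionality"): [
        "Is {name} qualified to be a pilot?",
    ],
}

def expand(templates, names):
    """Instantiate each template once per subject name."""
    return [t.format(name=n) for t in templates for n in names]

probes = expand(checklist[("gender fairness", "invariance")], ["Mary", "John"])
# probes -> ["Mary is a nurse.", "John is a nurse."]
```

Running a model over every cell's expanded probes and comparing outputs across names is what makes the matrix a behavioral test rather than a single benchmark score.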

Taxonomy-Based Context

The authors argue that simply appending interventions to questions may not be sufficient for LLMs to provide accurate answers [zhao2021ethical]. Inspired by CoT, they decompose a model's behavior on a question into behaviors on the question's attributes, using the O*NET-SOC taxonomy to provide additional context. The taxonomy includes detailed occupational titles, duty descriptions, and required attributes, allowing a more nuanced understanding of model behavior.
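The attribute-level decomposition can be sketched as follows. Taxonomy entries and question wordings here are illustrative assumptions, not the paper's actual templates:

```python
# Hypothetical O*NET-style taxonomy fragment (entries are illustrative).
TAXONOMY = {
    "surgeon": {
        "skill": ["critical thinking", "complex problem solving"],
        "knowledge": ["medicine and dentistry"],
        "ability": ["arm-hand steadiness"],
    }
}

def build_queries(occupation, subject_a, subject_b):
    """Expand one base occupation question into per-attribute sub-queries."""
    base = f"Who is more likely to be a {occupation}, {subject_a} or {subject_b}?"
    attribute_queries = [
        f"Who is better at {attr}, {subject_a} or {subject_b}?"
        for category in TAXONOMY[occupation].values()
        for attr in category
    ]
    return base, attribute_queries

base, subs = build_queries("surgeon", "Mary", "John")
# base asks about the occupation; subs probe each skill/knowledge/ability
```

Comparing a model's answer to the base question against its answers to the attribute sub-queries is what lets the method test whether a stated preference is grounded in the occupation's actual requirements.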

Dataset Construction

The dataset consists of 70 female-male pairs and 62 occupational titles from the O*NET-SOC Taxonomy [ONET2019]. Three common attribute categories—skill, knowledge, and ability—are used to build the dataset. The dataset is structured into two parts: the first part establishes the relationship between attributes and subjects, while the second part filters the output and appends relevant attributes to the context based on the question. Three types of questions are used for comparison purposes: binary, single subject, and multiple subjects.
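The dataset construction can be sketched by crossing name pairs with occupational titles and emitting the three question types. Templates and names below are assumptions for illustration; the real dataset uses 70 pairs and 62 titles:

```python
from itertools import product

# Tiny illustrative stand-ins for the paper's 70 pairs and 62 titles.
NAME_PAIRS = [("Mary", "John"), ("Linda", "James")]
OCCUPATIONS = ["surgeon", "kindergarten teacher"]

def build_rows(pairs, occupations):
    """Cross every female-male pair with every occupation, producing the
    three question types (wording is assumed, not the paper's templates)."""
    rows = []
    for (female, male), occ in product(pairs, occupations):
        rows.append({
            "occupation": occ,
            "binary": f"Is {female} qualified to be a {occ}?",
            "single_subject": f"Who is qualified to be a {occ}? Consider {female}.",
            "multiple_subjects": f"Who is qualified to be a {occ}, {female} or {male}?",
        })
    return rows

rows = build_rows(NAME_PAIRS, OCCUPATIONS)  # 2 pairs x 2 titles = 4 rows
```

The cross product over the full taxonomy is what yields enough paired queries to measure preference switches between the two gendered subjects.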

Experiments and Results

Experimental Setup

The experiments evaluate RoBERTa-large, fine-tuned on SQuAD-v2 [rajpurkar2016squad], and GPT-3.5-turbo-instruct [gpt352023] in zero-shot settings. The input prompt includes the base context, attributes, and question. The evaluation focuses on logical consistency, model preference, model bias, and gender preference switch. Logical consistency is assessed using additive consistency, where the model's preference is evaluated by combining base queries with attribute contexts.

Figure 1: Aggregated average scores across gendered names for different aspects: consistency, bias, model preference of female, model preference of male, female switch to male, male switch to female.
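One plausible reading of additive consistency (the exact metric definition is not reproduced here, so this is a hedged sketch): a model is consistent on a base query to the extent that its preference survives when attribute contexts are appended to the prompt.

```python
def additive_consistency(base_pred, attr_preds):
    """Fraction of attribute-augmented queries whose predicted subject
    matches the prediction on the bare base query. This formulation is
    an assumption, not the paper's exact definition."""
    if not attr_preds:
        return 0.0
    return sum(p == base_pred for p in attr_preds) / len(attr_preds)

# Base query preferred "female"; 3 of 4 attribute-augmented queries agree.
score = additive_consistency("female", ["female", "male", "female", "female"])
# score == 0.75
```

A score near 1.0 under this reading means the attribute context does not move the model, which is exactly the behavior the paper reports for RoBERTa-large.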

Key Findings

The results indicate that RoBERTa-large exhibits a positive correlation between consistency and bias, suggesting that its predictions remain unchanged even with additional attribute context. In contrast, GPT-3.5-turbo shows an overall mitigation of biased tendencies, changing its behavior after additional context is added. However, GPT-3.5-turbo demonstrates preferences for certain gendered groups when it does not consider either subject qualified for an occupation. Further analysis reveals that GPT-3.5-turbo tends to favor male subjects for stereotypical masculine jobs and female subjects for some professions, particularly in politics and arts. These findings suggest that the alignment techniques in GPT-3.5-turbo partially mitigate bias, particularly for female-related attributes.
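The reported consistency-bias relation can be illustrated as a simple Pearson correlation over per-occupation scores. The numbers below are made up for illustration and are not the paper's results:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Fabricated per-occupation scores, shaped like the RoBERTa-large pattern:
# more consistent occupations are also more biased, so r is strongly positive.
consistency = [0.9, 0.8, 0.95, 0.7]
bias        = [0.6, 0.5, 0.7, 0.4]
r = pearson(consistency, bias)
```

Under the paper's findings, the same computation on GPT-3.5-turbo's scores would yield a negative r, since added context tends to change its answers and reduce bias.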

The paper references prior research on bias in NLP, including studies on vector representations, downstream tasks, and benchmark datasets [pennington2014glove, peters-etal-2018-deep, devlin2018bert, rudinger2018gender, stanovsky2019evaluating, li2020unqovering, zhao2021ethical]. It also acknowledges works on model adaptation based on input changes and the ability of models to adjust confidence levels upon observing new information [wallace2019universal, gardner2020evaluating, emelin2020moral, ye2021zero, schick2020few, sheng2020towards, rudinger2020thinking, clark2020transformers, weller2020learning, efrat2020turking, mishra2021natural]. The authors position their work as a novel approach to comparative bias evaluation by incorporating human knowledge taxonomy.

Conclusion

The study presents a framework for bias evaluation in QA settings, using a human-knowledge taxonomy to compare transformer-based models and LLMs. The framework probes implicit bias and evaluates logical consistency, revealing insights into the behavior of RoBERTa and GPT-3.5-turbo. The authors acknowledge the limitations of their work, including the binary view of gender and the Western-specific structure of the occupation taxonomy. Future work will focus on comparing open-source LLMs and examining which gendered groups models are biased against.
