
Taxonomy-based CheckList for Large Language Model Evaluation

Published 15 Dec 2023 in cs.CL | (2402.10899v1)

Abstract: As LLMs are used in many downstream tasks, their internal stereotypical representations may affect the fairness of their outputs. In this work, we introduce human knowledge into natural language interventions and study the behavior of pre-trained LMs in the context of gender bias. Inspired by CheckList behavioral testing, we present a checklist-style task that aims to probe and quantify LMs' unethical behaviors through question answering (QA). We design three comparison studies to evaluate LMs from four aspects: consistency, biased tendency, model preference, and gender preference switch. We probe one transformer-based QA model trained on the SQuAD-v2 dataset and one autoregressive LLM. Our results indicate that the transformer-based QA model's biased tendency correlates positively with its consistency, whereas the LLM shows the opposite relation. Our proposed task provides the first dataset that incorporates human knowledge for LLM bias evaluation.

References (39)
  1. A distributional study of negated adjectives and antonyms. CEUR Workshop Proceedings.
  2. Logic-guided data augmentation and regularization for consistent question answering. arXiv preprint arXiv:2004.10157.
  3. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems, 29.
  4. Identifying and reducing gender bias in word-level language models. arXiv preprint arXiv:1904.03035.
  5. Center, O. R. 2019. National Center for O*NET Development. Data retrieved from World Development Indicators, https://www.onetcenter.org/taxonomy.html#oca.
  6. Measuring gender bias in word embeddings across domains and discovering new gender bias word categories. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, 25–32.
  7. Transformers as soft reasoners over language. arXiv preprint arXiv:2002.05867.
  8. On measuring and mitigating biased inferences of word embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 7659–7666.
  9. OSCaR: Orthogonal subspace correction and rectification of biases in word embeddings. arXiv preprint arXiv:2007.00049.
  10. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  11. The turking test: Can language models understand instructions? arXiv preprint arXiv:2010.11982.
  12. Moral stories: Situated reasoning about norms, intents, actions, and their consequences. arXiv preprint arXiv:2012.15738.
  13. Evaluating models’ local decision boundaries via contrast sets. arXiv preprint arXiv:2004.02709.
  14. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16): E3635–E3644.
  15. Cognitive mechanisms for transitive inference performance in rhesus monkeys: measuring the influence of associative strength and inferred order. Journal of Experimental Psychology: Animal Behavior Processes, 38(4): 331.
  16. BECEL: Benchmark for Consistency Evaluation of Language Models. In Proceedings of the 29th International Conference on Computational Linguistics, 3680–3696.
  17. UNQOVERing stereotyping biases via underspecified questions. arXiv preprint arXiv:2010.02428.
  18. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR, abs/1907.11692.
  19. Natural instructions: Benchmarking generalization to new tasks from natural language instructions. arXiv preprint arXiv:2104.08773, 839–849.
  20. N., S. M. 2013. Behavioral Consistency.
  21. OpenAI. 2023. GPT3.5-turbo.
  22. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 1532–1543.
  23. Deep Contextualized Word Representations. In Walker, M.; Ji, H.; and Stent, A., eds., Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2227–2237. New Orleans, Louisiana: Association for Computational Linguistics.
  24. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  25. Null it out: Guarding protected attributes by iterative nullspace projection. arXiv preprint arXiv:2004.07667.
  26. Linguistic models for analyzing and detecting biased language. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1650–1659.
  27. Beyond accuracy: Behavioral testing of NLP models with CheckList. arXiv preprint arXiv:2005.04118.
  28. Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301.
  29. Thinking like a skeptic: Defeasible inference in natural language. In Findings of the Association for Computational Linguistics: EMNLP 2020, 4661–4675.
  30. Few-shot text generation with pattern-exploiting training. arXiv preprint arXiv:2012.11926.
  31. Towards controllable biases in language generation. arXiv preprint arXiv:2005.00268.
  32. Evaluating gender bias in machine translation. arXiv preprint arXiv:1906.00591.
  33. Assessing social and intersectional biases in contextualized word representations. Advances in neural information processing systems, 32.
  34. Universal adversarial triggers for attacking and analyzing NLP. arXiv preprint arXiv:1908.07125.
  35. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35: 24824–24837.
  36. Learning from task descriptions. arXiv preprint arXiv:2011.08115.
  37. Zero-shot learning by generating task-specific adapters. arXiv preprint arXiv:2101.00420.
  38. Ethical-advice taker: Do language models understand natural language interventions? arXiv preprint arXiv:2106.01465.
  39. Gender bias in coreference resolution: Evaluation and debiasing methods. arXiv preprint arXiv:1804.06876.

Summary

  • The paper introduces a checklist-based evaluation leveraging the O*NET-SOC taxonomy to measure gender bias in LLMs.
  • It applies Chain-of-Thought prompting to decompose complex queries and enhance probing for model consistency and bias.
  • Experimental results reveal RoBERTa-large's stable bias patterns versus GPT-3.5-turbo’s partial bias mitigation with added context.

Taxonomy-Based CheckList for LLM Evaluation

The paper "Taxonomy-based CheckList for LLM Evaluation" (2402.10899) introduces a methodology for evaluating gender bias in LLMs using a checklist-style task that incorporates human knowledge from the O*NET-SOC taxonomy. This approach aims to probe and quantify unethical behaviors in LLMs through question-answering, focusing on consistency, biased tendency, model preference, and gender preference switch. The study compares a transformer-based QA model (RoBERTa-large) and an autoregressive LLM (GPT-3.5-turbo-instruct), revealing contrasting relationships between consistency and bias.

Methodology

Chain-of-Thought CheckList

The paper builds upon the CheckList behavioral testing framework, which uses a matrix to list capabilities and test types, enabling comparison across different models [ribeiro2020beyond]. To address the limitations of "shallow" questions in probing biased behavior, the authors incorporate Chain-of-Thought (CoT) prompting, which breaks down complex tasks into logical steps [wei2022chain]. Instead of relying solely on CoT, the paper introduces human knowledge as part of the query to probe LLMs for deeper understanding and behaviors.
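As a reference point, the CheckList framework organizes probes as a matrix of capabilities crossed with test types, each cell holding templated test cases. A minimal sketch of that structure (illustrative only, not the paper's code — the capability and template names below are assumptions):

```python
# CheckList-style test matrix sketch: (capability, test type) -> templates.
# Capability names and templates here are illustrative assumptions.
checklist = {
    ("gender fairness", "invariance"): [
        "{name} is a nurse.",  # the prediction should not flip with the name
    ],
    ("gender fairness", "minimum functionality"): [
        "Is {name} qualified to be a pilot?",
    ],
}

def expand(templates, names):
    """Instantiate each template once per subject name."""
    return [t.format(name=n) for t in templates for n in names]

probes = expand(checklist[("gender fairness", "invariance")], ["Mary", "John"])
# probes -> ["Mary is a nurse.", "John is a nurse."]
```

Running a model over every cell's expanded probes and comparing outputs across names is what makes the matrix a behavioral test rather than a single benchmark score.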

Taxonomy-Based Context

The authors argue that simply appending interventions to questions may not be sufficient for LLMs to provide accurate answers [zhao2021ethical]. Inspired by CoT, they decompose a model's behavior on a question into behaviors on the question's attributes, using the O*NET-SOC taxonomy to provide additional context. The taxonomy includes detailed occupational titles, duty descriptions, and required attributes, allowing a more nuanced understanding of model behavior.
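The attribute-level decomposition can be sketched as follows. Taxonomy entries and question wordings here are illustrative assumptions, not the paper's actual templates:

```python
# Hypothetical O*NET-style taxonomy fragment (entries are illustrative).
TAXONOMY = {
    "surgeon": {
        "skill": ["critical thinking", "complex problem solving"],
        "knowledge": ["medicine and dentistry"],
        "ability": ["arm-hand steadiness"],
    }
}

def build_queries(occupation, subject_a, subject_b):
    """Expand one base occupation question into per-attribute sub-queries."""
    base = f"Who is more likely to be a {occupation}, {subject_a} or {subject_b}?"
    attribute_queries = [
        f"Who is better at {attr}, {subject_a} or {subject_b}?"
        for category in TAXONOMY[occupation].values()
        for attr in category
    ]
    return base, attribute_queries

base, subs = build_queries("surgeon", "Mary", "John")
# base asks about the occupation; subs probe each skill/knowledge/ability
```

Comparing a model's answer to the base question against its answers to the attribute sub-queries is what lets the method test whether a stated preference is grounded in the occupation's actual requirements.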

Dataset Construction

The dataset consists of 70 female-male pairs and 62 occupational titles from the O*NET-SOC Taxonomy [ONET2019]. Three common attribute categories—skill, knowledge, and ability—are used to build the dataset. The dataset is structured into two parts: the first part establishes the relationship between attributes and subjects, while the second part filters the output and appends relevant attributes to the context based on the question. Three types of questions are used for comparison purposes: binary, single subject, and multiple subjects.
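The dataset construction can be sketched by crossing name pairs with occupational titles and emitting the three question types. Templates and names below are assumptions for illustration; the real dataset uses 70 pairs and 62 titles:

```python
from itertools import product

# Tiny illustrative stand-ins for the paper's 70 pairs and 62 titles.
NAME_PAIRS = [("Mary", "John"), ("Linda", "James")]
OCCUPATIONS = ["surgeon", "kindergarten teacher"]

def build_rows(pairs, occupations):
    """Cross every female-male pair with every occupation, producing the
    three question types (wording is assumed, not the paper's templates)."""
    rows = []
    for (female, male), occ in product(pairs, occupations):
        rows.append({
            "occupation": occ,
            "binary": f"Is {female} qualified to be a {occ}?",
            "single_subject": f"Who is qualified to be a {occ}? Consider {female}.",
            "multiple_subjects": f"Who is qualified to be a {occ}, {female} or {male}?",
        })
    return rows

rows = build_rows(NAME_PAIRS, OCCUPATIONS)  # 2 pairs x 2 titles = 4 rows
```

The cross product over the full taxonomy is what yields enough paired queries to measure preference switches between the two gendered subjects.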

Experiments and Results

Experimental Setup

The experiments evaluate RoBERTa-large, fine-tuned on SQuAD-v2 [rajpurkar2016squad], and GPT-3.5-turbo-instruct [gpt352023] in zero-shot settings. The input prompt includes the base context, attributes, and question. The evaluation focuses on logical consistency, model preference, model bias, and gender preference switch. Logical consistency is assessed using additive consistency, where the model's preference is evaluated by combining base queries with attribute contexts.

Figure 1: Aggregated average scores across gendered names for different aspects: consistency, bias, model preference of female, model preference of male, female switch to male, male switch to female.
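One plausible reading of additive consistency (the exact metric definition is not reproduced here, so this is a hedged sketch): a model is consistent on a base query to the extent that its preference survives when attribute contexts are appended to the prompt.

```python
def additive_consistency(base_pred, attr_preds):
    """Fraction of attribute-augmented queries whose predicted subject
    matches the prediction on the bare base query. This formulation is
    an assumption, not the paper's exact definition."""
    if not attr_preds:
        return 0.0
    return sum(p == base_pred for p in attr_preds) / len(attr_preds)

# Base query preferred "female"; 3 of 4 attribute-augmented queries agree.
score = additive_consistency("female", ["female", "male", "female", "female"])
# score == 0.75
```

A score near 1.0 under this reading means the attribute context does not move the model, which is exactly the behavior the paper reports for RoBERTa-large.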

Key Findings

The results indicate that RoBERTa-large exhibits a positive correlation between consistency and bias, suggesting that its predictions remain unchanged even with additional attribute context. In contrast, GPT-3.5-turbo shows an overall mitigation of biased tendencies, changing its behavior after additional context is added. However, GPT-3.5-turbo demonstrates preferences for certain gendered groups when it does not consider either subject qualified for an occupation. Further analysis reveals that GPT-3.5-turbo tends to favor male subjects for stereotypical masculine jobs and female subjects for some professions, particularly in politics and arts. These findings suggest that the alignment techniques in GPT-3.5-turbo partially mitigate bias, particularly for female-related attributes.
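The reported consistency-bias relation can be illustrated as a simple Pearson correlation over per-occupation scores. The numbers below are made up for illustration and are not the paper's results:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Fabricated per-occupation scores, shaped like the RoBERTa-large pattern:
# more consistent occupations are also more biased, so r is strongly positive.
consistency = [0.9, 0.8, 0.95, 0.7]
bias        = [0.6, 0.5, 0.7, 0.4]
r = pearson(consistency, bias)
```

Under the paper's findings, the same computation on GPT-3.5-turbo's scores would yield a negative r, since added context tends to change its answers and reduce bias.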

The paper references prior research on bias in NLP, including studies on vector representations, downstream tasks, and benchmark datasets [pennington2014glove, peters-etal-2018-deep, devlin2018bert, rudinger2018gender, stanovsky2019evaluating, li2020unqovering, zhao2021ethical]. It also acknowledges works on model adaptation based on input changes and the ability of models to adjust confidence levels upon observing new information [wallace2019universal, gardner2020evaluating, emelin2020moral, ye2021zero, schick2020few, sheng2020towards, rudinger2020thinking, clark2020transformers, weller2020learning, efrat2020turking, mishra2021natural]. The authors position their work as a novel approach to comparative bias evaluation by incorporating human knowledge taxonomy.

Conclusion

The study presents a framework for bias evaluation in QA settings, using a human-knowledge taxonomy to compare transformer-based models and LLMs. The framework probes implicit bias and evaluates logical consistency, revealing insights into the behavior of RoBERTa and GPT-3.5-turbo. The authors acknowledge the limitations of their work, including the binary view of gender and the Western-specific structure of the occupation taxonomy. Future work will focus on comparing open-source LLMs and examining which gendered groups models are biased against.
