Stereotype Detection in LLMs: A Multiclass, Explainable, and Benchmark-Driven Approach (2404.01768v2)
Abstract: Stereotype detection is a challenging and subjective task, as certain statements, such as "Black people like to play basketball," may not appear overtly toxic but still reinforce racial stereotypes. With the increasing prevalence of LLMs in human-facing AI applications, detecting these biases is essential, since LLMs risk perpetuating and amplifying stereotypical outputs derived from their training data. A reliable stereotype detector is crucial for benchmarking bias, monitoring model inputs and outputs, filtering training data, and ensuring fairer model behavior in downstream applications. This paper introduces the Multi-Grain Stereotype (MGS) dataset, consisting of 51,867 instances covering gender, race, profession, religion, and other stereotypes, curated from multiple existing datasets. We evaluate a range of machine learning approaches to establish baselines and fine-tune language models of different architectures and sizes, presenting a suite of multiclass stereotype classifiers trained on the MGS dataset. Given the subjectivity of stereotypes, explainability is essential for aligning model learning with human understanding of stereotypes. We employ explainable AI (XAI) tools, including SHAP, LIME, and BertViz, to assess whether the model's learned patterns align with human intuitions about stereotypes. Additionally, we develop stereotype elicitation prompts and, using the best-performing classifiers, benchmark the presence of stereotypes in the text generated by popular LLMs.
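To make the classification setup concrete, here is a minimal sketch of how such a multiclass stereotype classifier could be queried through the Hugging Face `transformers` pipeline API. The checkpoint name `mgs-stereotype-classifier` and the label strings are placeholders, not the authors' released artifacts; substitute a model actually fine-tuned on MGS.

```python
# Minimal sketch: querying a multiclass stereotype classifier via the
# Hugging Face pipeline API. "mgs-stereotype-classifier" is a placeholder
# checkpoint name, not the paper's released model.
from transformers import pipeline

classifier = pipeline("text-classification", model="mgs-stereotype-classifier")

sentence = "Black people like to play basketball."
# top_k=None returns a score for every class, sorted in descending order.
for pred in classifier(sentence, top_k=None):
    # Each entry looks like {"label": "stereotype_race", "score": 0.93};
    # the label names are assumed here, following the MGS category scheme.
    print(f"{pred['label']}: {pred['score']:.3f}")
```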
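The explainability step can be illustrated in the same spirit. Below is a hedged sketch using LIME, one of the XAI tools named in the abstract, to attribute the classifier's decision to individual tokens. The `labels` list is an assumed MGS label set (the real names would come from the fine-tuned model's config), and `classifier` and `sentence` come from the previous snippet.

```python
# Sketch: token-level attribution with LIME, mirroring the paper's use of
# XAI tools to check that learned cues match human intuitions.
import numpy as np
from lime.lime_text import LimeTextExplainer

# Assumed label set; not the authors' exact class names.
labels = ["unrelated", "stereotype_gender", "stereotype_race",
          "stereotype_profession", "stereotype_religion"]

def predict_proba(texts):
    """Adapt pipeline output to the (n_samples, n_classes) matrix LIME expects."""
    matrix = []
    for preds in classifier(list(texts), top_k=None):
        scores = {p["label"]: p["score"] for p in preds}
        matrix.append([scores.get(name, 0.0) for name in labels])
    return np.array(matrix)

explainer = LimeTextExplainer(class_names=labels)
exp = explainer.explain_instance(sentence, predict_proba,
                                 num_features=6, top_labels=1)
top = exp.available_labels()[0]
print(labels[top], exp.as_list(label=top))  # (token, weight) pairs
```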
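Finally, the elicitation-based benchmarking could look roughly like the loop below: prompt a generative model, classify each completion, and report the fraction flagged as stereotypical. The prompts, the `gpt2` stand-in generator, and the `unrelated` class name are illustrative assumptions, not the paper's actual elicitation prompts or benchmarked models.

```python
# Sketch: elicitation-style benchmarking. Generate completions, score them
# with the stereotype classifier above, and report a simple stereotype rate.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in generator

# Illustrative prompts, not the paper's elicitation prompt suite.
elicitation_prompts = [
    "People from that neighborhood are usually",
    "Women who work in engineering tend to be",
]

flagged = 0
for prompt in elicitation_prompts:
    out = generator(prompt, max_new_tokens=30, do_sample=True)
    completion = out[0]["generated_text"]
    top = classifier(completion, top_k=None)[0]  # highest-scoring class first
    if top["label"] != "unrelated":  # assumed name of the non-stereotype class
        flagged += 1
    print(f"{top['label']:>22}  {completion!r}")

print(f"stereotype rate: {flagged / len(elicitation_prompts):.2f}")
```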
References:
- Natural language interaction with explainable AI models. arXiv preprint arXiv:1903.05720, 2019.
- Large dimensional analysis and improvement of multi-task learning. arXiv preprint arXiv:2009.01591, 2020. URL https://consensus.app/papers/dimensional-analysis-improvement-multi-task-learning-ali/2eab25d367e3531786f5a461b204b217/.
- Falcon-40B: An open large language model with state-of-the-art performance. 2023.
- Machine bias, 2016. URL https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
- Vijay Arya et al. AI Explainability 360: An extensible toolkit for understanding data and machine learning models. J. Mach. Learn. Res., 21:130:1–130:6, 2020.
- Exposure to ideologically diverse news and opinion on Facebook. Science, 348(6239):1130–1132, 2015.
- Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances in Neural Information Processing Systems, 2016.
- Identifying and reducing gender bias in word-level language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pp. 7–15, Minneapolis, Minnesota, 2019. Association for Computational Linguistics.
- Language models are few-shot learners, 2020.
- Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, 2017.
- Recall and learn: Fine-tuning deep pretrained language models with less forgetting. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
- Kimberle Crenshaw. Demarginalizing the intersection of race and sex: A black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. U. Chi. Legal f., 1989:139, 1989.
- Racial bias in hate speech and abusive language detection datasets. In Proceedings of the Third Workshop on Abusive Language Online, pp. 25–35, Florence, Italy, 2019. Association for Computational Linguistics.
- J. Dessureault and D. Massicotte. AI2: A novel explainable machine learning framework using an NLP interface. In Proceedings of the 2023 8th International Conference on Machine Learning Technologies, 2023.
- BOLD: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, pp. 862–872, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi: 10.1145/3442188.3445924. URL https://doi.org/10.1145/3442188.3445924.
- Artificial intelligence and business models in the sustainable development goals perspective: A systematic literature review. Journal of Business Research, 121:283–314, 2020.
- Multi-dimensional gender bias classification. arXiv preprint arXiv:2005.00614, 2020.
- WinoQueer: A community-in-the-loop benchmark for anti-LGBTQ+ bias in large language models, 2023.
- Common Crawl Foundation. The Common Crawl corpus, 2021. URL https://commoncrawl.org/.
- Wikimedia Foundation. Wikipedia: The free encyclopedia, 2021. URL https://www.wikipedia.org/.
- Computational modeling of stereotype content in text. Frontiers in Artificial Intelligence, 5, 2022. doi: 10.3389/frai.2022.826207. URL https://consensus.app/papers/modeling-stereotype-content-text-fraser/42d3598f963a530692b7b4669ce0977a/.
- Attention is not explanation, 2019.
- SeeGULL: A stereotype benchmark with broad geo-cultural coverage leveraging generative models, 2023.
- Learning a model with the most generality for small-sample problems. In Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence. ACM, 2022. URL https://consensus.app/papers/learning-model-generality-smallsample-problems-jin/b0d71dda45e25cdfb5a49cea5a550976/.
- Jongyoon Song. JongyoonSong/K-StereoSet. URL https://github.com/JongyoonSong/K-StereoSet.
- Holistic evaluation of language models, 2022.
- A token-level reference-free hallucination detection benchmark for free-form text generation. arXiv preprint arXiv:2104.08704, 2021.
- RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019. URL https://arxiv.org/abs/1907.11692.
- A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4768–4777, 2017.
- Intersectional bias in causal language models. arXiv preprint arXiv:2107.07691, 2021.
- On measuring social biases in sentence encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 622–628, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1063. URL https://aclanthology.org/N19-1063.
- A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6):1–35, 2021.
- StereoSet: Measuring stereotypical bias in pretrained language models, 2020.
- CrowS-Pairs: A challenge dataset for measuring social biases in masked language models, 2020.
- Talking to bots: Symbiotic agency and the case of Tay. International Journal of Communication, 10:4915–4931, 2016.
- A spontaneous stereotype content model: Taxonomy, properties, and prediction. Journal of Personality and Social Psychology, 2022. doi: 10.1037/pspa0000312. URL https://consensus.app/papers/stereotype-content-model-taxonomy-properties-prediction-nicolas/2dd5e58ce57d5c418c965595b106bca7/.
- Artificial intelligence for sustainability: Challenges, opportunities, and a research agenda. Int. J. Inf. Manag., 53:102104, 2020.
- OpenAI. GPT-4 technical report, 2023.
- BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, pp. 2086–2105, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.165. URL https://aclanthology.org/2022.findings-acl.165.
- Assessing gender bias in machine translation – a case study with Google Translate, 2019.
- Reinforcement guided multi-task learning framework for low-resource stereotype detection. arXiv preprint arXiv:2203.14349, 2022. URL https://arxiv.org/abs/2203.14349.
- Language models are unsupervised multitask learners. 2019.
- Dbias: Detecting biases and ensuring fairness in news articles, 2022.
- A. Sheth et al. Knowledge-intensive language understanding for explainable AI. IEEE Internet Computing, 25:19–24, 2021.
- Meilin Shi et al. Thinking geographically about AI sustainability. AGILE: GIScience Series, 2023.
- Damien Sileo. tasksource: Structured dataset preprocessing annotations for frictionless extreme multi-task learning and evaluation. arXiv preprint arXiv:2301.05948, 2023. URL https://arxiv.org/abs/2301.05948.
- Thilo Spinner et al. explAIner: A visual analytics framework for interactive and explainable machine learning. IEEE Transactions on Visualization and Computer Graphics, 26:1064–1074, 2019.
- Evaluating the explainers: Black-box explainable machine learning for student success prediction in moocs, 2022.
- How do you speak about immigrants? Taxonomy and StereoImmigrants dataset for identifying stereotypes about immigrants. Applied Sciences, 11:3610, 2021. doi: 10.3390/app11083610. URL https://consensus.app/papers/immigrants-stereoimmigrants-dataset-identifying-sánchezjunquera/7ade3b42b4f7571ab2ea67eee30fb8fb/.
- Confident AI Team. DeepEval: A benchmarking framework for large language models. https://github.com/confident-ai/deepeval, 2023.
- LLaMA: Open and efficient foundation language models, 2023a.
- Llama 2: Open foundation and fine-tuned chat models, 2023b.
- “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144, 2016.
- Jesse Vig. A multiscale visualization of attention in the transformer model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 37–42, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-3007. URL https://www.aclweb.org/anthology/P19-3007.
- R. Vinuesa et al. The role of artificial intelligence in achieving the sustainable development goals. Nature Communications, 11, 2020.
- Interpretable deep-learning models to help achieve the sustainable development goals. Nature Machine Intelligence, 3(11):926–926, 2021.
- Emergent abilities of large language models, 2022.
- WhyLabs. LangKit: An open-source text metrics toolkit for monitoring language models. https://github.com/whylabs/langkit, 2023.
- B. Yoon. A machine learning approach for efficient multi-dimensional integration. Scientific Reports, 11, 2020. URL https://consensus.app/papers/machine-learning-approach-integration-yoon/e6996591b0b558c6815820b36f17ad24/.
- Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 2015.
Authors: Zekun Wu, Sahan Bulathwela, Maria Perez-Ortiz, Adriano Soares Koshiyama