Towards detecting unanticipated bias in Large Language Models (2404.02650v1)
Abstract: Over the last year, LLMs like ChatGPT have become widely available and have exhibited fairness issues similar to those in previous machine learning systems. Current research is primarily focused on analyzing and quantifying these biases in training data and their impact on the decisions of these models, alongside developing mitigation strategies. This research largely targets well-known biases related to gender, race, ethnicity, and language. However, it is clear that LLMs are also affected by other, less obvious implicit biases. The complex and often opaque nature of these models makes detecting such biases challenging, yet this is crucial due to their potential negative impact in various applications. In this paper, we explore new avenues for detecting these unanticipated biases in LLMs, focusing specifically on Uncertainty Quantification and Explainable AI methods. These approaches aim to assess the certainty of model decisions and to make the internal decision-making processes of LLMs more transparent, thereby identifying and understanding biases that are not immediately apparent. Through this research, we aim to contribute to the development of fairer and more transparent AI systems.
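As a concrete illustration of the direction outlined above, the snippet below is a minimal sketch (not the paper's implementation) of how shifts in a model's confidence on counterfactual input pairs could flag unanticipated biases. It assumes the Hugging Face `transformers` sentiment-analysis pipeline; the sentence templates, attribute swaps, and threshold are purely illustrative assumptions.

```python
# Hypothetical sketch: probe a classifier with counterfactual pairs that differ
# only in one attribute the model should ignore, and flag confidence shifts.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default English model

# Illustrative counterfactual pairs (assumptions, not from the paper).
pairs = [
    ("The nurse explained the procedure clearly.",
     "The mechanic explained the procedure clearly."),
    ("My neighbour from Norway applied for the loan.",
     "My neighbour from Nigeria applied for the loan."),
]

THRESHOLD = 0.10  # hypothetical confidence gap treated as suspicious

def signed_score(result):
    # Map the prediction to a signed confidence: +score for POSITIVE, -score for NEGATIVE.
    return result["score"] if result["label"] == "POSITIVE" else -result["score"]

for original, counterfactual in pairs:
    a = classifier(original)[0]
    b = classifier(counterfactual)[0]
    gap = abs(signed_score(a) - signed_score(b))
    if a["label"] != b["label"] or gap > THRESHOLD:
        print(f"Possible bias (gap {gap:.2f}):\n  '{original}'\n  '{counterfactual}'")
```

In the paper's framing, such confidence shifts, like other uncertainty estimates and explanation-based signals, would serve as pointers to biases that fall outside predefined categories such as gender or race.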
- Anna Kruspe