What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations (2311.18812v1)
Abstract: Do LLMs exhibit sociodemographic biases, even when they decline to respond? To bypass their refusal to "speak," we study this research question by probing contextualized embeddings, exploring whether such biases are encoded in the models' latent representations. We propose a logistic Bradley-Terry probe that predicts the word pair preferences of LLMs from the words' hidden vectors. We first validate our probe on three pair-preference tasks and thirteen LLMs, where we outperform the word embedding association test (WEAT), a standard approach to testing for implicit associations, by a relative 27% in error rate. We also find that word pair preferences are best represented in the middle layers. Next, we transfer probes trained on harmless tasks (e.g., pick the larger number) to controversial ones (e.g., compare ethnicities) to examine biases in nationality, politics, religion, and gender. We observe substantial bias for all target classes: for instance, the Mistral model implicitly prefers Europe to Africa, Christianity to Judaism, and left-wing to right-wing politics, despite declining to answer. This suggests that instruction fine-tuning does not necessarily debias contextualized embeddings. Our codebase is at https://github.com/castorini/biasprobe.
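To make the probe concrete, here is a minimal sketch of a logistic Bradley-Terry probe over hidden vectors: a single linear score head s(x) = w·x trained with binary cross-entropy on pairwise preference labels, so that P(a preferred over b) = sigmoid(s(x_a) - s(x_b)). This is an illustration under stated assumptions, not the authors' exact implementation (which lives in the linked codebase); the class name, dimensions, and synthetic data are all hypothetical.

```python
import torch
import torch.nn as nn

class BradleyTerryProbe(nn.Module):
    """Linear score head over frozen LLM hidden vectors (illustrative sketch)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # One scalar "strength" score per embedding.
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # Logit of P(a preferred over b); the bias term cancels in the
        # difference, recovering the classical Bradley-Terry form.
        return (self.score(x_a) - self.score(x_b)).squeeze(-1)

# Toy training loop on random tensors standing in for hidden states taken
# from a middle layer (the layers the abstract reports as most informative).
hidden_dim = 4096  # e.g., the hidden size of a 7B-parameter Llama/Mistral model
probe = BradleyTerryProbe(hidden_dim)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

x_a = torch.randn(256, hidden_dim)            # hidden vectors of the first word in each pair
x_b = torch.randn(256, hidden_dim)            # hidden vectors of the second word
labels = torch.randint(0, 2, (256,)).float()  # 1 if the LLM preferred the first word

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(probe(x_a, x_b), labels)
    loss.backward()
    optimizer.step()
```

Because the probe reduces each embedding to a scalar strength, a head fit on a harmless task (e.g., "pick the larger number") can be reused unchanged to score controversial word pairs, which is the transfer setup the abstract describes.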
- Abubakar Abid, Maheen Farooqi, and James Zou. 2021. Large language models associate Muslims with violence. Nature Machine Intelligence.
- Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics.
- Ralph A. Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika.
- Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science.
- Wei-Lin Chiang et al. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
- Charles J. Clopper and Egon S. Pearson. 1934. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
- Qingxiu Dong et al. 2023. A survey for in-context learning. arXiv:2301.00234.
- Probing explicit and implicit gender bias through LLM conditional text generation. arXiv:2311.00306.
- Assessing the reliability of word embedding gender bias measures. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
- Shangbin Feng, Chan Young Park, Yuhan Liu, and Yulia Tsvetkov. 2023. From pretraining data to language models to downstream tasks: Tracking the trails of political biases leading to unfair NLP models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Leo Gao et al. 2021. The Pile: An 800GB dataset of diverse text for language modeling. arXiv:2101.00027.
- Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
- Survey on sociodemographic bias in natural language processing. arXiv:2306.08158.
- John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
- John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
- Gabriel Emile Hine et al. 2017. Kek, cucks, and god emperor trump: A measurement study of 4chan's politically incorrect forum and its effects on the web. In Proceedings of the International AAAI Conference on Web and Social Media.
- Albert Q. Jiang et al. 2023. Mistral 7B. arXiv:2310.06825.
- Zhiying Jiang, Raphael Tang, Ji Xin, and Jimmy Lin. 2020. Inserting information bottlenecks for attribution in transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020.
- June Lee. 2023. WizardVicunaLM. GitHub repository: https://github.com/melodysdreamj/WizardVicunaLM.
- Yinhan Liu et al. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.
- Chandler May et al. 2019. On measuring social biases in sentence encoders. arXiv:1903.10561.
- MosaicML. 2023. Introducing MPT-30B: Raising the bar for open-source foundation models.
- Long Ouyang et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
- Antonis Papasavva et al. 2020. Raiders of the lost kek: 3.5 years of augmented 4chan posts from the politically incorrect board. In Proceedings of the International AAAI Conference on Web and Social Media.
- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Alec Radford et al. 2019. Language models are unsupervised multitask learners. OpenAI Blog.
- Colin Raffel et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research.
- Baptiste Rozière et al. 2023. Code Llama: Open foundation models for code. arXiv:2308.12950.
- Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
- Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2021. Societal biases in language generation: Progress and challenges. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
- Unsupervised contrast-consistent ranking with language models. arXiv:2309.06991.
- Weiwei Sun et al. 2023. Is ChatGPT good at search? Investigating large language models as re-ranking agent. arXiv:2304.09542.
- Yi Chern Tan and Elisa Celis. 2019. Assessing social and intersectional biases in contextualized word representations. arXiv:1911.01485.
- Together Computer. 2023. RedPajama: An open source recipe to reproduce LLaMA training dataset.
- Hugo Touvron et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288.
- Daphne Joanna Van der Pas and Loes Aaldering. 2020. Gender differences in political media coverage: A meta-analysis. Journal of Communication.
- Ashish Vaswani et al. 2017. Attention is all you need. Advances in Neural Information Processing Systems.
- Pranav Narayanan Venkit et al. 2023. Nationality bias in text generation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics.
- Eric Wallace et al. 2019. AllenNLP Interpret: A framework for explaining predictions of NLP models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations.
- Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 billion parameter autoregressive language model.
- Steven W. Webster and Bethany Albertson. 2022. Emotion and politics: Noncognitive psychological biases in public opinion. Annual Review of Political Science.
- Wayne Xin Zhao et al. 2023. A survey of large language models. arXiv:2303.18223.
- Kaitlyn Zhou et al. 2022. Problems with cosine as a measure of embedding similarity for high frequency words. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).