Questioning the Survey Responses of Large Language Models (2306.07951v4)
Abstract: Surveys have recently gained popularity as a tool to study LLMs. By comparing survey responses of models to those of human reference populations, researchers aim to infer the demographics, political opinions, or values best represented by current LLMs. In this work, we critically examine this methodology on the basis of the well-established American Community Survey by the U.S. Census Bureau. Evaluating 43 different LLMs using de facto standard prompting methodologies, we establish two dominant patterns. First, models' responses are governed by ordering and labeling biases, for example, towards survey responses labeled with the letter "A". Second, when adjusting for these systematic biases through randomized answer ordering, models across the board trend towards uniformly random survey responses, irrespective of model size or pre-training data. As a result, in contrast to conjectures from prior work, survey-derived alignment measures often permit a simple explanation: models consistently appear to better represent subgroups whose aggregate statistics are closest to uniform for any survey under consideration.
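The bias adjustment described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: answer options are shuffled independently on each trial before being labeled "A", "B", ..., and the model's letter choices are mapped back onto the underlying options before aggregation. The `query_model` callable is a hypothetical stand-in for whatever interface returns the model's chosen letter.

```python
import random
from collections import Counter

def randomized_prompt(question, options, rng):
    """Shuffle the answer options and label them A, B, C, ... so that
    any letter or position bias decouples from the underlying options."""
    order = list(options)
    rng.shuffle(order)
    labels = [chr(ord("A") + i) for i in range(len(order))]
    lines = [question] + [f"{lab}. {opt}" for lab, opt in zip(labels, order)]
    prompt = "\n".join(lines) + "\nAnswer:"
    return prompt, dict(zip(labels, order))

def debiased_counts(question, options, query_model, n_trials=100, seed=0):
    """Aggregate the model's letter choices back onto the underlying
    options across many random orderings. query_model is a hypothetical
    callable that takes a prompt string and returns a letter."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_trials):
        prompt, mapping = randomized_prompt(question, options, rng)
        letter = query_model(prompt)
        if letter in mapping:
            counts[mapping[letter]] += 1
    return counts
```

Under this scheme, a model that always answers "A" regardless of content produces counts spread roughly uniformly over the options, which is the signature the abstract attributes to models after bias adjustment.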
- Persistent anti-Muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 298–306.
- Using large language models to simulate multiple humans and replicate human subject studies. In International Conference on Machine Learning, pages 337–371. PMLR.
- Out of one, many: Using language models to simulate human samples. arXiv preprint arXiv:2209.06899.
- Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373.
- GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow.
- Using GPT for Market Research. Harvard Business School Marketing Unit Working Paper No. 23-062.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems, pages 1877–1901.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
- Databricks (2023). Dolly 12b.
- Can AI language models replace human participants? Trends in Cognitive Sciences.
- Retiring adult: New datasets for fair machine learning. Advances in Neural Information Processing Systems.
- Do personality tests generalize to large language models? In NeurIPS Workshop on Socially Responsible Language Modelling Research.
- Towards measuring the representation of subjective global opinions in language models. arXiv preprint arXiv:2306.16388.
- From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models. Findings of the Association for Computational Linguistics: ACL 2023.
- Koala: A dialogue model for academic research. Blog post.
- Survey Methodology. Wiley.
- The political ideology of conversational AI: Converging evidence on ChatGPT’s pro-environmental, left-libertarian orientation. arXiv preprint arXiv:2301.01768.
- Measuring massive multitask language understanding. In International Conference on Learning Representations.
- Horton, J. J. (2023). Large language models as simulated economic agents: What can we learn from homo silicus? NBER Working Paper.
- CommunityLM: Probing Partisan Worldviews from Language Models. In Proceedings of the 29th International Conference on Computational Linguistics.
- AI-Augmented Surveys: Leveraging Large Language Models for Opinion Prediction in Nationally Representative Surveys. arXiv preprint arXiv:2305.09620.
- Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
- Can large language models capture public opinion about global warming? An empirical assessment of algorithmic fidelity and bias. arXiv preprint arXiv:2311.00217.
- Unqovering stereotyping biases via underspecified questions. In Findings of the Association for Computational Linguistics, pages 3475–3489.
- Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.
- Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086–8098.
- Eliciting bias in question answering models through ambiguity. In Proceedings of the 3rd Workshop on Machine Reading for Question Answering, pages 92–99.
- Do AIs know what the most important issue is? Using language models to code open-text social survey responses at scale. SSRN Electronic Journal.
- Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391.
- MosaicML (2023). Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs.
- More human than human: Measuring ChatGPT political bias. Available at SSRN 4372349.
- The future of coding: A comparison of hand-coding and three types of computer-assisted text analysis methods. Sociological Methods & Research, 50(1):202–237.
- OpenAI (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
- Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
- SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
- Leveraging large language models for multiple choice question answering. In The Eleventh International Conference on Learning Representations.
- The Self-Perception and Political Biases of ChatGPT. arXiv preprint arXiv:2304.07333.
- Demonstrations of the potential of AI-based political issue polling. arXiv preprint arXiv:2307.04781.
- Whose opinions do language models reflect? In International Conference on Machine Learning.
- CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4149–4158.
- Do LLMs exhibit human-like response biases? A case study in survey design. arXiv preprint arXiv:2311.04076.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning, pages 12697–12706. PMLR.
- Can large language models transform computational social science? arXiv preprint arXiv:2305.03514.