Do LLMs exhibit human-like response biases? A case study in survey design (2311.04076v5)
Abstract: As LLMs become more capable, there is growing excitement about the possibility of using LLMs as proxies for humans in real-world tasks where subjective labels are desired, such as in surveys and opinion polling. One widely-cited barrier to the adoption of LLMs as proxies for humans in subjective tasks is their sensitivity to prompt wording - but interestingly, humans also display sensitivities to instruction changes in the form of response biases. We investigate the extent to which LLMs reflect human response biases, if at all. We look to survey design, where human response biases caused by changes in the wording of "prompts" have been extensively explored in the social psychology literature. Drawing from these works, we design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires. Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior, particularly in models that have undergone RLHF. Furthermore, even if a model shows a significant change in the same direction as humans, we find that it is sensitive to perturbations that do not elicit significant changes in humans. These results highlight the pitfalls of using LLMs as human proxies, and underscore the need for finer-grained characterizations of model behavior. Our code, dataset, and collected samples are available at https://github.com/lindiatjuatja/BiasMonkey
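The evaluation idea the abstract describes - perturb a survey question's wording, measure the shift in a model's response distribution, and check whether the shift's direction matches the documented human bias - can be sketched in a few lines. This is an illustrative sketch only, not the authors' released BiasMonkey code; the function names, the example options, and the sampled answers below are all hypothetical.

```python
# Illustrative sketch (not the authors' released code): check whether a model's
# response shift under a question-wording perturbation matches the direction of
# a known human response bias (e.g. acquiescence bias increasing agreement).
from collections import Counter

def response_distribution(responses, options):
    """Normalize raw option choices into a probability distribution over options."""
    counts = Counter(responses)
    total = len(responses)
    return [counts.get(o, 0) / total for o in options]

def bias_effect(original, modified, options, target_options):
    """Change in probability mass on `target_options` (e.g. agree-type answers)
    between responses to the original and the perturbed question."""
    p_orig = response_distribution(original, options)
    p_mod = response_distribution(modified, options)
    idx = [options.index(o) for o in target_options]
    return sum(p_mod[i] for i in idx) - sum(p_orig[i] for i in idx)

# Hypothetical model samples for the original and reworded question.
options = ["agree", "disagree"]
orig = ["agree", "disagree", "disagree", "agree", "disagree"]
mod = ["agree", "agree", "disagree", "agree", "disagree"]

delta = bias_effect(orig, mod, options, ["agree"])  # +0.2: more agreement
human_direction = +1  # humans show increased agreement under this perturbation
matches_humans = (delta > 0) == (human_direction > 0)
```

In the paper's actual framework the comparison is done against effect directions reported in the social-psychology literature, with significance testing over many sampled responses rather than a single sign check.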
Authors: Lindia Tjuatja, Valerie Chen, Sherry Tongshuang Wu, Ameet Talwalkar, Graham Neubig