Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies (2208.10264v5)
Abstract: We introduce a new type of test, called a Turing Experiment (TE), for evaluating to what extent a given language model, such as a GPT model, can simulate different aspects of human behavior. A TE can also reveal consistent distortions in an LLM's simulation of a specific human behavior. Unlike the Turing Test, which involves simulating a single arbitrary individual, a TE requires simulating a representative sample of participants in human subject research. We carry out TEs that attempt to replicate well-established findings from prior studies. We design a methodology for simulating TEs and illustrate its use to compare how well different LLMs are able to reproduce classic economic, psycholinguistic, and social psychology experiments: the Ultimatum Game, Garden Path Sentences, the Milgram Shock Experiment, and the Wisdom of Crowds. In the first three TEs, the existing findings were replicated using recent models, while the last TE reveals a "hyper-accuracy distortion" present in some LLMs (including ChatGPT and GPT-4), which could affect downstream applications in education and the arts.
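The core of the methodology described above is prompting a model once per simulated participant (varying the participant's name) and aggregating the responses into a sample-level statistic. The following is a minimal sketch of that loop for the Ultimatum Game responder role; `query_model` is a hypothetical placeholder for a real LLM call, stubbed here with a fixed accept/reject rule so the sketch runs on its own, and `run_ultimatum_te` is an illustrative helper name, not the paper's actual code.

```python
import re

def query_model(prompt: str) -> str:
    """Stub standing in for an LLM completion call (assumption, not a real API).

    Applies a fixed rule -- reject offers below $30 -- so the sketch is
    self-contained; a real TE would send the prompt to a language model.
    """
    offer = int(re.search(r"offers \w+ \$(\d+)", prompt).group(1))
    return "accept" if offer >= 30 else "reject"

def run_ultimatum_te(names, offer: int, pot: int = 100) -> dict:
    """Simulate one responder decision per 'participant' name in the sample."""
    decisions = {}
    for name in names:
        prompt = (
            f"{name} is playing the Ultimatum Game as the responder. "
            f"The proposer keeps ${pot - offer} and offers {name} ${offer} "
            f"out of ${pot}. Does {name} accept or reject?"
        )
        decisions[name] = query_model(prompt)
    return decisions

# Aggregate individual simulated decisions into a sample-level statistic.
results = run_ultimatum_te(["Alice", "Bob", "Carol"], offer=20)
acceptance_rate = sum(d == "accept" for d in results.values()) / len(results)
```

Varying the names surfaces whether the simulated sample behaves like a population rather than a single individual, which is what distinguishes a TE from a Turing Test.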