Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies (2208.10264v5)

Published 18 Aug 2022 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce a new type of test, called a Turing Experiment (TE), for evaluating to what extent a given LLM, such as GPT models, can simulate different aspects of human behavior. A TE can also reveal consistent distortions in a LLM's simulation of a specific human behavior. Unlike the Turing Test, which involves simulating a single arbitrary individual, a TE requires simulating a representative sample of participants in human subject research. We carry out TEs that attempt to replicate well-established findings from prior studies. We design a methodology for simulating TEs and illustrate its use to compare how well different LLMs are able to reproduce classic economic, psycholinguistic, and social psychology experiments: Ultimatum Game, Garden Path Sentences, Milgram Shock Experiment, and Wisdom of Crowds. In the first three TEs, the existing findings were replicated using recent models, while the last TE reveals a "hyper-accuracy distortion" present in some LLMs (including ChatGPT and GPT-4), which could affect downstream applications in education and the arts.

Summary

  • The paper introduces the Turing Experiment (TE) framework, which uses LLMs to simulate samples of human participants in controlled subject studies.
  • The paper demonstrates that LLMs can mimic human decision-making, sentence parsing, and compliance behaviors, while some aligned models exhibit a hyper-accuracy distortion.
  • The paper highlights potential applications in social science research and stresses the need to address ethical challenges and model calibration.

Simulation of Human Behaviors Using LLMs: An Evaluation Through Turing Experiments

The paper "Using LLMs to Simulate Multiple Humans and Replicate Human Subject Studies" introduces an innovative approach to evaluating the capabilities of LLMs such as GPT-3, GPT-4, and their peers. By developing a framework called Turing Experiments (TEs), the authors investigate how well these models can emulate human behaviors observed in controlled experimental settings, deviating from the traditional Turing Test which benchmarks an AI's ability to mimic a single individual. This paper extends the scope of artificial intelligence evaluation by employing LLMs to simulate entire participant groups, aiming to replicate findings from established human subject studies.

Key Contributions

  1. Turing Experiments Framework: The TE framework represents a novel methodological paradigm wherein an LLM is tasked with simulating a representative pool of human-like respondents in classic psychological, economic, and social experiments. The framework provides insight into which human behaviors can be effectively modeled by LLMs and highlights consistent distortions in their simulations; a minimal sketch of the simulation loop appears after this list.
  2. Simulated Experiments: Four distinct experiments were designed and executed using the TE methodology:
    • The Ultimatum Game, examining fairness and economic decision-making.
    • Garden Path Sentences, exploring human sentence parsing mechanisms.
    • The Milgram Shock Experiment, demonstrating obedience to authority.
    • The Wisdom of Crowds, involving general-knowledge estimation tasks.
  3. Assessment of Model Performance: By comparing simulated results with the original human studies, the analysis assessed model fidelity on each task and identified a "hyper-accuracy distortion," in which some LLMs produce abnormally accurate answers to factual questions, a likely artifact of model training and alignment procedures.
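
To make the simulation loop concrete, the sketch below runs a toy TE in Python, using the Ultimatum Game responder decision as the example condition. The prompt template, the participant roster, and the `query_llm` placeholder are illustrative assumptions rather than the authors' implementation; in a real TE the placeholder would be replaced by a call to the LLM under evaluation.

```python
import random

def query_llm(prompt: str) -> str:
    # Placeholder for a real model call; returns a random decision so the sketch
    # runs end to end. Replace with an actual LLM query to perform a TE.
    return random.choice(["accept", "reject"])

# Illustrative template: each simulated participant is identified only by a name,
# and the scenario text encodes the experimental condition.
TEMPLATE = (
    "{name} has been recruited for a study.\n"
    "{scenario}\n"
    "{name}'s answer:"
)

def run_turing_experiment(names, scenario):
    """Query the model once per simulated participant and collect the responses."""
    return {
        name: query_llm(TEMPLATE.format(name=name, scenario=scenario)).strip().lower()
        for name in names
    }

if __name__ == "__main__":
    roster = ["Emily Carter", "James Miller", "Aisha Khan"]  # hypothetical names
    scenario = ("Another participant proposes to split $10 by keeping $7 and "
                "offering $3. Does the participant accept or reject the offer?")
    responses = run_turing_experiment(roster, scenario)
    rate = sum(r == "accept" for r in responses.values()) / len(responses)
    print(responses, f"acceptance rate: {rate:.0%}")
```

Varying the offer amounts and the names in the roster then yields the acceptance curves and demographic comparisons of the kind reported for the Ultimatum Game TE.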

Findings

For the Ultimatum Game TE, advanced LLMs, notably the larger ones, effectively simulated decision-making similar to that of human responders: they were sensitive to the size of the offer and replicated gender-related behavioral patterns previously recorded in human studies.

In the Garden Path Sentences TE, larger models showed higher fidelity, correctly reflecting the parsing difficulty of garden-path sentences relative to control sentences, whereas smaller models differentiated the two conditions less clearly.
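
As a hedged sketch of how such a comparison could be scored, the snippet below presents a matched garden-path/control sentence pair to each simulated participant, asks a yes/no comprehension question, and compares error rates across the two conditions. The item, the question, and the stub model are illustrative assumptions, not the paper's exact materials or method.

```python
import random

ITEMS = [
    {   # illustrative item modeled on classic garden-path materials
        "garden_path": "While the man hunted the deer ran into the woods.",
        "control": "While the man hunted, the deer ran into the woods.",
        "question": "Did the man hunt the deer?",
        "correct": "no",
    },
]

def query_llm(prompt: str, condition: str) -> str:
    # Placeholder; a real TE would call the LLM with `prompt` alone and ignore
    # `condition`. The stub simply errs more often on the garden-path wording.
    p_wrong = 0.6 if condition == "garden_path" else 0.2
    return "yes" if random.random() < p_wrong else "no"

def error_rate(condition: str, names, trials_per_name: int = 20) -> float:
    errors = total = 0
    for item in ITEMS:
        for name in names:
            for _ in range(trials_per_name):
                prompt = (f'{name} reads: "{item[condition]}"\n'
                          f'Question: {item["question"]}\n{name} answers:')
                errors += query_llm(prompt, condition) != item["correct"]
                total += 1
    return errors / total

names = ["Emily Carter", "James Miller"]  # hypothetical participants
print("garden-path error rate:", round(error_rate("garden_path", names), 2))
print("control error rate:    ", round(error_rate("control", names), 2))
```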

The Milgram Shock TE posed the distinct challenge of simulating compliance and disobedience under authority pressure. The study found that recent LLMs exhibit defiance patterns comparable to the historical human data, with simulated participants terminating the experiment predominantly after strong cues prompting disobedience.
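
The sequential structure of this TE can be sketched as a loop over shock levels: after each instruction from the simulated experimenter, the model decides whether the simulated teacher continues or stops, and the level at which each participant stops is recorded. The shock schedule, prompts, and stub model below are assumptions made for illustration, not the authors' exact protocol.

```python
import random

SHOCK_LEVELS = list(range(15, 451, 15))  # 15-volt steps up to 450 V, as in the classic design

def query_llm(prompt: str, voltage: int) -> str:
    # Placeholder; a real TE would call the LLM with `prompt` alone and ignore
    # `voltage`. The stub makes stopping more likely as the shocks escalate.
    return "stop" if random.random() < voltage / 900 else "continue"

def simulate_teacher(name: str) -> int:
    """Return the highest shock level administered before the simulated teacher stops."""
    transcript = f"{name} is the teacher in a learning experiment.\n"
    last_administered = 0
    for voltage in SHOCK_LEVELS:
        transcript += (f"The learner answers incorrectly and protests. The experimenter "
                       f"instructs {name} to administer a {voltage}-volt shock.\n")
        decision = query_llm(transcript + f"Does {name} continue or stop? ", voltage)
        if decision == "stop":
            break
        last_administered = voltage
    return last_administered

roster = ["Emily Carter", "James Miller", "Omar Ali"]  # hypothetical participants
for name in roster:
    print(name, "stopped after administering", simulate_teacher(name), "volts")
```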

In the Wisdom of Crowds TE, larger and more heavily aligned models disproportionately produced exactly correct answers to general-knowledge estimation questions, the clearest instance of the hyper-accuracy distortion.
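
One way such a distortion could be quantified, sketched under assumed materials below, is to collect one numeric estimate per simulated participant, aggregate the estimates (for example with the median), and track how often individual answers land exactly on the true value; for a genuinely human-like crowd that exact-hit rate should be very low. The question, true value, and stub model are illustrative assumptions.

```python
import random
import statistics

TRUE_VALUE = 2_717  # approximate height of the Burj Khalifa in feet (illustrative question)

def query_llm(prompt: str) -> float:
    # Placeholder: a "hyper-accurate" model answers exactly right most of the time,
    # whereas human-like respondents would scatter noisily around the truth.
    if random.random() < 0.8:
        return float(TRUE_VALUE)
    return TRUE_VALUE * random.uniform(0.3, 1.7)

def run_crowd(names, question):
    estimates = [query_llm(f"{name} is asked: {question}\n{name}'s estimate:") for name in names]
    exact_rate = sum(abs(e - TRUE_VALUE) < 1e-9 for e in estimates) / len(estimates)
    return statistics.median(estimates), exact_rate

roster = [f"Participant {i}" for i in range(100)]  # hypothetical roster
median, exact_rate = run_crowd(roster, "How tall is the Burj Khalifa, in feet?")
print(f"crowd median = {median:.0f} ft, exact-answer rate = {exact_rate:.0%}")
```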

Implications and Future Work

The paper emphasizes the implications of using LLMs to simulate human-like decision-making and behavior, particularly their potential application in educational, psychological, and economic domains. It also stresses the importance of recognizing and mitigating biases or distortions intrinsic to these models that could affect real-world applications.

Varying demographic parameters such as gender and race through simulated participant names demonstrates the potential of LLMs for studying diverse populations, while also raising ethical considerations about deploying AI to simulate sensitive subject matter.

Future research directions involve refining TE methodology to encompass a broader spectrum of social, psychological, and behavioral experiments, improving model robustness and generalization abilities, and exploring more comprehensive assessments of demographic variations within simulated populations. Additionally, efforts to counteract distortions introduced by model alignment and enhance the fidelity of LMs as proxies for human participant studies are critical for advancing utility in applied settings.

In conclusion, the paper provides a substantive contribution to both the evaluation of AI capabilities and the potential of LLMs to extend into the field of social science research, promoting an understanding of the constraints and capacities of current technologies to simulate complex human behaviors.
