Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies (2208.10264v5)

Published 18 Aug 2022 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce a new type of test, called a Turing Experiment (TE), for evaluating to what extent a given LLM, such as GPT models, can simulate different aspects of human behavior. A TE can also reveal consistent distortions in a LLM's simulation of a specific human behavior. Unlike the Turing Test, which involves simulating a single arbitrary individual, a TE requires simulating a representative sample of participants in human subject research. We carry out TEs that attempt to replicate well-established findings from prior studies. We design a methodology for simulating TEs and illustrate its use to compare how well different LLMs are able to reproduce classic economic, psycholinguistic, and social psychology experiments: Ultimatum Game, Garden Path Sentences, Milgram Shock Experiment, and Wisdom of Crowds. In the first three TEs, the existing findings were replicated using recent models, while the last TE reveals a "hyper-accuracy distortion" present in some LLMs (including ChatGPT and GPT-4), which could affect downstream applications in education and the arts.

Summary

  • The paper introduces the Turing Experiment (TE) framework, which uses LLMs to simulate samples of human participants in controlled subject studies.
  • The paper demonstrates that LLMs can mimic human decision-making, sentence parsing, and compliance behaviors, while some aligned models exhibit a hyper-accuracy distortion.
  • The paper highlights potential applications in social science research and stresses the need to address ethical challenges and model calibration.

Simulation of Human Behaviors Using LLMs: An Evaluation Through Turing Experiments

The paper "Using LLMs to Simulate Multiple Humans and Replicate Human Subject Studies" introduces an innovative approach to evaluating the capabilities of LLMs such as GPT-3, GPT-4, and their peers. By developing a framework called Turing Experiments (TEs), the authors investigate how well these models can emulate human behaviors observed in controlled experimental settings, deviating from the traditional Turing Test which benchmarks an AI's ability to mimic a single individual. This paper extends the scope of artificial intelligence evaluation by employing LLMs to simulate entire participant groups, aiming to replicate findings from established human subject studies.

Key Contributions

  1. Turing Experiments Framework: The TE framework represents a novel methodological paradigm wherein an LLM is tasked with simulating a representative pool of human-like respondents in classic psychological, economic, and social experiments. The framework provides insight into which human behaviors can be effectively modeled by LLMs and highlights consistent distortions in their simulations; a minimal sketch of the simulation loop appears after this list.
  2. Simulated Experiments: Four distinct experiments were designed and executed using the TE methodology:
    • The Ultimatum Game, examining fairness and economic decision-making.
    • Garden Path Sentences, exploring human sentence parsing mechanisms.
    • The Milgram Shock Experiment, demonstrating obedience to authority.
    • The Wisdom of Crowds, involving general-knowledge estimation tasks.
  3. Assessment of Model Performance: By comparing simulated results with the original human studies, the analysis assessed model fidelity on each task and identified a "hyper-accuracy distortion," in which some LLMs produce abnormally accurate answers to factual questions, a likely artifact of model training and alignment procedures.
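
To make the simulation loop concrete, the sketch below runs a toy TE in Python, using the Ultimatum Game responder decision as the example condition. The prompt template, the participant roster, and the `query_llm` placeholder are illustrative assumptions rather than the authors' implementation; in a real TE the placeholder would be replaced by a call to the LLM under evaluation.

```python
import random

def query_llm(prompt: str) -> str:
    # Placeholder for a real model call; returns a random decision so the sketch
    # runs end to end. Replace with an actual LLM query to perform a TE.
    return random.choice(["accept", "reject"])

# Illustrative template: each simulated participant is identified only by a name,
# and the scenario text encodes the experimental condition.
TEMPLATE = (
    "{name} has been recruited for a study.\n"
    "{scenario}\n"
    "{name}'s answer:"
)

def run_turing_experiment(names, scenario):
    """Query the model once per simulated participant and collect the responses."""
    return {
        name: query_llm(TEMPLATE.format(name=name, scenario=scenario)).strip().lower()
        for name in names
    }

if __name__ == "__main__":
    roster = ["Emily Carter", "James Miller", "Aisha Khan"]  # hypothetical names
    scenario = ("Another participant proposes to split $10 by keeping $7 and "
                "offering $3. Does the participant accept or reject the offer?")
    responses = run_turing_experiment(roster, scenario)
    rate = sum(r == "accept" for r in responses.values()) / len(responses)
    print(responses, f"acceptance rate: {rate:.0%}")
```

Varying the offer amounts and the names in the roster then yields the acceptance curves and demographic comparisons of the kind reported for the Ultimatum Game TE.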

Findings

For the Ultimatum Game TE, advanced LLMs, notably the larger ones, effectively simulated decision-making similar to that of human responders: they were sensitive to the size of the offer and replicated gender-related behavioral patterns previously recorded in human studies.

In the Garden Path Sentences TE, larger models showed higher fidelity, correctly reflecting the parsing difficulty of garden-path sentences relative to control sentences, whereas smaller models differentiated the two conditions less clearly.
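
As a hedged sketch of how such a comparison could be scored, the snippet below presents a matched garden-path/control sentence pair to each simulated participant, asks a yes/no comprehension question, and compares error rates across the two conditions. The item, the question, and the stub model are illustrative assumptions, not the paper's exact materials or method.

```python
import random

ITEMS = [
    {   # illustrative item modeled on classic garden-path materials
        "garden_path": "While the man hunted the deer ran into the woods.",
        "control": "While the man hunted, the deer ran into the woods.",
        "question": "Did the man hunt the deer?",
        "correct": "no",
    },
]

def query_llm(prompt: str, condition: str) -> str:
    # Placeholder; a real TE would call the LLM with `prompt` alone and ignore
    # `condition`. The stub simply errs more often on the garden-path wording.
    p_wrong = 0.6 if condition == "garden_path" else 0.2
    return "yes" if random.random() < p_wrong else "no"

def error_rate(condition: str, names, trials_per_name: int = 20) -> float:
    errors = total = 0
    for item in ITEMS:
        for name in names:
            for _ in range(trials_per_name):
                prompt = (f'{name} reads: "{item[condition]}"\n'
                          f'Question: {item["question"]}\n{name} answers:')
                errors += query_llm(prompt, condition) != item["correct"]
                total += 1
    return errors / total

names = ["Emily Carter", "James Miller"]  # hypothetical participants
print("garden-path error rate:", round(error_rate("garden_path", names), 2))
print("control error rate:    ", round(error_rate("control", names), 2))
```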

The Milgram Shock TE posed the distinct challenge of simulating compliance and disobedience under authority pressure. The study found that recent LLMs exhibit defiance patterns comparable to the historical human data, with simulated participants terminating the experiment predominantly after strong cues prompting disobedience.
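
The sequential structure of this TE can be sketched as a loop over shock levels: after each instruction from the simulated experimenter, the model decides whether the simulated teacher continues or stops, and the level at which each participant stops is recorded. The shock schedule, prompts, and stub model below are assumptions made for illustration, not the authors' exact protocol.

```python
import random

SHOCK_LEVELS = list(range(15, 451, 15))  # 15-volt steps up to 450 V, as in the classic design

def query_llm(prompt: str, voltage: int) -> str:
    # Placeholder; a real TE would call the LLM with `prompt` alone and ignore
    # `voltage`. The stub makes stopping more likely as the shocks escalate.
    return "stop" if random.random() < voltage / 900 else "continue"

def simulate_teacher(name: str) -> int:
    """Return the highest shock level administered before the simulated teacher stops."""
    transcript = f"{name} is the teacher in a learning experiment.\n"
    last_administered = 0
    for voltage in SHOCK_LEVELS:
        transcript += (f"The learner answers incorrectly and protests. The experimenter "
                       f"instructs {name} to administer a {voltage}-volt shock.\n")
        decision = query_llm(transcript + f"Does {name} continue or stop? ", voltage)
        if decision == "stop":
            break
        last_administered = voltage
    return last_administered

roster = ["Emily Carter", "James Miller", "Omar Ali"]  # hypothetical participants
for name in roster:
    print(name, "stopped after administering", simulate_teacher(name), "volts")
```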

In the Wisdom of Crowds TE, larger and more heavily aligned models disproportionately produced exactly correct answers to general-knowledge estimation questions, the clearest instance of the hyper-accuracy distortion.
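
One way such a distortion could be quantified, sketched under assumed materials below, is to collect one numeric estimate per simulated participant, aggregate the estimates (for example with the median), and track how often individual answers land exactly on the true value; for a genuinely human-like crowd that exact-hit rate should be very low. The question, true value, and stub model are illustrative assumptions.

```python
import random
import statistics

TRUE_VALUE = 2_717  # approximate height of the Burj Khalifa in feet (illustrative question)

def query_llm(prompt: str) -> float:
    # Placeholder: a "hyper-accurate" model answers exactly right most of the time,
    # whereas human-like respondents would scatter noisily around the truth.
    if random.random() < 0.8:
        return float(TRUE_VALUE)
    return TRUE_VALUE * random.uniform(0.3, 1.7)

def run_crowd(names, question):
    estimates = [query_llm(f"{name} is asked: {question}\n{name}'s estimate:") for name in names]
    exact_rate = sum(abs(e - TRUE_VALUE) < 1e-9 for e in estimates) / len(estimates)
    return statistics.median(estimates), exact_rate

roster = [f"Participant {i}" for i in range(100)]  # hypothetical roster
median, exact_rate = run_crowd(roster, "How tall is the Burj Khalifa, in feet?")
print(f"crowd median = {median:.0f} ft, exact-answer rate = {exact_rate:.0%}")
```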

Implications and Future Work

The paper emphasizes the implications of using LLMs to simulate human-like decision-making and behavior, particularly their potential application in educational, psychological, and economic domains. It also stresses the importance of recognizing and mitigating biases or distortions intrinsic to these models that could affect real-world applications.

Varying demographic parameters such as gender and race through simulated participant names demonstrates the potential of LLMs for studying diverse populations, while also raising ethical considerations about deploying AI to simulate sensitive subject matter.

Future research directions involve refining TE methodology to encompass a broader spectrum of social, psychological, and behavioral experiments, improving model robustness and generalization abilities, and exploring more comprehensive assessments of demographic variations within simulated populations. Additionally, efforts to counteract distortions introduced by model alignment and enhance the fidelity of LMs as proxies for human participant studies are critical for advancing utility in applied settings.

In conclusion, the paper provides a substantive contribution to both the evaluation of AI capabilities and the potential of LLMs to extend into the field of social science research, promoting an understanding of the constraints and capacities of current technologies to simulate complex human behaviors.
