
Out of One, Many: Using Language Models to Simulate Human Samples (2209.06899v1)

Published 14 Sep 2022 in cs.LG and cs.CL

Abstract: We propose and explore the possibility that LLMs can be studied as effective proxies for specific human sub-populations in social science research. Practical and research applications of artificial intelligence tools have sometimes been limited by problematic biases (such as racism or sexism), which are often treated as uniform properties of the models. We show that the "algorithmic bias" within one such tool -- the GPT-3 LLM -- is instead both fine-grained and demographically correlated, meaning that proper conditioning will cause it to accurately emulate response distributions from a wide variety of human subgroups. We term this property "algorithmic fidelity" and explore its extent in GPT-3. We create "silicon samples" by conditioning the model on thousands of socio-demographic backstories from real human participants in multiple large surveys conducted in the United States. We then compare the silicon and human samples to demonstrate that the information contained in GPT-3 goes far beyond surface similarity. It is nuanced, multifaceted, and reflects the complex interplay between ideas, attitudes, and socio-cultural context that characterize human attitudes. We suggest that LLMs with sufficient algorithmic fidelity thus constitute a novel and powerful tool to advance understanding of humans and society across a variety of disciplines.

Citations (409)

Summary

  • The paper demonstrates that GPT-3 can simulate human demographic responses through algorithmic fidelity when conditioned on survey-derived backstories.
  • It introduces evaluation criteria including a Social Science Turing Test and continuity measures to validate model outputs against human data.
  • Results show GPT-3 accurately mirrors voting behaviors and political attitudes, indicating a cost-effective tool for social science research.

Using GPT-3 to Simulate Demographically Conditioned Human Responses in Social Science Research

The paper "Out of One, Many: Using Language Models to Simulate Human Samples" investigates the potential of leveraging large-scale LLMs, specifically GPT-3, as proxies for human sub-populations in social science research. The central thesis proposes that GPT-3 reflects fine-grained, demographically correlated biases, thereby permitting the simulation of response distributions across various human subgroups. This capability is referred to as "algorithmic fidelity," highlighting the model's capacity to emulate complex human attitudes when appropriately conditioned.

Context and Approach

Historically, artificial intelligence models have been criticized for displaying biases, often regarded as uniform defects requiring mitigation. This paper introduces a paradigm shift, proposing that these biases can be more accurately characterized as reflective of the diverse associations between ideas, attitudes, and contexts found within human populations. The authors explore algorithmic fidelity by conditioning GPT-3 on socio-demographic backstories derived from actual survey data, including the American National Election Studies. The method involves creating "silicon samples" to compare GPT-3 outputs against human responses in controlled social science tasks.
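A minimal sketch of what such backstory conditioning might look like in code. The field names, template wording, and profile values below are illustrative assumptions, not the authors' actual prompts:

```python
# Sketch: assemble a first-person socio-demographic backstory from
# survey-style fields, then append the item the model should answer.
# Template and field names are hypothetical, for illustration only.

def build_silicon_prompt(profile: dict, question: str) -> str:
    backstory = (
        f"Racially, I am {profile['race']}. I am {profile['gender']}. "
        f"Ideologically, I am {profile['ideology']}. "
        f"I am {profile['age']} years old. "
        f"Politically, I identify as a {profile['party']}."
    )
    # The model completes the sentence, yielding one "silicon" response.
    return f"{backstory}\n\nWhen asked: \"{question}\", I answer:"

prompt = build_silicon_prompt(
    {"race": "white", "gender": "male", "ideology": "conservative",
     "age": 58, "party": "Republican"},
    "In the 2016 presidential election, who did you vote for?",
)
```

Repeating this over thousands of real respondents' backstories, and sampling a completion for each, produces the "silicon sample" whose response distribution is then compared with the human one.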

Criteria for Algorithmic Fidelity

The authors put forth four criteria to evaluate the algorithmic fidelity of LLMs:

  1. Social Science Turing Test: Generated responses should be indistinguishable from human responses.
  2. Backward Continuity: Generated responses should be consistent with the conditioning backstory, such that the demographic attributes of the backstory could be inferred from the responses.
  3. Forward Continuity: Outputs should proceed naturally and logically from the conditioning context.
  4. Pattern Correspondence: The relationships among ideas, demographics, and behaviors in model outputs should match those in human data.

These criteria offer a comprehensive framework for assessing the potential of LLMs as simulators of human cognition and behavior.

Empirical Validation and Results

The paper details a series of studies employing GPT-3 to simulate responses in the domain of U.S. politics. The first study demonstrates that GPT-3-generated lists describing political partisans are perceived similarly to human-generated lists in terms of tone and content, satisfying the Social Science Turing Test criterion. The second study involves voting behavior prediction, where GPT-3's outputs closely mirror human voting patterns across demographic groups, evidencing forward continuity and pattern correspondence. The final study examines associations among multiple socio-political variables, again showing that GPT-3 can reproduce the complex relational patterns found among human subjects.
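Pattern correspondence of this kind can be quantified by correlating subgroup-level response rates from the silicon sample against those from the human sample. The proportions below are invented placeholders, not results from the paper:

```python
# Sketch: measure pattern correspondence as the Pearson correlation
# between per-subgroup response shares in human vs. silicon samples.
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation over paired subgroup proportions."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical share of each demographic subgroup reporting a vote
# for one candidate (human survey vs. GPT-3 "silicon" sample):
human_shares   = [0.88, 0.12, 0.55, 0.40, 0.62]
silicon_shares = [0.85, 0.15, 0.58, 0.38, 0.60]

r = pearson(human_shares, silicon_shares)
```

A correlation near 1 across many subgroups indicates that the model reproduces not just aggregate marginals but the relational structure between demographics and behavior.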

Implications and Future Work

The findings suggest that LLMs, when appropriately conditioned, can serve as effective tools in social science research, offering insights into demographic-specific attitudes and behaviors without deploying costly human surveys. The demonstrated algorithmic fidelity opens avenues for generating hypotheses and refining research methodologies prior to empirical testing with human subjects. However, the paper highlights the necessity for ongoing exploration of both the capabilities and limitations of algorithmic fidelity in diverse domains.

Conclusion

This research marks an important step in integrating AI-driven simulations into social science. While the practical applications are promising, further work is needed to establish the extent of algorithmic fidelity across different contexts and to refine the conditioning techniques employed. This paper encourages both computational and social science communities to engage in collective efforts to harness and scrutinize the capabilities of advanced LLMs in representing human social and political behavior.
