- The paper demonstrates that GPT-4 replicated 76% of main effects in 154 psychological experiments, closely mirroring human response patterns.
- The study reveals that only 19.44% of GPT-4's confidence intervals captured original effect sizes, with many cases showing inflated effects and unexpected significant results.
- The research highlights challenges in replicating studies on sensitive topics and complex scenarios, emphasizing the need for human validation in AI-driven psychological experiments.
Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs
The paper "Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs" investigates the potential of large language models (LLMs), specifically GPT-4, to replicate human behavior in psychological experiments. It addresses a significant question in contemporary AI and psychological science: can AI serve as a reliable stand-in for human participants in experimental research?
The researchers conducted a comprehensive study replicating 154 psychological experiments, encompassing 618 main effects and 138 interaction effects, using GPT-4 as a simulated participant. The experiments were sourced from top-tier social science journals, ensuring rigorous and diverse methodological frameworks. The findings reveal a nuanced picture of the capabilities and limitations of LLMs in this context.
Key Findings
- Replication Success Rates:
- GPT-4 successfully replicated 76.0% of main effects, closely matching human response patterns in direction and significance.
- Interaction effects had a lower replication rate of 47.0%, mirroring the challenges faced in human participant studies where interaction effects are inherently more complex.
- Effect Size Discrepancies:
- Only 19.44% of GPT-4's replicated confidence intervals contained the original effect sizes.
- LLM-based replications tended to produce larger effect sizes, with 51.50% of original r-values falling below the lower bound of the replication’s confidence interval.
- The replication generated unexpected significant results in 71.6% of cases where the original studies reported null findings, suggesting potential overestimation or false positives by GPT-4.
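The coverage criterion behind the 19.44% figure, checking whether a replication's confidence interval contains the original effect size, can be sketched with a Fisher z-transform interval, which is the standard approach for correlations. This is an illustrative reconstruction, not code from the paper; the function names and sample values are hypothetical.

```python
import math
from statistics import NormalDist


def r_confidence_interval(r, n, conf=0.95):
    """Confidence interval for a correlation r (sample size n)
    via the Fisher z-transform."""
    z = math.atanh(r)                          # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)                # standard error of z
    crit = NormalDist().inv_cdf((1 + conf) / 2)  # normal critical value
    lo, hi = z - crit * se, z + crit * se
    return math.tanh(lo), math.tanh(hi)        # back-transform to r scale


def original_captured(r_original, r_replication, n_replication):
    """True if the replication's CI contains the original effect size."""
    lo, hi = r_confidence_interval(r_replication, n_replication)
    return lo <= r_original <= hi
```

Under this criterion, an original r = 0.05 would fall below the lower bound of a replication with r = 0.30 and n = 100, the pattern the paper reports for 51.50% of original r-values.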
- Influence of Study Features:
- Studies involving socially sensitive topics such as race and ethics exhibited lower replication rates, likely due to GPT-4’s alignment with socially acceptable norms, reducing the likelihood of producing controversial responses.
- Experiments requiring scenario adaptations to suit GPT-4 had lower replication success, indicating challenges in adapting complex, nuanced scenarios for AI interpretation.
- Statistical Patterns:
- The p-values from GPT-4 replications were generally smaller than those from the original studies, producing nominally stronger evidence against the null hypothesis.
- This shift toward smaller p-values translated into higher rates of statistical significance across the replicated effects.
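The link between the inflated effect sizes and the smaller p-values follows directly from how a correlation is tested: at a fixed sample size, a larger |r| yields a more extreme test statistic. A minimal sketch using the Fisher z approximation (an assumption for illustration; the paper's exact test procedures may differ):

```python
import math
from statistics import NormalDist


def r_p_value(r, n):
    """Two-sided p-value for H0: rho = 0, using the Fisher z
    approximation (z-statistic ~ standard normal under H0)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# With the same n, an inflated effect size produces a smaller p-value,
# so a GPT-4 replication that overestimates r will look "more significant".
p_original = r_p_value(0.15, 200)   # modest original effect
p_inflated = r_p_value(0.30, 200)   # larger replicated effect
```

This is why overestimated effect sizes and higher significance rates appear together in the results: they are two views of the same distortion.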
Implications for Psychological Science
The findings underscore the potential and limitations of using LLMs in psychological research. The high replication rates for main effects highlight the capability of GPT-4 to simulate human responses in structured scenarios. This opens avenues for utilizing LLMs in preliminary hypothesis testing and experimental design refinement, offering a cost-effective and scalable tool before committing to resource-intensive human trials.
However, the discrepancies in effect sizes and the tendency toward higher significance rates call for caution. The inflated effect sizes and unexpected significant results suggest that LLMs may amplify subtle effects or generate false positives. This overestimation necessitates validation through human studies to confirm AI-driven findings and avoid misinterpretation.
Theoretical and Practical Considerations
The paper contributes to the theoretical understanding of AI's role in social sciences. By systematically comparing LLM responses with human data, it provides insights into the cognitive mechanisms of LLMs and their interaction with experimental stimuli. This knowledge is crucial for refining LLM training methodologies and aligning AI responses with human cognition more closely.
Practically, while LLMs cannot yet fully replace the nuanced, contextually rich responses of human subjects, they can complement human research by identifying potential areas of interest and refining experimental designs. This hybrid approach can accelerate psychological research, allowing for more efficient data collection and hypothesis generation.
Future Directions
Future research should focus on enhancing the interpretability of LLM outputs and reducing the risk of false positives. This includes developing methodologies to better align LLM responses with complex human behaviors and ethical considerations. Additionally, exploring the application of LLMs across a broader range of social sciences can further elucidate their potential and limitations.
As AI continues to evolve, its integration into psychological research will likely deepen, offering novel tools and perspectives that bridge the gap between human cognition and artificial intelligence. The paper lays a foundational understanding, emphasizing a cautious yet optimistic approach to incorporating LLMs in experimental research.