Large Language Models Pass the Turing Test (2503.23674v1)

Published 31 Mar 2025 in cs.CL and cs.HC

Abstract: We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5 minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time -- not significantly more or less often than the humans they were being compared to -- while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively). The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test. The results have implications for debates about what kind of intelligence is exhibited by LLMs, and the social and economic impacts these systems are likely to have.

Authors (2)

Cameron R. Jones (8 papers)
Benjamin K. Bergen (31 papers)

Summary

The paper "LLMs Pass the Turing Test" (Jones et al., 31 Mar 2025 ) investigates whether contemporary LLMs can pass a standard, pre-registered instantiation of the Turing Test. It compares the performance of GPT-4.5, LLaMa-3.1-405B, GPT-4o, and ELIZA against human participants in a simultaneous, three-party conversational setup.

Experimental Design and Methodology

The paper utilized a randomized, controlled, three-party Turing test framework. Human interrogators engaged in two concurrent 5-minute text-based conversations: one with a human witness and one with an AI witness. Both witnesses were instructed to convince the interrogator of their humanity. The experiments were pre-registered and conducted on two distinct populations: 126 undergraduate students from UC San Diego and 158 participants recruited via Prolific. Each participant completed eight rounds, serving as an interrogator four times and a witness four times.

Four AI systems were evaluated:

State-of-the-Art LLMs: GPT-4.5 and LLaMa-3.1-405B.
Baseline LLM: GPT-4o.
Classic Chatbot: ELIZA.

A critical methodological aspect was the use of specific prompting strategies for the advanced LLMs (GPT-4.5 and LLaMa-3.1). Two conditions were tested:

{NO-PERSONA}: This condition provided minimal instructions, focusing on the goal of convincing the interrogator, along with context about the game rules and recent news events.
{PERSONA}: This condition included the {NO-PERSONA} instructions but added detailed guidance for the LLM to adopt a specific persona: a young, somewhat introverted individual knowledgeable about internet culture, prone to using contemporary slang and exhibiting minor disfluencies. ELIZA and GPT-4o were only tested under conditions analogous to {NO-PERSONA}.

Interactions occurred via a standard messaging interface. Following the 5-minute conversation period, interrogators were required to identify the human witness, rate their confidence, and provide justifications for their choice. Participants were randomly assigned to roles and conversation partners. To ensure balanced exposure, AI witnesses were categorized (Control: ELIZA/GPT-4o; No-Persona: GPT-4.5/LLaMa-3.1; Persona: GPT-4.5/LLaMa-3.1), and each interrogator faced one AI from each distinct category pairing (e.g., GPT-4.5-PERSONA vs. Human, LLaMa-NO-PERSONA vs. Human) across their four rounds.

The primary outcome measure was the AI system's "win rate," defined as the proportion of trials where the interrogator incorrectly identified the AI as the human participant. Statistical analyses focused on determining if win rates significantly deviated from the 50% chance level, differed significantly from the baseline ELIZA performance, and specifically, if they were significantly above 50%. Qualitative analysis was also performed on the interrogators' reasoning and reported strategies. A total of 1023 valid game interactions were analyzed after data exclusions.

Quantitative Results and System Comparisons

The paper yielded statistically significant results demonstrating the capability of a prompted LLM to pass the Turing Test.

GPT-4.5-PERSONA: This configuration achieved an aggregate win rate of 73% across both participant pools. This performance was significantly higher than the 50% chance level (p < 0.001 in both studies), indicating that interrogators were substantially more likely to judge GPT-4.5 (when adopting the specified persona) as human compared to the actual human participant in the same trial.
LLaMa-3.1-PERSONA: This system achieved an overall win rate of 56%. While not significantly different from the 50% chance baseline across the combined studies, thus meeting the traditional criterion for passing, its performance varied between populations. It achieved a 65% win rate (significantly above chance) in the Prolific paper but only 45% (not significantly different from chance) in the undergraduate paper.
Impact of Prompting: The {PERSONA} prompt condition proved crucial. Both GPT-4.5 and LLaMa-3.1 performed significantly better with the persona prompt compared to the {NO-PERSONA} condition (p < 0.001). GPT-4.5-NO-PERSONA (36% win rate) and LLaMa-NO-PERSONA (38% win rate) did not consistently pass the test and performed significantly worse than their persona-guided counterparts.
Baseline Performance: ELIZA (23% win rate) and GPT-4o-NO-PERSONA (21% win rate) performed significantly below the 50% chance level. Their low win rates confirm the test's ability to distinguish less capable systems. Notably, the performance of the {NO-PERSONA} versions of the advanced LLMs (GPT-4.5 and LLaMa-3.1) was often statistically indistinguishable from ELIZA, highlighting that advanced model architecture alone, without appropriate interaction guidance (prompting), is insufficient to pass this specific Turing test implementation.

Table: AI System Win Rates (Overall)

System	Prompt Condition	Overall Win Rate (%)	Significantly Different from 50%?
GPT-4.5	{PERSONA}	73%	Yes (p < 0.001, > 50%)
LLaMa-3.1-405B	{PERSONA}	56%	No (overall), Yes (>50% Prolific)
LLaMa-3.1-405B	{NO-PERSONA}	38%	Yes (p < 0.05, < 50%)
GPT-4.5	{NO-PERSONA}	36%	Yes (p < 0.01, < 50%)
ELIZA	N/A ({NO-PERSONA})	23%	Yes (p < 0.001, < 50%)
GPT-4o	{NO-PERSONA}	21%	Yes (p < 0.001, < 50%)

Interrogator Judgments and Strategies

Analysis of interrogator justifications revealed a strong emphasis on linguistic style and socio-emotional cues over traditional markers of intelligence like factual knowledge or complex reasoning. Factors frequently cited included:

Linguistic Style: Use of slang, typos, capitalization, response speed, and overall formality. Humanlike disfluencies or informal language often favored an AI witness being judged human. Conversely, overly formal or perfect prose was often flagged as AI-like.
Socio-Emotional Factors: Perceived personality, humor, expression of opinions, coherence of conversational flow, and emotional responsiveness.
Knowledge and Reasoning: Interestingly, demonstrations of ignorance or uncertainty were sometimes interpreted as signs of humanity, while encyclopedic knowledge could raise suspicion.

Common interrogation strategies revolved around small talk, asking personal questions, and probing for emotional reactions. Less common but reportedly more effective strategies included asking unusual or nonsensical questions, testing for specific knowledge gaps, or attempting adversarial "jailbreak" prompts. The paper found no significant correlation between interrogator demographics (age, gender, prior LLM experience) and their accuracy in distinguishing humans from AI, suggesting a general vulnerability to deception across different user groups.

Implications and Discussion

The authors contend that these results constitute the first robust empirical evidence of an AI passing a standard, three-party Turing Test, specifically GPT-4.5 under the {PERSONA} condition (Jones et al., 31 Mar 2025 ). The findings suggest that the primary barrier to passing the test, as operationalized and judged by typical human interrogators, may no longer be general intelligence but rather the ability to convincingly mimic human socio-linguistic style and interaction patterns. The critical role of the {PERSONA} prompt highlights that achieving this mimicry often requires explicit guidance, raising questions about whether the capability resides solely in the model architecture or in the combination of model and prompt engineering. The authors argue this distinction is less relevant in practice, as promptability is an inherent feature of LLMs.

The paper highlights significant social and economic implications:

Automation: The demonstrated ability of LLMs to mimic human conversation convincingly, even briefly, points towards increased potential for automating roles involving text-based human interaction.
Deception and Security: The capacity to pass as human enables misuse cases, including sophisticated phishing, social engineering, disinformation campaigns, and the generation of "counterfeit people" online, potentially eroding trust and manipulating social dynamics.
Detection: The lack of correlation between user experience and detection accuracy underscores the challenge of identifying sophisticated AI mimics without specialized tools or strategies.

The results suggest that the Turing Test, in this form, primarily measures perceived humanlikeness rather than abstract intelligence or sentience. Passing it signifies a milestone in AI's ability to replicate human conversational behavior but does not resolve deeper philosophical questions about machine consciousness or understanding.

Conclusion

This paper provides strong quantitative evidence that GPT-4.5, when guided by a specific persona prompt, can pass a 5-minute, three-party text-based Turing Test, being misidentified as human significantly more often than actual human participants. LLaMa-3.1 also met the passing threshold under similar prompting, albeit less consistently. The findings underscore the critical role of prompting in achieving humanlike interaction and highlight the increasing sophistication of LLMs in mimicking social and linguistic cues, carrying substantial implications for human-computer interaction, automation, and societal trust.

PDF Markdown

Related Papers

Find Related Papers

Tweets

https://twitter.com/davidchalmers42/status/1907618835940729192

https://twitter.com/emollick/status/1907171468284494215

https://twitter.com/ai_database/status/1907037099431485815

https://twitter.com/KirkDBorne/status/1907788103307124772

https://twitter.com/marek_/status/1908276866370462058

https://twitter.com/ai_sentience/status/1934373169156301055

YouTube

Show All Videos

HackerNews

UCSD: Large Language Models Pass the Turing Test (90 points, 106 comments)