
GPT-4 is judged more human than humans in displaced and inverted Turing tests (2407.08853v1)

Published 11 Jul 2024 in cs.HC and cs.CL

Abstract: Everyday AI detection requires differentiating between people and AI in informal, online conversations. In many cases, people will not interact directly with AI systems but instead read conversations between AI systems and other people. We measured how well people and LLMs can discriminate using two modified versions of the Turing test: inverted and displaced. GPT-3.5, GPT-4, and displaced human adjudicators judged whether an agent was human or AI on the basis of a Turing test transcript. We found that both AI and displaced human judges were less accurate than interactive interrogators, with below chance accuracy overall. Moreover, all three judged the best-performing GPT-4 witness to be human more often than human witnesses. This suggests that both humans and current LLMs struggle to distinguish between the two when they are not actively interrogating the person, underscoring an urgent need for more accurate tools to detect AI in conversations.

An Expert Review of "GPT-4 is judged more human than humans in displaced and inverted Turing tests"

The paper, authored by Ishika Rathi, Sydney Taylor, Benjamin K. Bergen, and Cameron R. Jones, examines whether LLMs can imitate human conversational behavior convincingly enough to deceive both human and AI judges. It employs two adaptations of the classical Turing test: the inverted Turing test and the displaced Turing test.

Key Findings

The study yields several notable results:

  1. Inverted Turing Test Performance:
    • Both GPT-3.5 and GPT-4 struggled to distinguish human from AI-generated conversational transcripts, performing significantly worse than interactive human interrogators, with accuracy rates of 31.4% and 36.4%, respectively.
    • GPT-4, when tasked with evaluating the transcript of a top-performing GPT-4 interlocutor, judged it to be human more often (70.9%) than actual human transcripts (34.8%).
  2. Displaced Turing Test Performance:
    • Displaced human judges also exhibited lower accuracy (48.6%) compared to interactive interrogators (64.8%).
    • The best GPT-4 witness achieved a pass rate of 78%, which was significantly higher than the pass rate for human witnesses (58.2%).
  3. Learning Influences:
    • Human participants displayed an increase in accuracy over the course of the trials, suggesting a learning effect.
    • When GPT-4 was tested with In-Context Learning (ICL), its accuracy improved to 58%, aligning it closely with human performance in the displaced setting.
  4. Statistical AI Detection Methods:
    • Methods based on log-likelihood and on the curvature of AI-generated text were explored (a minimal sketch of this style of detector appears after this list). Curvature yielded higher accuracy (69%) than any adjudicator type, including interactive human interrogators.
    • However, higher variability within witness types reduced the overall discriminative power of these approaches.

Implications

These findings carry significant implications for the field of AI, particularly for conversational AI and AI-detection systems:

  1. Challenges to AI Detection:
    • Both AI and displaced human judges find it difficult to reliably discern between human and AI-generated conversations. This indicates a pressing need for developing more sophisticated AI-detection tools, especially as AI-generated content becomes more prevalent.
  2. Human-like Deception by AIs:
    • The high pass rates of the best GPT-4 witness underscore the capability of advanced LLMs to mimic human-like conversation convincingly enough to deceive both AI and human judges. This carries profound implications for online interactions, where AI-generated content might be mistaken for human dialogue.
  3. Future Research Directions:
    • The observed learning effects among displaced human adjudicators suggest that continued exposure to AI-generated conversations could improve human detection abilities. Future studies could explore the efficacy of feedback mechanisms and educational interventions for enhancing human judgment accuracy.
    • Additionally, refining statistical AI-detection tools by accounting for variability and contextual factors could significantly enhance their reliability and utility.

Conclusion

The research offers a nuanced view of the current state of AI capabilities in mimicking human conversation and the challenges involved in detecting AI-generated content. The findings reveal that both humans and AI adjudicators have substantial room for improvement in distinguishing between human and AI interlocutors in realistic, non-interactive scenarios. As conversational AI systems continue to evolve and proliferate, the development of more robust AI-detection methods will be critical to maintaining the integrity and trustworthiness of online communications.
