- The paper demonstrates that expert annotators detect AI-generated text with high accuracy, with expert TPRs averaging 92.7% in the first experiment and near-perfect majority votes across multiple experiments.
- The study shows that experts outperform most commercial and open-source detectors even when texts are paraphrased or humanized.
- The analysis reveals that experts rely on lexical and stylistic markers like vocabulary and sentence structure to distinguish AI-generated from human-written text.
The paper investigates the ability of humans to detect text generated by commercial LLMs, focusing on modern models and evasion tactics such as paraphrasing and humanization. The authors hired annotators to read 300 non-fiction English articles and label them as either human-written or AI-generated, providing paragraph-length explanations for their decisions.
The key findings are:
- Annotators who frequently use LLMs for writing tasks (referred to as "experts") are highly accurate at detecting AI-generated text, even without specialized training or feedback.
- The majority vote among five expert annotators misclassifies only 1 of 300 articles, outperforming most commercial and open-source detectors even on paraphrased and humanized articles (a minimal majority-vote sketch follows this list).
- Expert annotators rely on specific lexical clues ("AI vocabulary") and more complex phenomena such as formality, originality, and clarity.
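To make the aggregation concrete, here is a minimal sketch of the five-annotator majority vote; the labels and function names are illustrative, not taken from the released codebase:

```python
from collections import Counter

def majority_vote(labels: list[str]) -> str:
    """Return the label chosen by most annotators ("AI" or "Human")."""
    # With five annotators and two classes there is always a strict majority.
    return Counter(labels).most_common(1)[0][0]

# Example: four of five experts flag an article as AI-generated.
print(majority_vote(["AI", "AI", "Human", "AI", "AI"]))  # -> AI
```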
The paper involved five experiments that progressively increased the difficulty of AI-generated text detection.
- Experiment 1: Evaluated annotators with varying backgrounds on articles generated by GPT-4o (2024-08-06). Annotators with limited LLM experience performed poorly, while those with frequent writing-related LLM usage were highly accurate: non-experts averaged a true positive rate (TPR) of 56.7% and a false positive rate (FPR) of 52.5%, whereas experts achieved an average TPR of 92.7% and an average FPR of 3.3% (TPR and FPR are defined in the sketch after this list).
- Experiment 2: Examined whether experts could detect articles generated by Claude \cite{anthropicclaude}. The expert majority vote was 100% accurate, indicating that experts were able to generalize to different LLMs.
- Experiment 3: Tested the robustness of experts against paraphrasing attacks using GPT-4o (2024-08-06). Experts maintained high TPR and low FPR, with the majority vote remaining 100% accurate, suggesting that paraphrasing is not effective against expert human detectors.
- Experiment 4: Assessed whether experts could detect articles generated by o1-Pro shortly after its release. Four out of five experts remained robust, and the majority vote achieved a TPR of 96.7% with an FPR of 0%.
- Experiment 5: Explored the impact of humanization techniques on expert detection rates. Despite detailed instructions designed to remove telltale AI signatures, most experts remained robust, and the expert majority vote was perfect on all 60 articles. However, annotator 3, who struggled to detect non-humanized o1-Pro articles in Experiment 4, performed remarkably poorly on this batch, achieving a TPR of zero.
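For reference, the TPR and FPR figures quoted above follow the standard definitions, treating "AI-generated" as the positive class: TPR is the share of AI-generated articles labeled AI, and FPR is the share of human-written articles labeled AI. A minimal sketch with illustrative names:

```python
def tpr_fpr(y_true: list[str], y_pred: list[str]) -> tuple[float, float]:
    """Compute (TPR, FPR), treating "AI" as the positive class."""
    tp = sum(t == "AI" and p == "AI" for t, p in zip(y_true, y_pred))
    fn = sum(t == "AI" and p == "Human" for t, p in zip(y_true, y_pred))
    fp = sum(t == "Human" and p == "AI" for t, p in zip(y_true, y_pred))
    tn = sum(t == "Human" and p == "Human" for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), fp / (fp + tn)
```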
The paper compared expert human detectors against several automatic detectors: Pangram, GPTZero, Binoculars, Fast-DetectGPT, and RADAR. Only Pangram's Humanizers variant matched the expert human majority vote (the base Pangram model averaged a TPR of 99.3% with an FPR of 2.7%).
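The open-source detectors in this comparison (Binoculars, Fast-DetectGPT, RADAR) output continuous scores rather than hard labels, so their TPRs depend on a decision threshold. One common protocol, which may differ from the paper's exact setup, is to calibrate the threshold to a target FPR on human-written articles; a sketch under that assumption:

```python
import numpy as np

def threshold_at_fpr(human_scores: np.ndarray, target_fpr: float = 0.05) -> float:
    """Score cutoff whose FPR on human-written calibration articles is ~target_fpr."""
    # Assumes higher scores mean "more likely AI"; the (1 - target_fpr)
    # quantile of human scores leaves roughly target_fpr of them above it.
    return float(np.quantile(human_scores, 1.0 - target_fpr))

def tpr_at_cutoff(ai_scores: np.ndarray, cutoff: float) -> float:
    """Fraction of AI-generated articles scoring above the cutoff."""
    return float(np.mean(ai_scores > cutoff))
```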
A fine-grained analysis of expert explanations revealed clues related to vocabulary (53.1% of explanations), sentence structure (35.9%), grammar (24.8%), and originality (23.7%). Vocabulary-related clues became less frequent for o1-Pro articles, especially after humanization. When experts made incorrect decisions, only 31% of explanations mentioned vocabulary, while 50% focused on sentence structure.
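Such an analysis boils down to tagging each free-text explanation with one or more clue categories and reporting the fraction of explanations per category. A simplified keyword-matching sketch (the paper's actual clue taxonomy and annotation procedure are richer; the keyword lists below are purely illustrative):

```python
from collections import Counter

# Purely illustrative keyword lists; an explanation may count toward
# several categories, so the fractions need not sum to 100%.
CLUE_KEYWORDS = {
    "vocabulary": ["vocabulary", "word choice", "delve"],
    "sentence structure": ["sentence structure", "syntax", "parallelism"],
    "grammar": ["grammar", "punctuation", "typo"],
    "originality": ["original", "generic", "formulaic"],
}

def clue_frequencies(explanations: list[str]) -> dict[str, float]:
    """Fraction of explanations that mention each clue category."""
    counts = Counter()
    for text in explanations:
        lowered = text.lower()
        for category, keywords in CLUE_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                counts[category] += 1
    return {cat: counts[cat] / len(explanations) for cat in CLUE_KEYWORDS}
```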
The authors also experimented with prompting LLMs to mimic expert annotators by providing a guidebook and asking the model to produce an explanation and a label. While this approach showed promise, it failed to reach the performance of human experts and advanced automatic detectors.
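A rough sketch of such a prompting setup, assuming the OpenAI Python client; the guidebook text, prompt wording, and choice of model here are placeholders rather than the authors' actual configuration:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GUIDEBOOK = "..."  # placeholder for a guidebook distilled from expert explanations

def llm_detect(article: str, model: str = "gpt-4o") -> str:
    """Ask an LLM for an explanation followed by a final AI/HUMAN label."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert at spotting AI-generated text. "
                    "Using the guidebook below, write a short explanation, "
                    "then end with a line 'LABEL: AI' or 'LABEL: HUMAN'.\n\n"
                    + GUIDEBOOK
                ),
            },
            {"role": "user", "content": article},
        ],
    )
    return response.choices[0].message.content
```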
The contributions of the paper include:
- Demonstrating that expert annotators are highly accurate and robust detectors of AI-generated text.
- Highlighting the features that expert annotators focus on, including vocabulary, sentence structure, and originality.
- Releasing an annotated dataset and codebase to facilitate future research.
The authors conclude that human experts are a valuable resource for detecting AI-generated text and suggest that future work should focus on training human annotators and exploring the combination of human and automatic detection methods. The paper acknowledges limitations such as focusing on American English articles and not investigating factual accuracy.