
Can AI-Generated Text be Reliably Detected? (2303.11156v3)

Published 17 Mar 2023 in cs.CL, cs.AI, and cs.LG

Abstract: The unregulated use of LLMs can potentially lead to malicious consequences such as plagiarism, generating fake news, spamming, etc. Therefore, reliable detection of AI-generated text can be critical to ensure the responsible use of LLMs. Recent works attempt to tackle this problem either using certain model signatures present in the generated text outputs or by applying watermarking techniques that imprint specific patterns onto them. In this paper, we show that these detectors are not reliable in practical scenarios. In particular, we develop a recursive paraphrasing attack to apply on AI text, which can break a whole range of detectors, including those using watermarking schemes as well as neural network-based detectors, zero-shot classifiers, and retrieval-based detectors. Our experiments include passages around 300 tokens in length, showing the sensitivity of the detectors even for relatively long passages. We also observe that our recursive paraphrasing only degrades text quality slightly, measured via human studies and metrics such as perplexity scores and accuracy on text benchmarks. Additionally, we show that even LLMs protected by watermarking schemes can be vulnerable to spoofing attacks aimed at misleading detectors to classify human-written text as AI-generated, potentially causing reputational damage to the developers. In particular, we show that an adversary can infer hidden AI text signatures of the LLM outputs without having white-box access to the detection method. Finally, we provide a theoretical connection between the AUROC of the best possible detector and the Total Variation distance between human and AI text distributions, which can be used to study the fundamental hardness of the reliable detection problem for advanced LLMs. Our code is publicly available at https://github.com/vinusankars/Reliability-of-AI-text-detectors.

Analysis of Detector Performance and Total Variation in AI Text Generation

The paper presents a comprehensive examination of AI-generated text detection, focusing on a framework for analyzing detector performance and on the implications of total variation (TV) for the area under the receiver operating characteristic curve (AUROC). Detecting AI-generated text is a nuanced problem that requires models to distinguish human-written from AI-generated content effectively. The paper builds on the premise that total variation, a statistical measure of the divergence between two distributions, plays a crucial role in bounding detector performance.

Theoretical Contributions

The authors derive a mathematical upper bound on the AUROC of any detector $D$ in terms of the total variation between the machine-generated text distribution $\mathcal{M}$ and the human text distribution $\mathcal{H}$. They prove that the AUROC is bounded above by:

$$\mathsf{AUROC}(D) \leq \frac{1}{2} + \mathsf{TV}(\mathcal{M}, \mathcal{H}) - \frac{\mathsf{TV}(\mathcal{M}, \mathcal{H})^2}{2}$$

This theoretical cornerstone of the paper underscores an inherent limitation of text detectors: the more the two distributions overlap, the smaller $\mathsf{TV}(\mathcal{M}, \mathcal{H})$ becomes, and the closer the best achievable AUROC gets to $\frac{1}{2}$, the performance of a random classifier.
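As a quick sanity check on this bound, the following Python snippet (a minimal sketch, not taken from the paper's released code) evaluates the upper bound for a few TV values; for instance, $\mathsf{TV} = 0.5$ caps the best possible AUROC at $0.875$.

```python
# Upper bound on detector AUROC given the total variation (TV) between
# the AI-text distribution M and the human-text distribution H:
# AUROC(D) <= 1/2 + TV - TV^2 / 2.

def auroc_upper_bound(tv: float) -> float:
    """Best achievable AUROC of any detector when TV(M, H) = tv."""
    assert 0.0 <= tv <= 1.0, "total variation lies in [0, 1]"
    return 0.5 + tv - tv ** 2 / 2

for tv in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"TV = {tv:.2f} -> AUROC <= {auroc_upper_bound(tv):.3f}")
# TV = 0.00 caps AUROC at 0.500 (indistinguishable: random guessing);
# TV = 1.00 allows AUROC up to 1.000 (fully separable distributions).
```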

Experimental Analysis with GPT-3 Models

To empirically validate their theoretical findings, the authors conducted experiments with several GPT-3 model variants: Ada, Babbage, and Curie. The models' outputs were compared against the WebText and ArXiv datasets to estimate the total variation and, in turn, analyze detection performance across varying text lengths. The results indicate that the most capable of the three generators, Curie, exhibited lower total variation than weaker models like Ada, suggesting that its outputs were more akin to human text.

On the more specialized ArXiv-abstracts data, the total variation again decreased for the more capable models, reinforcing the conclusion that as LLMs improve, reliably detecting their output becomes increasingly challenging.
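One standard way to obtain such estimates (a minimal sketch under assumptions, not necessarily the paper's exact estimator) exploits the fact that $\mathsf{TV}(\mathcal{M}, \mathcal{H}) = \sup_A |P_{\mathcal{M}}(A) - P_{\mathcal{H}}(A)|$: any trained binary classifier $D$ defines an event $\{x : D(x) = 1\}$ and therefore yields a lower bound on the TV. The sketch below assumes precomputed feature arrays `feats_ai` and `feats_human` for samples from each distribution.

```python
# Classifier-based lower bound on TV(M, H). Since
# TV(M, H) = sup_A |P_M(A) - P_H(A)|, the event {x : D(x) = 1} induced by
# any classifier D gives |P_M(D=1) - P_H(D=1)| <= TV(M, H).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def tv_lower_bound(feats_ai: np.ndarray, feats_human: np.ndarray) -> float:
    X = np.vstack([feats_ai, feats_human])
    y = np.concatenate([np.ones(len(feats_ai)), np.zeros(len(feats_human))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=0, stratify=y
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    p_ai = pred[y_te == 1].mean()     # estimate of P_M(D = 1)
    p_human = pred[y_te == 0].mean()  # estimate of P_H(D = 1)
    return abs(p_ai - p_human)
```

A stronger classifier yields a tighter (larger) lower bound, which is consistent with the trend reported above: for more capable generators, even good classifiers separate the two distributions less, so the estimated TV drops.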

Implications for AI Text Detection

This research informs the design of AI-text detectors that must operate within the mathematical limits derived above. The bound on AUROC implies a fundamental difficulty in distinguishing the outputs of sophisticated LLMs, motivating ongoing improvements in detection methodologies such as neural network-based detectors, zero-shot approaches, and watermarking schemes. Moreover, the growing similarity between human and AI text underscores the need for novel detection strategies, as current methods may falter against more advanced LLMs.

Future Considerations

The paper also anticipates advances in both generator capability and adversarial techniques for evading detectors. Improved paraphrasing models and the strategic use of prompts that elicit low-entropy outputs could pose significant threats to detector accuracy, as sketched below. Future research could therefore focus on adaptive detection paradigms that preemptively counteract such evasion techniques.
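To make the paraphrasing threat concrete, here is a minimal sketch of a recursive paraphrasing loop built on an off-the-shelf Hugging Face model; the model name `tuner007/pegasus_paraphrase` is an illustrative choice, not necessarily the paraphraser used in the paper, and a real attack would paraphrase long passages sentence by sentence.

```python
# Sketch of a recursive paraphrasing attack: repeatedly rewriting AI text
# washes out the surface statistics (e.g., watermark token patterns)
# that detectors rely on, while preserving most of the meaning.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

MODEL_NAME = "tuner007/pegasus_paraphrase"  # illustrative paraphraser
tokenizer = PegasusTokenizer.from_pretrained(MODEL_NAME)
model = PegasusForConditionalGeneration.from_pretrained(MODEL_NAME)

def paraphrase(text: str) -> str:
    batch = tokenizer([text], truncation=True, padding="longest",
                      max_length=60, return_tensors="pt")
    out = model.generate(**batch, max_length=60, num_beams=5)
    return tokenizer.batch_decode(out, skip_special_tokens=True)[0]

def recursive_paraphrase(text: str, rounds: int = 3) -> str:
    # Each round rewrites the previous round's output, further eroding
    # any detector-visible signature of the original generator.
    for _ in range(rounds):
        text = paraphrase(text)
    return text
```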

In conclusion, this paper provides a mathematical and empirical framework for understanding the limitations of AI-generated text detectors in the presence of high-performing LLMs. Its contributions highlight the critical intersection of statistical metrics and practical AI deployment, offering valuable insights for advancing detection methodologies in response to ever-evolving generative technologies.

Authors (5)
  1. Vinu Sankar Sadasivan
  2. Aounon Kumar
  3. Sriram Balasubramanian
  4. Wenxiao Wang
  5. Soheil Feizi
Citations (308)