
Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews (2403.07183v2)

Published 11 Mar 2024 in cs.CL, cs.AI, cs.LG, and cs.SI

Abstract: We present an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a LLM. Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.

Monitoring AI-Modified Content at Scale in the Peer Review Process

Motivation and Approach

Peer reviews are fundamental to the scientific publication process, ensuring the relevance, rigor, and originality of scientific work. The advent of generative AI, like ChatGPT, has introduced potential changes in how reviews are composed, possibly impacting their quality and authenticity. This paper introduces a novel framework, leveraging a maximum likelihood model, to estimate the proportion of corpus content likely modified by AI at a large scale. Focusing on peer reviews from major AI conferences post-ChatGPT's release, this research uncovers patterns in AI-generated text use and discusses the broader implications for the peer review ecosystem.

Statistical Estimation Framework

At the core of this paper is a maximum likelihood estimation (MLE) approach designed to efficiently discern the extent of AI modification in large text corpora. Using reference corpora of known human-written and known AI-generated documents, the framework models the target corpus as a mixture of the two and estimates the mixture weight, i.e., the fraction of documents likely substantially modified by AI. A critical aspect of this methodology is that it operates at the corpus level rather than classifying individual documents, making it vastly more computationally efficient and less prone to the biases of existing AI detection tools.
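To make the mixture idea concrete, below is a minimal sketch of a corpus-level estimator of this kind. It assumes per-document log-likelihoods under the human and AI reference distributions have already been computed; the array names and the grid-search routine are illustrative assumptions, not the authors' actual implementation (the paper derives its likelihoods from token-level distributions fit on the reference corpora).

```python
import numpy as np

def estimate_alpha(log_p_human, log_p_ai, grid=np.linspace(1e-4, 1 - 1e-4, 999)):
    """Corpus-level MLE of the AI-modified fraction alpha (illustrative sketch).

    log_p_human, log_p_ai: arrays of per-document log-likelihoods under the
    human-written and AI-generated reference distributions, respectively.

    Each document is modeled as drawn from the mixture
        (1 - alpha) * P_human + alpha * P_ai,
    and alpha is chosen to maximize the corpus log-likelihood via grid search.
    """
    log_p_human = np.asarray(log_p_human)
    log_p_ai = np.asarray(log_p_ai)
    best_alpha, best_ll = 0.0, -np.inf
    for alpha in grid:
        # log((1 - alpha) * p_i + alpha * q_i), computed stably in log space
        ll = np.logaddexp(np.log1p(-alpha) + log_p_human,
                          np.log(alpha) + log_p_ai).sum()
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha
```

Note that the estimator never labels any single document: only the aggregate mixture weight is reported, which is what makes corpus-level trends detectable even when individual documents are too ambiguous to classify.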

Case Study and Main Findings

The application of this framework to peer reviews from ICLR, NeurIPS, CoRL, and EMNLP conferences reveals significant insights:

  • An estimated 6.5% to 16.9% of review sentences in these conferences were substantially modified by AI.
  • Higher AI modification rates were observed in reviews submitted closer to deadlines, reviews without scholarly citations, and reviews from reviewers who engaged less in the post-rebuttal discussion phase.
  • AI-modified content correlated notably with reduced linguistic and epistemic diversity in reviews, raising concerns about the homogenization of scholarly feedback.

These findings highlight a nuanced picture of AI use in scientific peer review, pointing to both its potential advantages in aiding reviewers and the risks it poses to the integrity and diversity of scholarly discourse.

Theoretical Implications

This paper's theoretical contributions include a robust MLE framework capable of analyzing AI-generated content across large datasets and a detailed case study of its application within the domain of scientific peer review. The methodology provides a generalizable tool for future research into AI's impact across different information ecosystems.

Practical Implications

From a practical standpoint, this research raises important questions about the role of AI in the peer review process. The detected trends in AI use and the associated impact on review content quality and diversity underscore the need for greater transparency and guidelines around AI-assisted writing in scholarly publications. Furthermore, the findings call for interdisciplinary efforts to understand and navigate the evolving landscape of AI-generated content in scientific discourse.

Future Directions

Looking ahead, the paper advocates for continued investigation into the broad implications of LLM use in scientific communication. As AI tools become increasingly sophisticated, understanding their effects on scholarly practices, from peer review to research dissemination, will be critical. Collaborative efforts combining computational, ethical, and sociological perspectives are essential to ensure AI's responsible integration into the scientific community.

Conclusion

The exploration of AI-modified content in AI conference peer reviews post-ChatGPT reveals a complex interplay between technology and scientific communication. By providing a scalable and efficient method for estimating AI influence, this paper contributes valuable tools and insights for navigating the future of AI in academia, urging careful consideration of its benefits and challenges.

Authors (12)
  1. Weixin Liang
  2. Zachary Izzo
  3. Yaohui Zhang
  4. Haley Lepp
  5. Hancheng Cao
  6. Xuandong Zhao
  7. Lingjiao Chen
  8. Haotian Ye
  9. Sheng Liu
  10. Zhi Huang
  11. Daniel A. McFarland
  12. James Y. Zou