
Is ChatGPT Involved in Texts? Measure the Polish Ratio to Detect ChatGPT-Generated Text (2307.11380v2)

Published 21 Jul 2023 in cs.CL

Abstract: The remarkable text-generation capabilities of large language models (LLMs) such as ChatGPT have impressed readers and spurred researchers to devise detectors that mitigate potential risks, including misinformation, phishing, and academic dishonesty. Most previous studies, however, have focused on detectors that differentiate purely ChatGPT-generated texts from human-authored texts; this approach fails on texts produced through human-machine collaboration, such as ChatGPT-polished writing. Addressing this gap, we introduce HPPT, a novel dataset of ChatGPT-polished academic abstracts that facilitates the construction of more robust detectors. It diverges from existing corpora by comprising pairs of human-written and ChatGPT-polished abstracts rather than purely ChatGPT-generated texts. We also propose the "Polish Ratio", a measure of the degree of modification ChatGPT made relative to the original human-written text, which provides a way to quantify ChatGPT's influence on the resulting text. Our experimental results show that the proposed model is more robust on the HPPT dataset and on two existing datasets (HC3 and CDB), and that the Polish Ratio offers a more comprehensive explanation by quantifying the degree of ChatGPT involvement.
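To make the "Polish Ratio" idea concrete, the short sketch below scores an original/polished pair with a normalized character-level Levenshtein distance, so 0 means the text is untouched and 1 means it is fully rewritten. This particular formula and the helper names are illustrative assumptions, not the paper's exact definition.

# Minimal sketch: one plausible "Polish Ratio" based on normalized edit distance.
# The formula and names here are illustrative assumptions, not the paper's definition.

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the classic dynamic program."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute ca -> cb
        prev = curr
    return prev[-1]

def polish_ratio(original: str, polished: str) -> float:
    """Fraction of the text changed, in [0, 1]: 0 = untouched, 1 = rewritten."""
    if not original and not polished:
        return 0.0
    return levenshtein(original, polished) / max(len(original), len(polished))

human = "We propose a method to detect machine generated text."
gpt = "We propose a novel method for detecting machine-generated text."
print(f"Polish Ratio: {polish_ratio(human, gpt):.2f}")

On a paired corpus like HPPT, a score of this kind could serve as a ground-truth label for training a regressor that estimates ChatGPT involvement from the polished text alone.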

Authors (3)
  1. Lingyi Yang (8 papers)
  2. Feng Jiang (97 papers)
  3. Haizhou Li (285 papers)
Citations (18)