ArguGPT: evaluating, understanding and identifying argumentative essays generated by GPT models (2304.07666v2)

Published 16 Apr 2023 in cs.CL

Abstract: AI-generated content (AIGC) presents a considerable challenge to educators around the world. Instructors need to be able to detect such text generated by LLMs, either with the naked eye or with the help of tools. There is also a growing need to understand the lexical, syntactic, and stylistic features of AIGC. To address these challenges in English language teaching, we first present ArguGPT, a balanced corpus of 4,038 argumentative essays generated by 7 GPT models in response to essay prompts from three sources: (1) in-class or homework exercises, (2) TOEFL and (3) GRE writing tasks. Machine-generated texts are paired with a roughly equal number of human-written essays at three score levels, matched in essay prompts. We then hire English instructors to distinguish machine essays from human ones. Results show that when first exposed to machine-generated essays, the instructors achieve only 61% accuracy in detecting them, rising to 67% after one round of minimal self-training. Next, we perform linguistic analyses of these essays, which show that machines produce sentences with more complex syntactic structures, while human essays tend to be lexically more complex. Finally, we test existing AIGC detectors and build our own detectors using SVMs and RoBERTa. Results suggest that a RoBERTa model fine-tuned on the ArguGPT training set achieves above 90% accuracy in both essay- and sentence-level classification. To the best of our knowledge, this is the first comprehensive analysis of argumentative essays produced by generative LLMs. Machine-authored essays in ArguGPT and our models will be made publicly available at https://github.com/huhailinguist/ArguGPT
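The abstract mentions building detectors with SVMs as one baseline. As a minimal sketch of that kind of pipeline, the snippet below trains a linear SVM over TF-IDF features to label essays as machine- or human-written. The toy essays, labels, and the `classify` helper are hypothetical stand-ins for illustration, not the paper's actual features or data (the paper's SVM is trained on the ArguGPT corpus).

```python
# Minimal sketch of an SVM-based AIGC detector: TF-IDF features plus a
# linear SVM, a classic text-classification pipeline. The essays and
# labels below are toy examples, not data from ArguGPT.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

essays = [
    "Furthermore, it is widely acknowledged that technology improves education.",
    "Moreover, one must consider that online learning offers great flexibility.",
    "honestly i think phones in class are kinda distracting for most students",
    "my teacher said essays should have a clear point, so here is mine",
]
labels = ["machine", "machine", "human", "human"]  # toy gold labels

# Unigram + bigram TF-IDF, then a linear-kernel SVM classifier
detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
detector.fit(essays, labels)

def classify(essay: str) -> str:
    """Return 'machine' or 'human' for a single essay."""
    return detector.predict([essay])[0]
```

The paper's stronger detector replaces the TF-IDF/SVM stage with a fine-tuned RoBERTa classifier; the surrounding train/predict interface stays the same.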

Authors (8)
  1. Yikang Liu
  2. Ziyin Zhang
  3. Wanyang Zhang
  4. Shisen Yue
  5. Xiaojing Zhao
  6. Xinyuan Cheng
  7. Yiwen Zhang
  8. Hai Hu
Citations (39)