MUGC: Machine Generated versus User Generated Content Detection (2403.19725v1)
Abstract: As advanced modern systems like deep neural networks (DNNs) and generative AI continue to improve at producing convincing and realistic content, the need to distinguish between user-generated and machine-generated content is becoming increasingly evident. In this research, we undertake a comparative evaluation of eight traditional machine-learning algorithms for distinguishing machine-generated from human-generated data across three diverse datasets: Poems, Abstracts, and Essays. Our results indicate that traditional methods achieve high accuracy in identifying machine-generated data, consistent with the documented effectiveness of popular pre-trained models like RoBERTa. We observe that machine-generated texts tend to be shorter and exhibit less word variety than human-generated content. While domain-specific keywords commonly used by humans but overlooked by current LLMs may contribute to this high detection accuracy, we show that deeper word representations like word2vec can capture subtle semantic variances. Furthermore, comparisons of readability, bias, moral, and affect dimensions reveal a discernible contrast between machine-generated and human-generated content, along with variations in expression style and potential underlying biases in the two data sources. This study provides valuable insights into the advancing capabilities of, and challenges associated with, machine-generated content across various domains.
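The paper itself does not ship code; as a minimal sketch of the kind of pipeline the abstract describes, the snippet below trains one traditional classifier on averaged word2vec document vectors to separate human- from machine-generated text. The toy corpora, variable names, and the random-forest choice are illustrative assumptions, not the authors' actual data, features, or model selection.

```python
# Minimal sketch (not the authors' code): one traditional classifier on
# averaged word2vec features for human- vs. machine-generated text.
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy corpora standing in for the paper's Poems,
# Abstracts, and Essays datasets.
human_texts = [
    "the quiet river carries the evening light back home",
    "grandmother kept folded letters inside her recipe book",
]
machine_texts = [
    "a river is a natural stream of water that flows",
    "letters are written documents exchanged between people",
]

texts = human_texts + machine_texts
labels = [0] * len(human_texts) + [1] * len(machine_texts)  # 1 = machine-generated

# Train word2vec on the corpus itself; real experiments would use far
# more text or pre-trained vectors.
tokenized = [t.lower().split() for t in texts]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1, seed=0)

def doc_vector(tokens, model):
    """Average the word vectors of a document; zeros if no token is known."""
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([doc_vector(t, w2v) for t in tokenized])
y = np.array(labels)

# One representative traditional algorithm; the paper compares eight.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```

The readability, bias, moral, and affect comparisons mentioned in the abstract would rely on separate scoring tools (e.g., standard readability formulas); they are outside the scope of this classification sketch.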
Authors: Yaqi Xie, Anjali Rawal, Yujing Cen, Dixuan Zhao, Shanu Sushmita, Sunil K Narang