Deep Learning Detection Method for Large Language Models-Generated Scientific Content (2403.00828v1)
Abstract: Large language models (LLMs), such as GPT-3 and BERT, are reshaping how textual content is written and communicated. These models have the potential to generate scientific content that is indistinguishable from human-written text. LLMs therefore pose serious risks for the scientific community, which relies on the integrity and reliability of publications. This paper presents AI-Catcher, a novel method for detecting ChatGPT-generated scientific text. AI-Catcher integrates two deep learning models: a multilayer perceptron (MLP) and a convolutional neural network (CNN). The MLP learns feature representations from linguistic and statistical features, while the CNN extracts high-level representations of sequential patterns from the textual content. AI-Catcher is a multimodal model that fuses the hidden patterns derived from the MLP and the CNN. In addition, a new dataset of ChatGPT-generated scientific text, AIGTxt, is collected to enhance AI-generated text detection tools. AIGTxt contains 3000 records collected from published academic articles across ten domains, divided into three classes: human-written, ChatGPT-generated, and mixed text. Several experiments evaluate the performance of AI-Catcher. The comparative results show that AI-Catcher distinguishes between human-written and ChatGPT-generated scientific text more accurately than alternative methods, improving accuracy by 37.4% on average.
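The two-branch design described in the abstract can be illustrated with a minimal forward-pass sketch in plain Python. This is not the authors' implementation: the abstract does not state the layer sizes, the feature set, or the fusion mechanism, so the names (`FusionSketch`, `conv1d_maxpool`), the dimensions, the random untrained weights, and fusion by simple concatenation are all assumptions made for illustration. The three output classes follow the AIGTxt class labels.

```python
import math
import random

# Class labels taken from the AIGTxt description in the abstract.
CLASSES = ["Human-written", "ChatGPT-generated", "Mixed"]


def dense(x, weights, bias):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def conv1d_maxpool(embedded, filters, bias):
    """1-D convolution over token positions, then ReLU and global max pooling."""
    pooled = []
    for filt, b in zip(filters, bias):
        width = len(filt)
        best = 0.0  # starting at 0 applies the ReLU floor
        for start in range(len(embedded) - width + 1):
            s = b
            for k in range(width):
                s += sum(w * e for w, e in zip(filt[k], embedded[start + k]))
            best = max(best, s)
        pooled.append(best)
    return pooled


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class FusionSketch:
    """Hypothetical MLP + CNN fusion model with random, untrained weights."""

    def __init__(self, n_features, vocab_size, embed_dim=8,
                 n_filters=4, width=3, hidden=6, seed=0):
        rng = random.Random(seed)
        # MLP branch: handcrafted linguistic/statistical features -> hidden vector.
        self.mlp_w = rand_matrix(rng, hidden, n_features)
        self.mlp_b = [0.0] * hidden
        # CNN branch: token ids -> embeddings -> convolution filters.
        self.embedding = rand_matrix(rng, vocab_size, embed_dim)
        self.filters = [rand_matrix(rng, width, embed_dim) for _ in range(n_filters)]
        self.conv_b = [0.0] * n_filters
        # Classifier over the fused (concatenated) representation.
        self.out_w = rand_matrix(rng, len(CLASSES), hidden + n_filters)
        self.out_b = [0.0] * len(CLASSES)

    def forward(self, features, token_ids):
        mlp_hidden = relu(dense(features, self.mlp_w, self.mlp_b))
        embedded = [self.embedding[t] for t in token_ids]
        cnn_hidden = conv1d_maxpool(embedded, self.filters, self.conv_b)
        fused = mlp_hidden + cnn_hidden  # assumed fusion: list concatenation
        return softmax(dense(fused, self.out_w, self.out_b))


# Made-up inputs: four feature values and six token ids for one document.
model = FusionSketch(n_features=4, vocab_size=50)
probs = model.forward([0.3, 1.2, 0.0, 0.7], [5, 17, 3, 42, 8, 21])
print(dict(zip(CLASSES, probs)))
```

Because the weights are random, the printed probabilities carry no meaning; the sketch only shows how a feature vector and a token sequence for the same document might pass through separate branches before a single classifier sees both representations.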
Authors: Bushra Alhijawi, Rawan Jarrar, Aseel AbuAlRub, Arwa Bader