BERT-Enhanced Retrieval Tool for Homework Plagiarism Detection System (2404.01582v2)

Published 1 Apr 2024 in cs.CL, cs.AI, and cs.IR

Abstract: Text plagiarism detection is a common natural language processing task that aims to determine whether a given text plagiarizes or copies from other texts. In existing research, detecting high-level plagiarism remains a challenge due to the lack of high-quality datasets. In this paper, we propose a plagiarized-text generation method based on GPT-3.5 that produces a dataset of 32,927 text pairs covering a wide range of plagiarism techniques, bridging this gap in the research. We also propose a plagiarism identification method based on Faiss with BERT that achieves both high efficiency and high accuracy. Our experiments show that this model outperforms other models on several metrics, achieving 98.86% Accuracy, 98.90% Precision, 98.86% Recall, and an F1 Score of 0.9888. Finally, we provide a user-friendly demo platform that allows users to upload a text library and participate intuitively in the plagiarism analysis.
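The retrieval step the abstract describes (encode texts with BERT, then find near-duplicates by similarity search with Faiss) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: real BERT embeddings and a Faiss inner-product index are assumed in the paper, while here the embeddings are mocked as toy vectors, the search is implemented directly in NumPy, and the flagging threshold is hypothetical.

```python
import numpy as np

def normalize(vectors: np.ndarray) -> np.ndarray:
    """L2-normalize rows so that inner product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def search(index: np.ndarray, query: np.ndarray, k: int = 3):
    """Return the top-k (doc_id, cosine score) pairs for one query vector."""
    q = query / max(np.linalg.norm(query), 1e-12)
    scores = index @ q                      # inner product on unit vectors
    order = np.argsort(-scores)[:k]         # highest scores first
    return [(int(i), float(scores[i])) for i in order]

# Toy "library" of 4 mock embedding vectors (stand-ins for BERT outputs).
corpus = normalize(np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],   # near-duplicate of doc 0
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
], dtype=np.float64))

# A submitted text whose (mock) embedding is close to doc 0.
hits = search(corpus, np.array([0.95, 0.05, 0.0]), k=2)

THRESHOLD = 0.9  # hypothetical cut-off for flagging a pair as plagiarism
flagged = [doc for doc, score in hits if score >= THRESHOLD]
```

In a full pipeline, the cosine scores of the retrieved candidates would feed the classification metrics reported above; normalizing vectors and using an inner-product index is the standard way to get cosine search out of Faiss-style indexes.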

