How Large Language Models are Transforming Machine-Paraphrased Plagiarism (2210.03568v3)

Published 7 Oct 2022 in cs.CL and cs.AI

Abstract: The recent success of LLMs for text generation poses a severe threat to academic integrity, as plagiarists can generate realistic paraphrases indistinguishable from original work. However, the role of large autoregressive transformers in generating machine-paraphrased plagiarism and their detection is still developing in the literature. This work explores T5 and GPT-3 for machine-paraphrase generation on scientific articles from arXiv, student theses, and Wikipedia. We evaluate the detection performance of six automated solutions and one commercial plagiarism detection software and perform a human study with 105 participants regarding their detection performance and the quality of generated examples. Our results suggest that large models can rewrite text humans have difficulty identifying as machine-paraphrased (53% mean acc.). Human experts rate the quality of paraphrases generated by GPT-3 as high as original texts (clarity 4.0/5, fluency 4.2/5, coherence 3.8/5). The best-performing detection model (GPT-3) achieves a 66% F1-score in detecting paraphrases.

This paper investigates the challenge posed by LLMs like T5 and GPT-3 in generating machine-paraphrased text that can evade plagiarism detection (Wahle et al., 2022). It explores the generation of such paraphrases, evaluates human and automated detection capabilities, and assesses the quality of the generated text.

Paraphrase Generation Methodology

  1. Models Used: T5 (up to 11B parameters) and GPT-3 (up to 175B parameters).
  2. Generation Technique: Few-shot learning was employed. The models were prompted with examples of original text and corresponding human paraphrases, followed by the text to be paraphrased. AutoPrompt was used to optimize the task instructions (e.g., "Rephrase the following paragraph while keeping its meaning:").
    # Example Prompt Structure (Conceptual)
    Prompt: "Rephrase the following paragraph while keeping its meaning:"
    Example 1 Original: [Original Text 1]
    Example 1 Paraphrased: [Human Paraphrase 1]
    ...
    Example N Original: [Original Text N]
    Example N Paraphrased: [Human Paraphrase N]
    Text to Paraphrase: [Input Text]
    Model Output: [Generated Paraphrase]
  3. Candidate Selection: To generate high-quality, non-trivial paraphrases, multiple candidates were generated for each input, and the best candidate was selected from the Pareto front of two objectives (a minimal sketch of this selection step follows this list):
    • Maximize semantic similarity (measured with BERTScore and BARTScore).
    • Minimize lexical overlap (measured with ROUGE-L and BLEU).
    This favors paraphrases that retain the original meaning while using different wording and sentence structures.
  4. Dataset: A dataset of 200,000 paraphrased examples was created using the best configurations of T5 and GPT-3. The original texts were sourced from:
    • arXiv preprints (20,966)
    • Wikipedia articles (39,241)
    • Student graduation theses (5,226)
    Human paraphrase examples for few-shot learning were drawn from the P4P and PPDB 2.0 datasets.
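
The following Python sketch illustrates the Pareto-style candidate selection described in item 3. It is a minimal reconstruction under stated assumptions, not the authors' released code: the semantic and overlap values are assumed to be precomputed per candidate (from BERTScore/BARTScore and ROUGE-L/BLEU, respectively), and the tie-breaker on the front is one plausible choice, since the summary does not specify how ties are resolved.

    # Minimal sketch of Pareto-optimal paraphrase selection (not the authors' code).
    # Assumption: `semantic` aggregates BERTScore/BARTScore and `overlap` aggregates
    # ROUGE-L/BLEU, both precomputed for each generated candidate.
    from dataclasses import dataclass

    @dataclass
    class Candidate:
        text: str
        semantic: float  # higher is better: meaning preserved
        overlap: float   # lower is better: wording changed

    def dominates(a: Candidate, b: Candidate) -> bool:
        """a dominates b if a is no worse on both objectives and strictly better on one."""
        return (a.semantic >= b.semantic and a.overlap <= b.overlap
                and (a.semantic > b.semantic or a.overlap < b.overlap))

    def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
        """Keep every candidate that no other candidate dominates."""
        return [c for c in candidates
                if not any(dominates(o, c) for o in candidates if o is not c)]

    def select_best(candidates: list[Candidate]) -> Candidate:
        # One plausible tie-breaker on the front: largest semantic-minus-overlap margin.
        return max(pareto_front(candidates), key=lambda c: c.semantic - c.overlap)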

Evaluation and Findings

  1. Impact of Model Size: Larger models generally produce paraphrases with higher semantic similarity to the original text. GPT-3 (175B) achieved the highest semantic scores. However, larger models also tend to produce text with higher lexical overlap unless filtered using the candidate selection criteria.
  2. Human Detection:
    • A human study with 105 participants (AMT workers and experts) evaluated the ability to distinguish original text from machine-paraphrased text.
    • Humans could identify paraphrases from simpler tools like SpinnerChief with high accuracy (82%).
    • However, accuracy dropped significantly for LLM-generated paraphrases:
      • T5 (11B): 56% accuracy
      • GPT-3 (175B): 53% accuracy (barely above chance)
    • This highlights the difficulty humans face in detecting sophisticated machine paraphrasing.
  3. Quality Assessment:
    • Participants rated paraphrases on clarity, fluency, and coherence (1-5 Likert scale).
    • GPT-3 (175B) paraphrases received ratings nearly as high as original human-written text (Clarity: 3.92 vs 3.98, Fluency: 3.60 vs 4.21, Coherence: 3.72 vs 3.81).
    • Paraphrases from simpler tools (SpinnerChief) were rated significantly lower.
  4. Automated Detection:
    • Various automated methods were tested: PlagScan (commercial software), Naive Bayes + word2vec, BERT, RoBERTa, Longformer, T5 (11B), and GPT-3 (175B).
    • Performance on Simpler Paraphrases (SpinnerChief, BERT): Neural models (BERT, RoBERTa, Longformer, T5, GPT-3) performed well, significantly outperforming PlagScan and the Naive Bayes approach. T5 and GPT-3 achieved the highest F1 scores (up to 87%).
    • Performance on GPT-3 Paraphrases: Most detection methods struggled significantly.
      • PlagScan, Naive Bayes, BERT, RoBERTa, and Longformer performed poorly, often close to random chance (around 50% F1).
      • Humans also performed poorly (50-55% accuracy).
      • The models used for generation performed best, but still with modest success: T5 detector achieved ~60-63% F1, and the GPT-3 detector achieved ~64-66% F1.
    • This suggests that detecting LLM-generated paraphrases may require similarly powerful LLMs, potentially fine-tuned specifically for the detection task (a minimal fine-tuning sketch follows this list).
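
As a concrete illustration of the fine-tuning route, the sketch below frames detection as binary sequence classification with the Hugging Face transformers library. The roberta-base backbone and the two in-memory training examples are illustrative assumptions, not the paper's exact setup.

    # Minimal sketch: machine-paraphrase detection as binary sequence classification.
    # Assumptions: roberta-base backbone and toy in-memory data are illustrative;
    # the paper evaluated several detectors (BERT, RoBERTa, Longformer, T5, GPT-3).
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

    # Label 0 = original text, label 1 = machine-paraphrased.
    texts = ["An original paragraph ...", "A machine-paraphrased paragraph ..."]
    labels = torch.tensor([0, 1])

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    model.train()
    outputs = model(**batch, labels=labels)  # cross-entropy loss computed internally
    outputs.loss.backward()                  # one illustrative training step
    optimizer.step()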

Practical Implications and Implementation Considerations

  • Challenge to Academic Integrity: LLMs significantly increase the threat of undetectable plagiarism. Paraphrases are high-quality and difficult for both humans and existing tools to identify.
  • Inadequacy of Current Tools: Traditional plagiarism detection software relying heavily on lexical matching (like PlagScan in these tests) is largely ineffective against LLM paraphrasing techniques that alter sentence structure and word choice while preserving meaning.
  • Need for Advanced Detection: Effective detection likely requires semantic understanding. The results suggest that models similar in architecture and scale to the generating models (e.g., using large autoregressive transformers) might be necessary. Implementing such detectors would involve:
    • Fine-tuning LLMs (like T5 or potentially smaller, optimized models) on datasets containing original/LLM-paraphrased pairs.
    • Using few-shot prompting with LLMs like GPT-3 for detection, mirroring the generation setup but framed as a classification task (see the prompt sketch after this list).
  • Dataset Availability: The paper provides a valuable dataset (https://github.com/jpwahle/emnlp22-transforming) specifically designed for training and evaluating detectors for LLM-paraphrased text.
  • Computational Cost: Implementing LLM-based detectors (especially using models like GPT-3 or large T5) requires significant computational resources for inference, which could be a barrier for widespread deployment.
  • Ethical Concerns: False positives in plagiarism detection have severe consequences. Any automated system, especially sophisticated ones, must be used cautiously, likely as a tool to flag suspicious text for expert human review rather than for automated judgment.
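
The few-shot detection route could look like the sketch below, which builds a classification-style prompt from labeled examples. The prompt wording, the example texts, and the query_llm callable are hypothetical: the paper does not publish an exact detection prompt, and query_llm stands in for whatever LLM completion call is used.

    # Minimal sketch: few-shot prompting for detection, framed as classification.
    # The example texts, prompt wording, and query_llm() are hypothetical stand-ins.
    LABELED_EXAMPLES = [
        ("An original paragraph ...", "original"),
        ("A machine-paraphrased paragraph ...", "machine-paraphrased"),
    ]

    def build_detection_prompt(text: str) -> str:
        lines = ["Classify each paragraph as 'original' or 'machine-paraphrased'.", ""]
        for example, label in LABELED_EXAMPLES:
            lines += [f"Paragraph: {example}", f"Label: {label}", ""]
        lines += [f"Paragraph: {text}", "Label:"]
        return "\n".join(lines)

    def detect(text: str, query_llm) -> str:
        """query_llm: a callable that takes a prompt string and returns the completion."""
        completion = query_llm(build_detection_prompt(text))
        return "machine-paraphrased" if "machine" in completion.lower() else "original"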

In conclusion, the paper demonstrates that LLMs can produce high-quality paraphrases that are extremely challenging to detect using current methods. This necessitates the development of new, more sophisticated detection approaches, potentially leveraging LLMs themselves, and highlights the urgent need for the academic community to address this evolving form of plagiarism.

Authors (4)
  1. Jan Philip Wahle (31 papers)
  2. Terry Ruas (46 papers)
  3. Frederic Kirstein (8 papers)
  4. Bela Gipp (98 papers)
Citations (25)