
All That Glitters is Not Novel: Plagiarism in AI Generated Research (2502.16487v2)

Published 23 Feb 2025 in cs.CL

Abstract: Automating scientific research is considered the final frontier of science. Recently, several papers claim autonomous research agents can generate novel research ideas. Amidst the prevailing optimism, we document a critical concern: a considerable fraction of such research documents are smartly plagiarized. Unlike past efforts where experts evaluate the novelty and feasibility of research ideas, we request $13$ experts to operate under a different situational logic: to identify similarities between LLM-generated research documents and existing work. Concerningly, the experts identify $24\%$ of the $50$ evaluated research documents to be either paraphrased (with one-to-one methodological mapping), or significantly borrowed from existing work. These reported instances are cross-verified by authors of the source papers. Experts find an additional $32\%$ ideas to partially overlap with prior work, and a small fraction to be completely original. Problematically, these LLM-generated research documents do not acknowledge original sources, and bypass inbuilt plagiarism detectors. Lastly, through controlled experiments we show that automated plagiarism detectors are inadequate at catching plagiarized ideas from such systems. We recommend a careful assessment of LLM-generated research, and discuss the implications of our findings on academic publishing.

Authors (2)
  1. Tarun Gupta (16 papers)
  2. Danish Pruthi (28 papers)

Summary

The paper "All That Glitters is Not Novel: Plagiarism in AI Generated Research" addresses the critical issue of plagiarism in research documents generated by LLMs. The authors conduct an expert-led evaluation to identify similarities between LLM-generated research documents and existing work. They find that a significant fraction of these documents are either paraphrased or significantly borrowed from existing work without proper acknowledgment. Furthermore, the paper demonstrates that automated plagiarism detectors are often inadequate at catching deliberately plagiarized ideas from LLMs.

The paper's methodology involves generating a set of research proposals using the code from Si et al. [si2024can]. These proposals, along with exemplar papers from Lu et al. [lu2024ai] and proposals showcased in Si et al. [si2024can], are evaluated by $13$ experts. The experts are instructed to presume plagiarism and actively search for it in the LLM-generated research documents. They assign a score from $1$ to $5$ based on the similarity between the LLM-generated content and existing work, where a score of $5$ indicates direct copying and a score of $4$ indicates significant borrowing. Reported instances of plagiarism are then cross-verified with the authors of the source papers.
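To make the rubric concrete, here is a minimal sketch of how expert scores map onto the plagiarism threshold used throughout the study; the function names and sample scores are illustrative, not from the paper:

```python
# Minimal sketch of aggregating expert scores. Per the summary:
# 5 = direct copying, 4 = significant borrowing, and scores >= 4
# count as plagiarism. The sample data below is hypothetical.

def is_plagiarized(score: int) -> bool:
    """A document is flagged as plagiarized if its expert score is 4 or 5."""
    return score >= 4

def plagiarism_rate(scores: list[int]) -> float:
    """Fraction of documents whose score meets the plagiarism threshold."""
    flagged = sum(is_plagiarized(s) for s in scores)
    return flagged / len(scores)

# Hypothetical scores for a handful of documents:
sample_scores = [5, 2, 4, 1, 3, 4, 5, 2]
print(f"Plagiarism rate: {plagiarism_rate(sample_scores):.1%}")  # 50.0%
```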

The experimental setup is illustrated in Figure 1, which shows the steps from generating research proposals to expert assessment and verification. The evaluation covers $12$ natural language processing topics (Table 5) and totals $50$ research documents: $36$ research proposals generated using the code from Si et al. [si2024can], $4$ proposals showcased in Si et al. [si2024can], and $10$ research papers showcased in Lu et al. [lu2024ai]. For each document, experts identify candidate source papers and assign a similarity score, with scores of $4$ or higher counting as plagiarism.

The results of the expert-led analysis reveal that $24.0\%$ of the LLM-generated research documents are plagiarized. This includes $14.0\%$ with a score of $5$ and $10.0\%$ with a score of $4$. When including claims where source-paper authors are unreachable, these numbers increase to $18.0\%$ for each score, amounting to $36.0\%$ of the examined proposals. The authors also find that several previously showcased exemplars of LLM-generated research are plagiarized.
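Since the evaluation covers $50$ documents, these percentages correspond to whole document counts; the short calculation below makes the arithmetic explicit (variable names are illustrative):

```python
# The percentages above, unpacked into document counts over the 50
# evaluated research documents (36 + 4 + 10, per the setup section).
total = 36 + 4 + 10                    # = 50 documents
score_5 = round(0.14 * total)          # 7 documents rated 5 (direct copying)
score_4 = round(0.10 * total)          # 5 documents rated 4 (significant borrowing)
print(score_5 + score_4, (score_5 + score_4) / total)  # 12 documents, 0.24

# Upper bound, counting claims the source-paper authors could not verify:
upper = 2 * round(0.18 * total)        # 9 documents per score level
print(upper, upper / total)            # 18 documents, 0.36
```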

The authors also evaluate the effectiveness of automated plagiarism detection methods. They create a synthetic dataset of research proposals intentionally plagiarized from existing papers and evaluate common automated detection methods, including Semantic Scholar Augmented Generation (SSAG), OpenScholar, and a commercial service (Turnitin). The results show that these methods are inadequate for detecting plagiarism in LLM-generated research proposals.
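The retrieve-then-ask pattern behind SSAG-style detection can be sketched as follows. This is not the paper's implementation: the Semantic Scholar Graph API endpoint is real, but the query construction, the prompt, and the omitted LLM-judge call are illustrative assumptions.

```python
# Rough sketch of retrieve-then-ask plagiarism screening. The endpoint
# is the public Semantic Scholar Graph API; everything else (prompt
# wording, scoring scheme, LLM call) is a placeholder.
import requests

def search_semantic_scholar(query: str, limit: int = 10) -> list[dict]:
    """Retrieve candidate papers (title + abstract) for a text query."""
    resp = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": query, "limit": limit, "fields": "title,abstract"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

def build_comparison_prompt(proposal: str, candidates: list[dict]) -> str:
    """Assemble a prompt asking an LLM judge to flag overlap."""
    refs = "\n\n".join(
        f"Title: {c['title']}\nAbstract: {c.get('abstract') or 'N/A'}"
        for c in candidates
    )
    return (
        "Does the following research proposal substantially overlap with "
        "any of these papers? Name the closest match and give a 1-5 score.\n\n"
        f"Proposal:\n{proposal}\n\nCandidate papers:\n{refs}"
    )

# The assembled prompt would then be sent to an LLM judge (call omitted).
```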

The related work section discusses previous studies on the novelty of LLM-generated research and automated plagiarism detection tools. It notes that previous studies evaluate novelty using automated LLM-based judges or rely on small sets of experts. The paper also discusses studies that integrate LLMs with academic search engines to detect plagiarism. It references OpenScholar, a retrieval-augmented language model that leverages a database of $45$ million open-access papers with $237$ million passage embeddings, as a potential tool for embedding-based plagiarism detection. The related work also examines recent research exploring LLMs' capabilities in various research tasks, such as predicting experimental outcomes, conducting research experiments, paper reviewing, and related-work generation.
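As a toy illustration of what embedding-based detection involves, the snippet below embeds a tiny stand-in corpus and finds the nearest passage by cosine similarity, using the sentence-transformers library; OpenScholar's actual retriever, index, and $237$-million-passage corpus differ substantially.

```python
# Toy illustration of embedding-based nearest-passage lookup. The model
# name and two-passage corpus are stand-ins for a real retrieval index.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, widely used encoder

corpus = [
    "Uncertainty quantification for black-box language models ...",
    "A Swiss-system tournament ranking for generated ideas ...",
]
corpus_emb = model.encode(corpus, normalize_embeddings=True)

def most_similar_passage(proposal_text: str) -> tuple[int, float]:
    """Return the index and cosine similarity of the closest corpus passage."""
    q = model.encode([proposal_text], normalize_embeddings=True)
    sims = corpus_emb @ q[0]  # cosine similarity (embeddings are normalized)
    best = int(np.argmax(sims))
    return best, float(sims[best])

idx, sim = most_similar_passage("Estimating confidence of closed LLM outputs ...")
print(corpus[idx], sim)
```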

The methodology used to generate research proposals is described in detail, with particular attention to the plagiarism detection module. The process consists of six sequential steps, with Claude $3.5$ Sonnet as the backbone LLM. The first step uses a retrieval-augmented generation (RAG) system to retrieve and rank relevant papers via the Semantic Scholar API. The second step generates initial seed ideas from these retrieved papers, while the third step deduplicates them using text embeddings. The fourth step expands the surviving seed ideas into detailed project proposals, and the fifth step runs a Swiss tournament to identify the strongest candidates. The final step attempts to detect potential plagiarism through Semantic Scholar Augmented Generation (SSAG).
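A runnable skeleton of this six-step pipeline is given below; every helper is a trivial stub (the real system from Si et al. calls Claude 3.5 Sonnet and the Semantic Scholar API), and only the step ordering follows the description above.

```python
# Skeleton of the six-step pipeline. All helpers are placeholder stubs;
# only the sequence of steps reflects the summary above.
import random

def retrieve_and_rank_papers(topic: str) -> list[str]:
    # Step 1: RAG retrieval via Semantic Scholar (stubbed).
    return [f"paper about {topic} #{i}" for i in range(3)]

def generate_seed_ideas(topic: str, papers: list[str]) -> list[str]:
    # Step 2: seed ideas conditioned on retrieved papers (stubbed).
    return [f"idea from {p}" for p in papers]

def deduplicate_by_embedding(ideas: list[str]) -> list[str]:
    # Step 3: embedding-based dedup (stubbed as exact-match dedup).
    return list(dict.fromkeys(ideas))

def expand_to_proposal(idea: str) -> str:
    # Step 4: expand a seed idea into a full project proposal (stubbed).
    return f"Proposal: {idea}"

def swiss_tournament_rank(proposals: list[str]) -> list[str]:
    # Step 5: Swiss-tournament pairwise judging (stubbed as a shuffle).
    return sorted(proposals, key=lambda _: random.random())

def ssag_flags_plagiarism(proposal: str) -> bool:
    # Step 6: SSAG plagiarism check (stubbed to pass everything).
    return False

def generate_proposals(topic: str) -> list[str]:
    papers = retrieve_and_rank_papers(topic)
    seeds = deduplicate_by_embedding(generate_seed_ideas(topic, papers))
    proposals = [expand_to_proposal(s) for s in seeds]
    ranked = swiss_tournament_rank(proposals)
    return [p for p in ranked if not ssag_flags_plagiarism(p)]

print(generate_proposals("uncertainty quantification"))
```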

The paper presents a case study of a research proposal titled "Semantic Resonance Uncertainty Quantification," which appears to be plagiarized from an existing paper, "Generating with Confidence: Uncertainty Quantification for Black-box LLMs" [lin2023generating]. The proposed methodology exhibits a clear one-to-one mapping with the original paper, with each component of the LLM-generated proposal corresponding to specific sections in the source paper. This case study supports the thesis that LLM-generated research ideas may not be as novel as previously thought, and that experts may be fooled into considering them novel without a skeptical eye for plagiarism.

The authors also discuss the implications of their findings for academic publishing. They suggest that the widespread adoption of LLM tools could lead to an increase in publications with improper citations or inadvertent plagiarism. They also note that the sophisticated nature of the plagiarism would require conference and journal reviewers to spend more time searching for potential content misappropriation.

The paper notes that existing automated plagiarism detection methods are inadequate and that manual evaluation by domain experts is time-consuming and laborious. Future work should therefore aim at developing methods to identify candidate source papers. The authors also suggest exploring post-training strategies that could reduce plagiarism in LLM-generated research content, and examining whether LLM-generated content directly copies or significantly borrows from copyrighted materials.

The paper identifies a few limitations. First, automating the detection of the original papers that may have been plagiarized remains a challenge. Second, the expert evaluation design may introduce confirmation bias, leading experts to assign higher similarity scores in order to complete their task; to mitigate this, the authors had the source papers' authors verify the reported instances and adjust scores downward where relevant. Third, in the automated plagiarism detection experiments, the authors limit Semantic Scholar queries to a maximum of $5$ iterations, fewer than the $10$ iterations employed in some previous studies.

In conclusion, the authors present a systematic study of plagiarism in LLM-generated research documents. The paper reveals significant levels of plagiarism in these documents and demonstrates the inadequacy of current automated plagiarism detection methods. The findings raise important concerns about the potential widespread use of LLMs for research ideation and highlight the need for better plagiarism detection methods.