Detection of tortured phrases in scientific literature (2402.03370v1)
Abstract: This paper presents several automatic methods for detecting so-called tortured phrases in scientific papers. Tortured phrases, e.g. "flag to clamor" instead of "signal to noise", are produced by paraphrasing tools used to evade plagiarism detection. We built a dataset and evaluated several strategies for flagging previously undocumented tortured phrases. The methods we propose and test rely on LLMs, using either embedding similarities or masked-token prediction. We found that an approach that uses token prediction and propagates the scores to the chunk level gives the best results. With a recall of 0.87 and a precision of 0.61, it could retrieve new tortured phrases to be submitted to domain experts for validation.
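To make the idea of propagating token-level scores to the chunk level concrete, here is a minimal sketch in Python. It is not the paper's implementation: the abstract does not specify the aggregation rule, the chunk size, or the decision threshold, so the mean aggregation, the three-token window, the 0.2 threshold, and the per-token probabilities below are all illustrative assumptions. The names `chunk_scores` and `flag_chunks` are hypothetical.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def chunk_scores(
    tokens: Sequence[str],
    token_probs: Sequence[float],
    chunk_size: int = 3,
    aggregate: Callable[[Sequence[float]], float] = mean,
) -> List[Tuple[str, float]]:
    """Score every sliding window of `chunk_size` consecutive tokens.

    `token_probs[i]` is assumed to be the probability a masked language
    model assigns to `tokens[i]` when that position is masked. Producing
    these probabilities is outside the scope of this sketch.
    """
    if len(tokens) != len(token_probs):
        raise ValueError("tokens and token_probs must have the same length")
    scored = []
    for start in range(len(tokens) - chunk_size + 1):
        window = token_probs[start:start + chunk_size]
        phrase = " ".join(tokens[start:start + chunk_size])
        # Propagate token-level scores to the chunk (assumed rule: mean).
        scored.append((phrase, aggregate(window)))
    return scored


def flag_chunks(scored: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Return chunks whose aggregated score falls below `threshold`."""
    return [phrase for phrase, score in scored if score < threshold]


# Made-up probabilities for illustration only; not results from the paper.
tokens = ["improves", "the", "flag", "to", "clamor", "ratio"]
probs = [0.40, 0.70, 0.002, 0.50, 0.001, 0.30]

candidates = flag_chunks(chunk_scores(tokens, probs), threshold=0.2)
print(candidates)
```

With these invented numbers, only the window covering "flag to clamor" falls below the threshold, since it contains two improbable tokens. A different aggregation such as `min` would flag every window that touches a single unlikely token, which illustrates why the choice of propagation rule affects the precision/recall trade-off.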