Detection of metadata manipulations: Finding sneaked references in the scholarly literature (2501.03771v1)

Published 7 Jan 2025 in cs.DL

Abstract: We report evidence of a new set of sneaked references discovered in the scientific literature. Sneaked references are references registered in the metadata of publications without being listed in reference section or in the full text of the actual publications where they ought to be found. We document here 80,205 references sneaked in metadata of the International Journal of Innovative Science and Research Technology (IJISRT). These sneaked references are registered with Crossref and all cite -- thus benefit -- this same journal. Using this dataset, we evaluate three different methods to automatically identify sneaked references. These methods compare reference lists registered with Crossref against the full text or the reference lists extracted from PDF files. In addition, we report attempts to scale the search for sneaked references to the scholarly literature.

Summary

The paper introduces three detection methods—list length comparison, last item, and full text matching—to uncover metadata manipulation.
The study reveals over 80,000 sneaked references benefiting one journal, with some DOIs receiving more than 6,000 unwarranted citations.
The findings underscore risks to citation metrics and call for robust metadata verification to preserve academic integrity.

Detection of Metadata Manipulations: Sneaked References in Scholarly Literature

The paper "Detection of Metadata Manipulations: Finding Sneaked References in the Scholarly Literature" investigates sneaked references, a form of citation gaming in which references are added to a publication's registered metadata to inflate citation metrics artificially. Focusing on the International Journal of Innovative Science and Research Technology (IJISRT), the paper documents 80,205 sneaked references, all of which cite this same journal. These references appear in the metadata registered with Crossref but are absent from the reference sections and full texts of the corresponding articles.

Methodology and Key Techniques

The paper evaluates three distinct methods to identify sneaked references:

List Length Comparison Method ($\mathcal{M}_0$): Proposed in previous work, this method estimates the number of sneaked references by comparing the length of the reference list registered with Crossref against the length of the list extracted from the PDF file. Because it relies on length discrepancies alone, it risks overestimating: an extraction that misses genuine references, for example, also produces a surplus.
Last Item Comparative Method ($\mathcal{M}_1$): This method compares the ordered Crossref list with the reference list extracted from the PDF file by Grobid, looking at the last items of the two lists. References are flagged as sneaked when the registered list clearly contains extra entries. Its effectiveness depends largely on accurate reference extraction, and the extracted lists may need cleaning to remove truncated or hallucinated entries.
Full Text Matching Method ($\mathcal{M}_2$): Considered the most capable of the three, this method matches each element of the Crossref record against strings from the full text of the PDF file, using the Levenshtein distance to allow partial matches. Elements with no corresponding string in the text are flagged as sneaked references.
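
The three methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the normalization step, the line-by-line matching, and the 0.8 similarity threshold are all assumptions made for illustration, and the reference strings are made up. The last-item function reflects one plausible reading of $\mathcal{M}_1$, in which entries registered after the last extracted reference are flagged.

```python
import re

def normalize(s: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()

def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical after normalization."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def m0_surplus(registered: list, extracted: list) -> int:
    """M0 (sketch): surplus of registered over extracted references."""
    return max(0, len(registered) - len(extracted))

def m1_after_last_item(registered: list, extracted: list, threshold: float = 0.8) -> list:
    """M1 (one reading): flag entries registered after the last extracted item."""
    if not extracted:
        return []  # nothing to anchor on: inconclusive
    last = extracted[-1]
    for idx in range(len(registered) - 1, -1, -1):
        if similarity(registered[idx], last) >= threshold:
            return registered[idx + 1:]
    return []  # last extracted item not found in the registered list

def m2_unmatched(registered: list, text_lines: list, threshold: float = 0.8) -> list:
    """M2 (sketch): flag registered entries with no close match in the full text."""
    return [
        ref for ref in registered
        if not any(similarity(ref, line) >= threshold for line in text_lines)
    ]

# Made-up example data, not taken from the paper.
registered = [
    "Smith J. Deep learning for citation analysis. 2020",
    "Doe A. Metadata quality in Crossref. 2021",
    "Roe B. Unrelated study of soil samples. 2023",
]
extracted = [
    "Smith, J.: Deep learning for citation analysis (2020)",
    "Doe, A.: Metadata quality in Crossref (2021)",
]
text_lines = ["1 Introduction", "We study citation data."] + extracted

print(m0_surplus(registered, extracted))
print(m1_after_last_item(registered, extracted))
print(m2_unmatched(registered, text_lines))
```

Matching against whole lines is a simplification: references in a real PDF often span several lines, so a practical version would need to match against windows of the text rather than single lines.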

Results and Observations

In the IJISRT dataset, the research found a consistent pattern: the sneaked references predominantly benefit DOIs within the same journal. The references were added between March and November 2024. Most beneficiary papers gain only a few sneaked citations (often a single one), but outliers exist, with one DOI receiving over 6,000 undue citations.
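
A distribution like this one can be obtained by tallying sneaked references per cited DOI. The sketch below assumes the detection step yields (citing DOI, cited DOI) pairs; that input format, the function name, and the DOIs shown are hypothetical.

```python
from collections import Counter

def citations_per_doi(sneaked_pairs):
    """Count how many sneaked references point at each cited DOI."""
    return Counter(cited for _citing, cited in sneaked_pairs)

# Made-up (citing DOI, cited DOI) pairs, for illustration only.
sneaked_pairs = [
    ("10.0000/citing.1", "10.0000/target.a"),
    ("10.0000/citing.2", "10.0000/target.a"),
    ("10.0000/citing.3", "10.0000/target.a"),
    ("10.0000/citing.1", "10.0000/target.b"),
    ("10.0000/citing.4", "10.0000/target.c"),
]

counts = citations_per_doi(sneaked_pairs)

# The biggest beneficiary and how many DOIs gained exactly one citation.
top_doi, top_count = counts.most_common(1)[0]
single_gain = sum(1 for n in counts.values() if n == 1)
print(top_doi, top_count, single_gain)
```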

The exploratory analysis over a broader dataset found 4,172,499 cases in which the number of references extracted from the PDF did not match the number registered with Crossref. However, formatting inconsistencies and supplementary references limited the matching process, so a count mismatch alone does not establish sneaked references, and more refined detection methods are needed.
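
A count comparison of this kind might look like the following sketch. It assumes each registered record has the shape of a Crossref REST API works response, with a "message" object holding a "reference" list and a "reference-count" field, and that extracted counts are already available per DOI. The helper names and the sample records are hypothetical.

```python
def registered_reference_count(record: dict) -> int:
    """Number of references in a Crossref-style record (assumed shape)."""
    message = record.get("message", record)
    refs = message.get("reference")
    if refs is not None:
        return len(refs)
    # Fall back to the declared count when the list itself is absent.
    return int(message.get("reference-count", 0))

def find_count_mismatches(records: dict, extracted_counts: dict) -> dict:
    """Map each DOI to (registered - extracted) wherever the counts differ."""
    mismatches = {}
    for doi, record in records.items():
        if doi not in extracted_counts:
            continue  # no PDF extraction available for this DOI
        diff = registered_reference_count(record) - extracted_counts[doi]
        if diff != 0:
            mismatches[doi] = diff
    return mismatches

# Made-up records, for illustration only.
records = {
    "10.0000/paper.1": {"message": {"reference": [{"key": "r1"}, {"key": "r2"}, {"key": "r3"}]}},
    "10.0000/paper.2": {"message": {"reference-count": 2}},
    "10.0000/paper.3": {"message": {"reference": [{"key": "r1"}]}},
}
extracted_counts = {"10.0000/paper.1": 2, "10.0000/paper.2": 2, "10.0000/paper.3": 4}

print(find_count_mismatches(records, extracted_counts))
```

A positive difference marks a registered surplus, the pattern expected for sneaked references. A negative one more likely points to extraction noise or incomplete registration.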

Implications and Future Directions

This work raises significant concerns about the integrity of citation-based metrics such as the Journal Impact Factor and the $h$-index, which are pivotal in academic assessment. The paper calls for vigilant monitoring and the development of robust detection mechanisms to mitigate citation gaming. Proposed interventions include enhancing editorial tools for metadata submissions, deploying automated deduplication post-registration, and leveraging tools like Grobid for systematic cross-checking.

Future research should extend the detection of sneaked references to the wider literature and investigate the patterns and broader implications that emerge. This includes checking problematic metadata against major scientometric platforms to ascertain how far bibliometric scores are affected.

The paper concludes by emphasizing collaborative rectification efforts and urging Crossref members to ensure accurate metadata registration. Crossref has begun revoking memberships of organizations involved in such manipulative practices. Through continued vigilance and methodological advances, the academic community can better guard against citation manipulation and preserve the reliability of academic metrics.