- The paper presents a robust open citation index integrating over 2 billion unique citation links from multiple authoritative sources.
- It introduces a novel deduplication mechanism that maps diverse bibliographic identifiers to globally unique OMIDs, ensuring data integrity.
- The index supports transparent research with user-friendly access via SPARQL endpoints, REST APIs, and intuitive web interfaces.
The OpenCitations Index
The paper "The OpenCitations Index" by Ivan Heibi, Arianna Moretti, Silvio Peroni, and Marta Soricetti describes the OpenCitations Index, an infrastructure developed by OpenCitations. The index is a large, openly accessible collection of citation data, processed through a documented workflow and intended to support transparency and reproducibility in academic research.
Core Contributions
The OpenCitations Index integrates citation data from multiple authoritative sources, including Crossref, NIH Open Citation Collection (NIH-OCC), DataCite, OpenAIRE, and the Japan Link Center (JaLC). As of July 2024, the index includes over 2 billion unique citation links, providing a vast resource for the academic community.
Deduplication Mechanism
A central methodological contribution of the paper is the deduplication mechanism, which addresses the problem that different sources identify the same bibliographic entity with different identifiers. The process involves preprocessing source data, managing bibliographic metadata, and generating new citation data. Because every entity is resolved to a single identifier before citations are written, each citation integrated into the OpenCitations Index is uniquely identified, which maintains data integrity across disparate data sources.
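The idea of collapsing several external identifiers into one entity can be illustrated with a minimal sketch. It assumes that identifiers appearing together in one source record denote the same entity, and it uses a union-find structure to merge them; this is an illustrative technique, not necessarily the one the authors implement. The function name `assign_omids` and the `omid:br/<n>` strings are hypothetical.

```python
from itertools import count


def assign_omids(records):
    """Group identifiers that co-occur in a record and give each group one ID.

    `records` is a list of identifier sets, one per source record, for
    example {"doi:10.1/a", "pmid:1"}. The "omid:br/<n>" values produced
    here are illustrative placeholders, not real OMIDs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for ids in records:
        ids = sorted(ids)
        if not ids:
            continue
        find(ids[0])  # register records that carry a single identifier
        for other in ids[1:]:
            union(ids[0], other)

    counter = count(1)
    omid_of_root = {}
    mapping = {}
    for ident in sorted(parent):
        root = find(ident)
        if root not in omid_of_root:
            omid_of_root[root] = f"omid:br/{next(counter)}"
        mapping[ident] = omid_of_root[root]
    return mapping
```

With the records `{"doi:10.1/a", "pmid:1"}` and `{"pmid:1", "pmcid:PMC9"}`, all three identifiers end up under the same placeholder ID, because the shared PMID links the two records.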
Methodological Workflow
The paper describes a three-stage workflow for ingesting citation data:
- Source Preprocess:
  - Extraction of data from original sources.
  - Production of CSV tables with bibliographic metadata and citation data.
- Meta Process:
  - Mapping external persistent identifiers to a globally unique identifier (OMID).
  - Integration with the OpenCitations Meta collection to deduplicate entities.
- Index Process:
  - Conversion of citation links to an OMID-to-OMID format.
  - Generation of comprehensive datasets and updating the OpenCitations Index graph database.
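The Index Process step, converting citation links to an OMID-to-OMID format, can be sketched as follows. The sketch assumes an identifier-to-OMID lookup table is already available from the Meta Process; the function name `to_omid_citations` and the sample identifiers are hypothetical and not taken from the paper.

```python
def to_omid_citations(source_links, id_to_omid):
    """Rewrite (citing_id, cited_id) pairs as OMID pairs and drop duplicates.

    Links with an endpoint that has no OMID yet are returned separately
    instead of being silently discarded.
    """
    seen = set()
    converted, unresolved = [], []
    for citing, cited in source_links:
        try:
            pair = (id_to_omid[citing], id_to_omid[cited])
        except KeyError:
            unresolved.append((citing, cited))
            continue
        if pair not in seen:  # the same citation may arrive from two sources
            seen.add(pair)
            converted.append(pair)
    return converted, unresolved
```

If one source reports a citation between two DOIs and another reports the same citation with a PMID on the citing side, both links map to the same OMID pair and only one citation is kept.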
Data Representation and Provenance
All citation data are modeled according to the OpenCitations Data Model (OCDM), which uses Semantic Web technologies. The OCDM represents citations as first-class entities with detailed metadata, including the citing and cited entities, citation creation date, citation timespan, and type of citation (e.g., author self-citation, journal self-citation).
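To make the notion of a citation as a first-class entity concrete, here is a minimal sketch of a citation record carrying the metadata listed above. It is deliberately simplified: dates are reduced to publication years, the creation date is assumed to be the citing work's publication year, and the timespan is written as a year-only ISO 8601 duration. The class and field names are hypothetical and do not reproduce OCDM property names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    omid: str
    year: int
    authors: frozenset
    venue: str


@dataclass(frozen=True)
class Citation:
    citing: str
    cited: str
    creation_year: int
    timespan: str
    author_self_citation: bool
    journal_self_citation: bool


def make_citation(citing: Work, cited: Work) -> Citation:
    """Derive citation metadata from the two works it connects."""
    delta = citing.year - cited.year
    sign = "-" if delta < 0 else ""  # a work can cite a later-dated work
    return Citation(
        citing=citing.omid,
        cited=cited.omid,
        creation_year=citing.year,
        timespan=f"{sign}P{abs(delta)}Y",
        author_self_citation=bool(citing.authors & cited.authors),
        journal_self_citation=citing.venue == cited.venue,
    )
```

Treating the citation as its own object, rather than as a bare edge between two works, is what lets attributes such as the timespan or the self-citation flags be stored and queried directly.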
Provenance and change tracking are integral to the dataset, ensuring transparency and traceability. For each tracked entity, the index records when a version became valid and when it was invalidated, the responsible agent, the primary data source, and the update query describing the change. Additionally, the dataset is described using the VoID and DCAT vocabularies, enhancing interoperability.
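The provenance fields listed above can be pictured as a chain of snapshots, where recording a change closes the previous snapshot and opens a new one. The following is a rough sketch under that assumption; the `Snapshot` class, its field names, and `record_change` are hypothetical, and the sample values are invented for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    entity: str
    generated_at: str                     # when this version became valid
    responsible_agent: str
    primary_source: str
    update_query: Optional[str] = None    # change that produced this version
    invalidated_at: Optional[str] = None  # None while the version is current


def record_change(history, *, when, agent, source, update_query):
    """Close the latest snapshot and append a new one; returns a new list."""
    if not history:
        raise ValueError("history must contain at least a creation snapshot")
    latest = history[-1]
    closed = replace(latest, invalidated_at=when)
    new = Snapshot(latest.entity, when, agent, source, update_query)
    return history[:-1] + [closed, new]
```

Because snapshots are never overwritten, the state of an entity at any past moment can be recovered by finding the snapshot whose validity interval contains that moment.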
Access and Usage
The OpenCitations Index can be accessed through several services:
- SPARQL Endpoint: Allows complex queries using SPARQL.
- REST API: Provides a straightforward way to access data programmatically.
- Web Interfaces: Includes tools like YASGUI, OSCAR, and LUCINDA for searching, querying, and browsing data.
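As an illustration of the SPARQL access route, the sketch below builds a query for the citations pointing at a given entity and encodes it as a GET request in the style of the SPARQL 1.1 Protocol. It assumes citations are typed and linked with CiTO terms (`cito:Citation`, `cito:hasCitingEntity`, `cito:hasCitedEntity`); the endpoint URL and entity IRI in the usage note are placeholders, not the service's real addresses, and no request is actually sent.

```python
from urllib.parse import urlencode

CITO = "http://purl.org/spar/cito/"


def citations_query(cited_iri, limit=10):
    """Return a SPARQL query listing citations whose cited entity is `cited_iri`."""
    return (
        f"PREFIX cito: <{CITO}>\n"
        "SELECT ?citation ?citing WHERE {\n"
        "  ?citation a cito:Citation ;\n"
        "            cito:hasCitingEntity ?citing ;\n"
        f"            cito:hasCitedEntity <{cited_iri}> .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def sparql_get_url(endpoint, query):
    """Encode the query as the `query` parameter of a GET request."""
    return f"{endpoint}?{urlencode({'query': query})}"
```

For example, `sparql_get_url("https://example.org/sparql", citations_query("https://example.org/br/1", 5))` yields a URL that any HTTP client could fetch; the actual endpoint address and entity IRIs should be taken from the OpenCitations documentation.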
The index is released under a CC0 waiver, so the data can be freely used, transformed, and integrated into other systems without restriction.
The OpenCitations Index has been adopted by a range of projects in the academic and research communities. Initiatives that use the index include OpenAIRE-Nexus, GraspOS, B!SON, PURE Suggest, Open Access Helper, ORBi, CHERRY, and StabiKat. These projects use the citation data for research assessment, journal recommendation, open access discovery, and bibliometric analysis.
Future Directions
Future developments aim to improve the quality of the data in the OpenCitations Index. Planned initiatives include the implementation of HERITRACE, a semantic data management system for human curation of citation data, and the integration of machine learning techniques for author disambiguation.
In conclusion, the OpenCitations Index represents a robust, meticulously curated collection of open citation data. Its comprehensive methodology, extensive dataset, and wide-ranging accessibility significantly contribute to advancing open scholarship, enabling transparent and reproducible research practices in the global academic community.