OpenCitations Index

Updated 2 June 2026

The OpenCitations Index is an openly licensed, machine-readable repository of scholarly citation relationships, built on Semantic Web standards such as RDF and CiTO.
It aggregates and normalizes citation data from multiple sources, including Crossref, NIH-OCC, and DataCite, through a federated pipeline that deduplicates citations and records detailed provenance.
The Index offers APIs, SPARQL endpoints, and bulk data dumps to enhance research transparency, reproducibility, and accessibility in bibliometrics worldwide.

The OpenCitations Index is an openly licensed, machine-readable repository of scholarly citation relationships, curated by the independent non-profit organization OpenCitations. Conceived to provide an open, standards-based alternative to proprietary citation indexes, the OpenCitations Index publishes citation data as Linked Open Data using Semantic Web technologies. The resource advances transparency, reproducibility, and accessibility in bibliometrics and research assessment by exposing the underlying citation graph—including both provenance and rich metadata—without license restrictions on reuse (Heibi et al., 2024, Peroni et al., 2019).

1. Conceptual Model and Data Representation

Each entry in the OpenCitations Index is a citation treated as a first-class data entity, distinct from legacy triple-only models. Citations are assigned persistent Open Citation Identifiers (OCI) and described formally in RDF using the OpenCitations Data Model (OCDM). The OCDM reifies citations as instances of cito:Citation (from the Citation Typing Ontology, CiTO), linking two bibliographic entities (usually identified by DOIs, PMIDs, or ISBNs) with additional properties such as citation creation date (cito:hasCitationCreationDate), citation timespan (cito:hasCitationTimeSpan), and characterisation (e.g., self-citation type) (Heibi et al., 2024, Heibi et al., 2019, Peroni et al., 2019).

Example RDF Turtle representation:

@prefix cito: <http://purl.org/spar/cito/> .
@prefix datacite: <http://purl.org/spar/datacite/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<oci:06101801781-06180334099>
    a cito:Citation ;
    cito:hasCitingEntity <omid:br/06101801781> ;
    cito:hasCitedEntity  <omid:br/06180334099> ;
    cito:hasCitationCreationDate "2021-03-10"^^xsd:date ;
    cito:hasCitationTimeSpan "P6Y0M1D"^^xsd:duration .

The data model leverages the SPAR suite of ontologies (CiTO, FaBiO, BiRO, C4O, PROV-O, DataCite, Dublin Core) for full semantic expressiveness (Peroni et al., 2019).
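The citation timespan in the Turtle example can be read as the interval between the publication dates of the cited and citing works, written as an xsd:duration. The sketch below shows one way to derive such a value in Python. It is an illustration rather than OpenCitations' own code: the function names are hypothetical, and the cited work's date (2015-03-09) is an assumption chosen so that the result matches the P6Y0M1D value shown above.

```python
import calendar
from datetime import date


def _add_months(d, months):
    """Shift a date by whole months, clipping the day to the target month's length."""
    year, month0 = divmod(d.year * 12 + (d.month - 1) + months, 12)
    day = min(d.day, calendar.monthrange(year, month0 + 1)[1])
    return date(year, month0 + 1, day)


def citation_timespan(citing_pub, cited_pub):
    """Return the interval between two publication dates as PnYnMnD.

    A leading '-' marks the case where the citing work predates the cited one.
    """
    sign = ""
    start, end = cited_pub, citing_pub
    if end < start:
        start, end = end, start
        sign = "-"
    total_months = (end.year - start.year) * 12 + (end.month - start.month)
    if _add_months(start, total_months) > end:
        total_months -= 1  # the final month has not fully elapsed
    days = (end - _add_months(start, total_months)).days
    years, months = divmod(total_months, 12)
    return f"{sign}P{years}Y{months}M{days}D"


# Citing work dated 2021-03-10, cited work assumed to be dated 2015-03-09.
print(citation_timespan(date(2021, 3, 10), date(2015, 3, 9)))
```

Working in whole months first and counting leftover days afterwards keeps the result stable around month ends. Publication dates known only to the year or month would need separate handling.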

2. Aggregation Pipeline and Sources

The OpenCitations Index integrates citation data from multiple open sources through a federated pipeline. As of July 2024, primary sources include Crossref, the NIH Open Citation Collection (NIH-OCC), DataCite, OpenAIRE ScholeXplorer, and the Japan Link Center (JaLC). Each source is periodically harvested, its metadata normalized, and its citation pairs deduplicated (Heibi et al., 2024).

Data Workflow

Source Preprocessing: Source-specific parsers extract citation pairs and normalize all identifiers (DOIs to lowercase, ISSNs to correct format, PMIDs syntax-checked). Each source yields a bibliographic metadata table and a citation pairs table.
Meta Layer: All bibliographic entities are consolidated in OpenCitations Meta, which mints a globally unique OpenCitations Meta Identifier (OMID) for each distinct entity. This step enables cross-source deduplication even if different source-specific identifiers are used for the same work.
Index Construction: All citations are mapped as (OMID₁, OMID₂) pairs. Duplicate citations across sources (same citing and cited OMID) are collapsed, with provenance snapshots tracked per occurrence. Unique citations are assigned OCIs of the form oci:NNNNNNNNNN-MMMMMMMMMM, encoding the involved OMIDs (Heibi et al., 2024).
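The three workflow stages above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the production pipeline: ToyMeta, normalize_id, and build_index are hypothetical names, the OMID numbering is made up, and conflicting identifier matches (which a real Meta layer would have to merge) are ignored.

```python
from collections import defaultdict


def normalize_id(identifier):
    """Lowercase the scheme, and also the value for DOIs (which are case-insensitive)."""
    scheme, _, value = identifier.strip().partition(":")
    scheme = scheme.lower()
    value = value.strip()
    return f"{scheme}:{value.lower() if scheme == 'doi' else value}"


class ToyMeta:
    """Stand-in for the Meta layer: maps every known identifier to one OMID."""

    def __init__(self):
        self._omid_by_id = {}
        self._counter = 0

    def resolve(self, identifiers):
        ids = [normalize_id(i) for i in identifiers]
        known = [self._omid_by_id[i] for i in ids if i in self._omid_by_id]
        if known:
            omid = known[0]  # simplification: no merging of conflicting matches
        else:
            self._counter += 1
            omid = f"br/06{self._counter:09d}"  # made-up numbering scheme
        for i in ids:
            self._omid_by_id.setdefault(i, omid)
        return omid


def build_index(pairs, meta):
    """Collapse (source, citing_ids, cited_ids) records into one entry per OCI."""
    provenance = defaultdict(list)
    for source, citing_ids, cited_ids in pairs:
        citing = meta.resolve(citing_ids).split("/", 1)[1]
        cited = meta.resolve(cited_ids).split("/", 1)[1]
        provenance[f"oci:{citing}-{cited}"].append(source)
    return dict(provenance)


pairs = [
    ("crossref", ["DOI:10.1000/ABC"], ["doi:10.1000/xyz"]),
    ("nih-occ", ["pmid:111", "doi:10.1000/abc"], ["pmid:222", "doi:10.1000/xyz"]),
]
print(build_index(pairs, ToyMeta()))
```

Both records describe the same citation, once by DOI alone and once by PMID plus DOI, so the sketch yields a single OCI whose provenance list names both sources.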

3. Scale, Coverage, and Growth

The OpenCitations Index has exhibited rapid expansion. As of July 2024:

The Index contains over 2 billion unique citation links among 91,380,000 unique bibliographic entities (≈72.8M appearing as citing entities, ≈74.2M as cited entities).
Source contributions are:
- Crossref: 1.60 billion total, 1.14 billion exclusive
- NIH-OCC: ~696 million total, 230 million exclusive
- DataCite: ~170 million total and exclusive
- OpenAIRE: 14.6 million total, 4.46 million exclusive
- JaLC: ~397,000 total, 395,000 exclusive

The average monthly growth is approximately 30 million newly ingested citation links, with quarterly dataset releases and incremental updates via the APIs and SPARQL endpoints (Heibi et al., 2024).

Coverage analyses for national research information systems (IRIS instances in Italy) report that, across six universities, OpenCitations Meta (the entity backbone of the Index) covers on average over 40% of CRIS-registered outputs—on par with, or exceeding, the proprietary Scopus and Web of Science for STEM outputs. However, coverage remains lower (<10%) for publication types prevalent in the Social Sciences and Humanities (SSH) (Andreose et al., 31 Jan 2026, Andreose et al., 10 Jan 2025).

Institution	CRIS Records	OC Meta Matches	Coverage (%)
UNIBO	402,505	165,500	42.7
UNIPD	416,547	161,843	38.8
UNIMI	381,525	176,262	48.1

Key publication type coverage (averaged): proceedings (82.2%), book chapters (70.4%), journal articles (67.5%), monographs/critical editions (<10%) (Andreose et al., 31 Jan 2026).

4. Access, APIs, and Integration

OpenCitations Index data is accessible via multiple standard interfaces:

SPARQL Endpoint: Enables complex graph queries. Example: retrieving all works cited by a particular OMID or DOI.
RESTful APIs: Modular endpoints for fetching citation lists and individual citation metadata, and for performing bulk queries. Responses are available in JSON, Turtle, CSV, N-Triples, and Scholix formats.
Bulk Dataset Dumps: Freely downloadable under a CC0 public-domain waiver. Provenance and change-tracking snapshots are provided in all major releases.
Web Interfaces: YASGUI for SPARQL exploration, OSCAR and LUCINDA for browsing and visual citation network exploration (Heibi et al., 2024, Heibi et al., 2019, Peroni et al., 2019).

Third-party tools directly exploiting the Index include VOSviewer (for citation network visualization), Citation Gecko (literature-mapping), OCI Graphe (interactive citation graphs), and reference management plugins for tools like Zotero (Heibi et al., 2019).
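As an illustration of the SPARQL route, the sketch below builds a query for all works cited by a given entity and encodes it as a GET request following the SPARQL 1.1 Protocol. The endpoint address and the function names are assumptions, and the entity IRI reuses the form shown in the Turtle example; the live service may expect a different IRI form, so both should be checked against the current OpenCitations documentation.

```python
import re
from urllib.parse import urlencode

# Assumed endpoint address, not taken from this article.
SPARQL_ENDPOINT = "https://opencitations.net/index/sparql"

# Characters that must not appear inside an <...> IRI reference in SPARQL.
_UNSAFE_IRI_CHARS = re.compile(r'[<>"{}|\\^`\s]')


def references_query(citing_iri, limit=100):
    """Build a SPARQL query listing the entities cited by `citing_iri`."""
    if _UNSAFE_IRI_CHARS.search(citing_iri):
        raise ValueError(f"not a safe IRI: {citing_iri!r}")
    return (
        "PREFIX cito: <http://purl.org/spar/cito/>\n"
        "SELECT ?citation ?cited WHERE {\n"
        "  ?citation a cito:Citation ;\n"
        f"    cito:hasCitingEntity <{citing_iri}> ;\n"
        "    cito:hasCitedEntity ?cited .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def request_url(query, endpoint=SPARQL_ENDPOINT):
    """Encode the query as the `query` parameter of a GET request."""
    return f"{endpoint}?{urlencode({'query': query})}"


query = references_query("omid:br/06101801781")
print(query)
print(request_url(query))
```

The resulting URL can be fetched with any HTTP client; requesting the application/sparql-results+json media type in the Accept header is the usual way to obtain JSON results from a SPARQL endpoint.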

5. Data Quality Assurance and Curation

The OpenCitations Index is governed by a robust data-quality regime. Core components include:

Structured Data Model: OCDM enforces that every citation entity links exactly one citing and one cited bibliographic resource, each with at least one persistent identifier.
Pre-Ingestion Validation: The oc_validator tool checks for well-formedness, ID syntax and existence, and semantic compliance on all metadata and citation batches prior to ingestion (including crowdsourced input via CROCI).
Post-Ingestion Monitoring: The oc_monitor tool schedules weekly SPARQL-driven diagnostics—detecting, quantifying, and trending issues such as duplicate IDs, missing fields, inconsistent types, and orphaned references. Up-to-date dashboards are exposed publicly.
Error Reporting: Errors and warnings are always traceable to their dataset location (row, field, index), and error types are precisely enumerated. All validation and monitoring results are archived for transparency (Heibi et al., 16 Apr 2025).

Only about 1% of bibliographic resources or agent entities in OpenCitations Meta have ID duplication issues; the results of page-range formatting, self-citation, and ID-existence checks are systematically logged (Heibi et al., 16 Apr 2025).
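The idea of location-aware error reporting can be illustrated with a small syntax checker. The sketch below is not the oc_validator tool and does not reproduce its interface or output format: the function name, the error-type labels, and the simplified DOI and PMID patterns are all assumptions made for the example. It covers syntax only, not the existence or semantic checks described above.

```python
import re

# Simplified patterns for illustration; real identifier rules are stricter.
PATTERNS = {
    "doi": re.compile(r"10\.\d{4,9}/\S+"),
    "pmid": re.compile(r"[1-9]\d*"),
}


def validate_rows(rows):
    """Check ID syntax and report each problem with its row, field, and index."""
    errors = []
    for row_no, row in enumerate(rows):
        for field in ("citing_id", "cited_id"):
            ids = row.get(field, "").split()  # a field may hold several IDs
            if not ids:
                errors.append({"row": row_no, "field": field,
                               "index": None, "type": "missing_id"})
            for idx, ident in enumerate(ids):
                scheme, _, value = ident.partition(":")
                pattern = PATTERNS.get(scheme)
                if pattern is None:
                    kind = "unknown_scheme"
                elif not pattern.fullmatch(value):
                    kind = "invalid_syntax"
                else:
                    continue  # identifier is well-formed
                errors.append({"row": row_no, "field": field,
                               "index": idx, "type": kind})
    return errors


rows = [
    {"citing_id": "doi:10.1000/abc", "cited_id": "pmid:123"},
    {"citing_id": "doi:abc pmid:0", "cited_id": ""},
]
for error in validate_rows(rows):
    print(error)
```

Recording the row, field, and position of each offending identifier lets a submitter correct a batch without re-inspecting the whole file.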

6. Crowdsourced Extension: CROCI and Community Submission

To address persistent gaps, particularly for references that publishers keep closed and for citation types not deposited in Crossref, the Crowdsourced Open Citations Index (CROCI) permits direct submission of citation data by scholars, publishers, and librarians. Community-contributed citation triples (citing_id, cited_id, publication_dates) are checked for well-formedness and for syntactic conformance to the accepted identifier patterns, then deduplicated against existing Index contents (Heibi et al., 2019, Massari et al., 2022).

CROCI is governed by a fully open, CC0-licensed model; deduplication rules ensure that only unique citations across the combined set (COCI plus CROCI) are retained, maximizing network coverage.

7. Current Limitations and Future Directions

Despite rapid growth, several limitations persist:

Coverage gaps for SSH outputs are primarily due to the paucity of persistent identifiers (ISBNs, DOIs) in those publication types and publisher reluctance to deposit open reference data.
Citations involving works that lack DOIs or other structured identifiers remain underrepresented. Matching pipelines for incomplete references (using title, author, year, etc.) achieve high precision (100% precision at 75.7% recall in recent gold-standard tests) but need further development to raise recall (Guenci et al., 23 Nov 2025).
Type, venue, and date metadata inconsistencies remain nontrivial, especially when integrating multiple sources with heterogeneous schemas.
Because the Index is built from periodic snapshots rather than near-real-time ingestion, its bibliographic coverage trails the latest publication records by several months (Andreose et al., 31 Jan 2026, Andreose et al., 10 Jan 2025).

Enhancements under implementation or consideration include improved matching for non-DOI citations, ingestion of additional identifier schemes (ARK, Handle, ISNI), integration of machine-learned title/author matching, and tighter alignment with open research information standards (e.g., EOSC Semantic KG Interoperability Framework).

References

(Heibi et al., 2024) The OpenCitations Index
(Peroni et al., 2019) OpenCitations, an infrastructure organization for open scholarship
(Heibi et al., 2019) COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations
(Andreose et al., 31 Jan 2026) Assessing and Comparing the Coverage of Publications of Italian Universities in OpenCitations
(Heibi et al., 16 Apr 2025) Validating and monitoring bibliographic and citation data in OpenCitations collections
(Andreose et al., 10 Jan 2025) Analysing the coverage of the University of Bologna's publication metadata in an existing source of open research information
(Guenci et al., 23 Nov 2025) A pipeline for matching bibliographic references with incomplete metadata: experiments with Crossref and OpenCitations
(Heibi et al., 2019) Crowdsourcing open citations with CROCI -- An analysis of the current status of open citations, and a proposal
(Massari et al., 2022) How to structure citations data and bibliographic metadata in the OpenCitations accepted format