ORKG: Open Research Knowledge Graph

Updated 10 January 2026
  • ORKG is a semantic infrastructure that structures scholarly knowledge as interlinked RDF triples enriched with provenance for systematic analysis.
  • It employs a microservice architecture with automated data ingestion, SPARQL endpoints, and AI-based extraction to enable dynamic, FAIR-compliant research comparisons.
  • By integrating manual curation with automated predicate recommendation, ORKG enhances reproducibility, transparency, and scalability in scholarly communications.

The Open Research Knowledge Graph (ORKG) is an end-to-end infrastructure for capturing, curating, and disseminating scholarly knowledge as a fine-grained, semantically structured, machine-actionable graph. Moving beyond document-centric paradigms, ORKG encodes the problems, methods, results, datasets, and artifacts described in scientific literature as interlinked RDF-style triples enriched with provenance. This enables systematic quantitative and qualitative comparison, advanced search, AI-based extraction, and FAIR-compliant (Findable, Accessible, Interoperable, Reusable) publication of research content (Jaradeh et al., 2022; Brack et al., 2020; Brack et al., 2021).

1. Motivation and Fundamental Concepts

Contemporary scholarly communication remains predominantly document-based, impeding machine processability and inhibiting synthesis, reproducibility, and meta-analysis. ORKG addresses this by exposing a semantic substrate underlying research articles' key facets—problems addressed, methods employed, materials used, datasets analyzed, results obtained—structuring them as granular, interlinked statements (Jaradeh et al., 2022, Runnwerth et al., 2020).

Conceptually, ORKG formalizes the knowledge graph as $G = (R, P, S, A)$, where $R$ is a set of resources (entities), $P$ is a set of properties, $S \subseteq R \times P \times R$ is the set of statements, and $A: S \to \mathcal{P}(M)$ is an annotation mapping assigning metadata from $M$ (e.g., provenance, timestamps, editors) to statements. The minimal schema for research contributions is captured as:

$$\text{ResearchContribution} \equiv \exists\,\text{hasProblem}.\,\text{ResearchProblem} \;\sqcap\; \exists\,\text{hasMethod}.\,\text{ResearchMethod} \;\sqcap\; \exists\,\text{hasResult}.\,\text{ResearchResult}$$

(Jaradeh et al., 2022)
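
As a purely illustrative reading of this formalization (not ORKG's actual implementation), the Python sketch below models statements over resources and properties together with an annotation mapping that attaches provenance metadata to each statement; the class names and orkg: identifiers are placeholders.

```python
# Illustrative sketch of G = (R, P, S, A): resources, properties, statements,
# and an annotation mapping attaching provenance metadata to each statement.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Statement:
    subject: str     # resource URI from R
    predicate: str   # property URI from P
    obj: str         # resource URI (or literal) from R


@dataclass
class Graph:
    statements: set = field(default_factory=set)
    annotations: dict = field(default_factory=dict)  # A: S -> P(M)

    def assert_statement(self, s: Statement, editor: str) -> None:
        """Add a statement and record provenance metadata for it."""
        self.statements.add(s)
        self.annotations[s] = {
            "editor": editor,
            "created_at": datetime.now(timezone.utc).isoformat(),
        }


g = Graph()
g.assert_statement(
    # The orkg: identifiers below are placeholders, not real ORKG IDs.
    Statement("orkg:Contribution1", "orkg:hasProblem", "orkg:QuestionAnswering"),
    editor="curator-42",
)
```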

2. Architecture and System Components

ORKG is architected as a microservice ecosystem with the following logical layers (Jaradeh et al., 2022, Karras, 2024, Karras et al., 2021):

  • Data Ingestion/Curation Interface: Provides DOI-based lookup (via CrossRef), domain classification dropdowns, and a wizard for guided entry of contributions. Metadata acquisition is automated; semantic content is community-curated via structured templates, optionally using SHACL node shapes for validation and structure enforcement.
  • Backend Knowledge Graph Store: An RDF triplestore (e.g., Blazegraph, Virtuoso) or property-graph DB (e.g., Neo4j) houses all entities, properties, and triples. Each node/resource/statement is assigned a persistent, dereferenceable URI.
  • Auxiliary Indexes: An inverted index (e.g., Elasticsearch) accelerates autocompletion, lookup, and faceted search.
  • Comparison and Query Subsystems: Automated "State-of-the-Art" comparisons, visualizations, and a robust SPARQL endpoint support advanced, graph-pattern queries; REST and JSON-LD endpoints enable programmatic interaction.
  • Versioning and Provenance: Every contribution is version-controlled; immutable snapshots are produced when publishing, with provenance maintained via explicit provenance chains using the PROV-O ontology (Haris et al., 2022).

By combining a lightweight, expressive schema with UI-driven curation and modular interfaces, ORKG makes both manual and (semi-)automatic enrichment tractable across disciplines.
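
For instance, the SPARQL endpoint mentioned above can be queried programmatically with standard graph patterns. The sketch below is a minimal example; the endpoint URL and the orkgp:P32 ("has research problem") predicate identifier are assumptions that should be checked against the current ORKG documentation.

```python
# Minimal sketch of a graph-pattern query against the public SPARQL endpoint.
import requests

SPARQL_ENDPOINT = "https://orkg.org/triplestore"  # assumed endpoint URL

QUERY = """
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX orkgp: <http://orkg.org/orkg/predicate/>

SELECT ?contribution ?problemLabel WHERE {
  ?contribution orkgp:P32 ?problem .   # "has research problem" (assumed ID)
  ?problem rdfs:label ?problemLabel .
}
LIMIT 10
"""

response = requests.post(
    SPARQL_ENDPOINT,
    data={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["contribution"]["value"], "->", row["problemLabel"]["value"])
```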

3. Data Modeling, Semantics, and FAIR Compliance

ORKG's modeling approach is optimized for both semantic rigor and coverage (Brack et al., 2020, D'Souza et al., 2022, Karras, 2024):

  • RDF/OWL Foundation: Entities (papers, datasets, methods, results) and properties are modeled as first-class RDF resources. Templates based on SHACL restrict and document expected schema patterns for recurring comparison types, e.g., leaderboards (Task–Dataset–Metric–Value) (Kabongo et al., 2023).
  • Provenance & Reusability: Each triple is annotated with creation timestamp and the asserting user. All data is accessible via HTTP(S), published as JSON-LD/Turtle/CSV, and licensed CC-BY or CC0.
  • Ontology Integration: Established ontologies (e.g., BioAssay Ontology, schema.org/Dataset, QUDT) are imported or aligned as needed for domain coverage (Anteghini et al., 2020, Ahmad et al., 2024).
  • FAIR Principles: ORKG explicitly implements all key FAIR components (Karras, 2024, Oelen et al., 2020):
    • Findable: Dereferenceable URIs, indexed search
    • Accessible: Open APIs (SPARQL, REST), persistent IDs
    • Interoperable: RDF, OWL, import of domain ontologies
    • Reusable: Provenance, versioning, explicit templates
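
As an illustration of the Task–Dataset–Metric–Value template pattern described above, the following rdflib sketch builds a leaderboard-style contribution as RDF; the namespace and property names are placeholders rather than ORKG's canonical vocabulary.

```python
# Illustrative Task–Dataset–Metric–Value contribution built with rdflib.
from rdflib import Graph, Literal, Namespace, RDF, XSD

EX = Namespace("https://example.org/orkg-sketch/")  # placeholder namespace

g = Graph()
g.bind("ex", EX)

contribution = EX["contribution/1"]
g.add((contribution, RDF.type, EX.ResearchContribution))
g.add((contribution, EX.hasTask, EX["task/question-answering"]))
g.add((contribution, EX.hasDataset, EX["dataset/SQuAD"]))

evaluation = EX["evaluation/1"]
g.add((contribution, EX.hasEvaluation, evaluation))
g.add((evaluation, EX.hasMetric, EX["metric/F1"]))
g.add((evaluation, EX.hasValue, Literal("93.2", datatype=XSD.decimal)))

print(g.serialize(format="turtle"))
```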

4. Collaborative Curation, Community Maintenance, and Crowdsourcing

ORKG supports and incentivizes crowd-based, community-driven enrichment, aligning with requirements for both high-precision expert curation and scalable population (Karras et al., 2021, Karras, 2024):

  • Curation Workflows: Curators ingest metadata (often by DOI), select or define a template (with SHACL-driven forms), enter or review statements, and publish. Autocompletion favors reuse by suggesting existing resources and predicates, reducing duplications by ~30% (Jaradeh et al., 2022).
  • Provenance and Review: All statements remain community-editable; provenance tags record every edit, and versioning chains allow reversion and update management.
  • Incentives: Social (leaderboards, citable DOIs for comparisons), financial (curation grants), and feedback mechanisms (in-app chat, analytics) drive engagement.
  • Bulk, Automated, and AI-Assisted Curation: ORKG features Python clients and CSV-upload pipelines for batch annotation. Integration with NLP microservices (e.g., SciBERT-based semantifiers, clustering-based property recommenders) expands coverage and reduces manual burden (Anteghini et al., 2020, Oghli et al., 2022, Schaftner, 15 Feb 2025).

The KG-EmpiRE and KG-Leaderboard projects provide empirical evidence on the feasibility of ongoing community curation and the concrete gains in updatability and sustainability for living literature reviews and structured result tracking (Karras, 2024, Kabongo et al., 2023).
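
The bulk and CSV-based pipelines mentioned above can be approximated with a few lines of client code. The sketch below is a rough outline only: the REST paths, payload shapes, and sandbox base URL are assumptions, and real usage requires authentication and the official client or API documentation.

```python
# Illustrative CSV-driven batch ingestion pipeline (paths/payloads assumed).
import csv
import requests

API_BASE = "https://sandbox.orkg.org/api"  # assumed sandbox base URL


def create_resource(session: requests.Session, label: str) -> str:
    """Create a resource via an assumed generic endpoint and return its id."""
    resp = session.post(f"{API_BASE}/resources/", json={"label": label}, timeout=30)
    resp.raise_for_status()
    return resp.json()["id"]


with requests.Session() as session, open("contributions.csv", newline="") as fh:
    for row in csv.DictReader(fh):  # assumed columns: title, problem
        paper_id = create_resource(session, row["title"])
        problem_id = create_resource(session, row["problem"])
        # Linking statements would follow the same pattern via a
        # statements endpoint (omitted here).
        print(f"created {paper_id} with problem {problem_id}")
```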

5. Automation, AI/NLP Pipelines, and Predicate Recommendation

Hybrid curation approaches—combining manual input with machine assistance—are central to ORKG's scalability and semantic alignment (Anteghini et al., 2020, John et al., 14 Apr 2025, Schaftner, 15 Feb 2025, Nechakhin et al., 2024):

  • Entity and Relation Extraction: Named-entity recognition, ontology linking, and SciBERT-based classification enable automated semantification for text-heavy or protocol-rich domains (e.g., bioassays, chemistry).
  • Predicate Recommendation and Clustering: Unsupervised clustering (K-means, agglomerative) over paper embeddings (TF-IDF, SciBERT) yields high-precision predicate group recommendations, fast-tracking vocabulary convergence across 44+ research fields (Oghli et al., 2022); a minimal sketch follows this list. Recommendation quality reaches macro $F_1 > 0.83$, outperforming field-based or topic-baseline selection.
  • Large Language Models: LLMs (GPT-3.5, Llama 2, Mistral), fine-tuned and prompted with advanced techniques (persona setting, Chain-of-Thought, few-shot examples, output constraints), can suggest properties/dimensions for new contributions, quadrupling property-URI match rates to ~40%. Assigning unique URIs and standardized labels to extracted properties enforces Linked Data and FAIR compliance (Schaftner, 15 Feb 2025). However, out-of-the-box LLMs show only partial alignment with expert-created property sets; targeted fine-tuning and prompt engineering lead to significant improvements (Nechakhin et al., 2024).
  • Hybrid Human-Machine Workflows: SciMantify demonstrates a five-stage evolution model, blending human column/cell annotation, property matching, and entity linking with automatic data-type inference and string-similarity matching, achieving order-of-magnitude efficiency gains (John et al., 14 Apr 2025).
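
The following sketch illustrates the clustering-based recommendation idea on a toy corpus: abstracts are clustered with TF-IDF and K-means, and the predicates most frequently used by already-curated papers in the matching cluster are proposed for a new paper. The corpus, predicate lists, and cluster count are illustrative only.

```python
# Toy clustering-based predicate recommendation: cluster abstracts, then
# recommend predicates that curated papers in the same cluster use most.
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = [
    "We benchmark question answering models on SQuAD and report F1 scores.",
    "A transformer model for machine translation evaluated with BLEU.",
    "We study enzyme kinetics and report reaction rate constants.",
    "Protein binding assays measuring IC50 values across compounds.",
]
# Predicates used in already-curated contributions, aligned with `abstracts`.
curated_predicates = [
    ["hasTask", "hasDataset", "hasMetric"],
    ["hasTask", "hasModel", "hasMetric"],
    ["hasOrganism", "hasRateConstant"],
    ["hasAssay", "hasIC50"],
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

new_abstract = "Evaluating reading comprehension systems on a new QA dataset."
cluster = kmeans.predict(vectorizer.transform([new_abstract]))[0]

# Aggregate predicates from curated papers in the same cluster and rank them.
counts = Counter(
    p
    for preds, label in zip(curated_predicates, kmeans.labels_)
    if label == cluster
    for p in preds
)
print("recommended predicates:", [p for p, _ in counts.most_common(3)])
```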

6. Applications, Comparative Exploration, and Interoperability

ORKG's capabilities enable advanced applications in literature analysis, reproducible research, and knowledge integration:

  • Literature Surveys and Leaderboards: Side-by-side comparison modules (e.g., KG-EmpiRE for Requirements Engineering; ORKG-Leaderboards for AI benchmarks) surface structured overviews, statistical trends, and leaderboard plots. SPARQL interfaces and RDF Data Cube exports support programmatic downstream analysis (Karras, 2024, Kabongo et al., 2023).
  • Dataset Semantification and FAIR Dataset Publishing: The ORKG-Dataset content type couples schema.org and QUDT for fine-grained, comparable dataset metadata, linking datasets to publications and evaluations with statistical metrics and benchmarks—enabling advanced discovery and comparison (Ahmad et al., 2024).
  • Interlinking and DOI Registration: Persistent identification is ensured for ORKG Papers and Comparisons with DataCite-minted DOIs; immutable snapshots and provenance chains link successive versions. Metadata propagates automatically to Crossref, OpenAIRE, and ORCID (Haris et al., 2022).
  • Contextual Knowledge Enrichment: Federated GraphQL APIs aggregate metadata, citations, datasets, project funds, topics, and Altmetric data from DataCite, ORCID, OpenAIRE, Semantic Scholar, Wikidata, and others, integrating cross-platform contextual widgets into the exploration UI (Haris et al., 2022).
  • Manual, Template-Driven Modeling in Domain Sciences: For highly structured fields (e.g., operational research, mathematics), domain-specific templates can encode mathematical models, performance metrics, and experimental protocols at a granular level (Runnwerth et al., 2020).
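
As an example of such downstream analysis, the sketch below assumes a comparison has been exported to CSV (one row per contribution, one column per compared property) and computes a simple best-score-per-dataset summary; the file name and column names are placeholders.

```python
# Illustrative analysis of an exported comparison table with pandas.
import pandas as pd

df = pd.read_csv("comparison_export.csv")  # placeholder export file

# Track the best reported score per dataset across contributions
# (column names "F1 score" and "Dataset" are placeholders).
df["score"] = pd.to_numeric(df["F1 score"], errors="coerce")
best_per_dataset = (
    df.dropna(subset=["score"])
      .groupby("Dataset")["score"]
      .max()
      .sort_values(ascending=False)
)
print(best_per_dataset)
```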

7. Limitations, Challenges, and Future Directions

Current limitations include the need for sustained manual curation, varying domain coverage, persistent terminology divergence (predicate synonyms, homonymy), and incomplete automation for nuanced or multi-modal extraction (e.g., performance scores, model code/URLs in leaderboards) (Kabongo et al., 2023, Oghli et al., 2022). Ongoing research and deployment efforts propose:

  • Integration of rule-based and deep-learning extractors for complex property-value patterns (numerical extractions, chemical entities).
  • Dynamic, reputation-based curation and voting/ranking to manage conflicting edits.
  • Broader, community-driven vocabulary consolidation, canonical labeling, and ontology alignment to reduce schema drift.
  • Expansion of domain- and task-specific templates, along with AI-assisted recommendation engines directly embedded in the curation interface.
  • Real-time benchmarking on usability, completeness, and correctness metrics (e.g., coverage ratio, precision/recall), and scaling up interactive, live "review articles" that regenerate from the underlying KG as new data arrives (Karras, 2024, Karras et al., 2023, Schaftner, 15 Feb 2025).

ORKG thus represents a robust, extensible infrastructure unifying manual and automatic curation of scholarly knowledge, advancing machine-actionable, FAIR-compliant, and sustainable research communication. Its extensible architecture and growing ecosystem—including AI-based extraction, semantic enrichment, and contextual discovery—position it as a central node in the evolving landscape of open scholarly infrastructure (Jaradeh et al., 2022, Karras et al., 2023, John et al., 14 Apr 2025).
