
Persistent Knowledge Hub (PKH)

Updated 4 January 2026
  • Persistent Knowledge Hub is a scholarly infrastructure that permanently versions and identifies machine-actionable scientific knowledge, embodying FAIR principles.
  • Implementations like ORKG and KnowledgeHub use graph-based data models and modular pipelines integrating REST/SPARQL interfaces with robust DOI minting systems.
  • PKHs support both automated machine learning and curated annotation workflows that ensure scalable, provenance-rich data interoperability within global scholarly networks.

A Persistent Knowledge Hub (PKH) is defined as a scholarly infrastructure that ensures the permanent, versioned, and globally discoverable identification of machine-actionable, structured knowledge artifacts originating from scientific literature. PKHs operationalize the FAIR data principles (Findable, Accessible, Interoperable, Reusable) by integrating information extraction pipelines, flexible ontologies, persistent identifier (PID) minting, versioned archiving, and interlinking with global scholarly communication systems. PKHs, such as those realized in the Open Research Knowledge Graph (ORKG) (Haris et al., 2022) and KnowledgeHub (Tanaka et al., 2024), enable both the automated and curated extraction of scientific contributions, their citation, comparison, and downstream computational reuse at web scale.

1. System Architecture and Data Models

PKHs typically employ modular, service-oriented architectures centered on graph-based data stores and versioned persistence. In the ORKG implementation, a Neo4j-based Graph Service maintains the live, editable graph of entities—encompassing "Papers," "Contributions," "Comparisons," "Concepts," and "Templates"—and provides REST/JSON and SPARQL/Cypher query interfaces. On publication, subgraphs are snapshotted as immutable JSON objects and archived in a separate versioning service (e.g., PostgreSQL+JSON), each with a monotonically increasing version number and corresponding DOI minted via DataCite's PID infrastructure.
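The snapshot-and-mint step can be sketched as follows. This is a minimal in-memory stand-in for the versioning service described above (PostgreSQL+JSON in ORKG); the class and method names are illustrative, and only the monotone version counter, snapshot immutability, and the 10.48366/R{id} DOI pattern come from the text.

```python
import json
from dataclasses import dataclass, field

@dataclass
class VersionStore:
    """Toy stand-in for ORKG's versioning service (names illustrative)."""
    snapshots: dict = field(default_factory=dict)  # (entity_id, version) -> frozen JSON

    def publish(self, entity_id: str, subgraph: dict):
        """Snapshot a subgraph as an immutable JSON object, assign a
        monotonically increasing version number, and derive a DOI-style
        identifier following the 10.48366/R{id} pattern."""
        version = 1 + max(
            (v for (eid, v) in self.snapshots if eid == entity_id), default=0
        )
        # Store a serialized copy, never a reference to the live graph.
        self.snapshots[(entity_id, version)] = json.dumps(subgraph, sort_keys=True)
        return version, f"10.48366/{entity_id}"

store = VersionStore()
v1, doi = store.publish("R12345", {"title": "Paper A", "contributions": []})
v2, _ = store.publish("R12345", {"title": "Paper A", "contributions": ["C1"]})
```

In the real system the live graph stays editable in Neo4j while only the published slice is frozen; the separation of the two stores is the point of the design.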

KnowledgeHub operationalizes a pipeline-model architecture for scientific knowledge extraction. Core steps include PDF ingestion (GROBID), linguistic preprocessing (Stanza), ontology configuration (user-defined or imported OWL), interactive annotation (BRAT interface), model-based information extraction (span-based NER, BERT-based RC), construction of a property knowledge graph (in Neo4j), embedding-based passage retrieval, and retrieval-augmented QA via integrated LLMs. Persistent storage is partitioned: project metadata/ontologies/annotations in SQLite; the KG in Neo4j; retrieval embeddings in a Chroma vector store (Tanaka et al., 2024).
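The staged composition above can be sketched with stub functions. These stubs merely stand in for the real components (GROBID, Stanza, the IE models, Neo4j, Chroma); only the pipeline ordering mirrors the text, and all field names are hypothetical.

```python
# Hypothetical stage stubs; only the composition order follows the pipeline.
def ingest(pdf_path):
    return {"doc": pdf_path, "text": "LiFePO4 shows high conductivity."}

def preprocess(state):
    state["sentences"] = [s for s in state["text"].split(".") if s.strip()]
    return state

def extract(state):
    state["entities"], state["triples"] = [], []  # filled by NER/RC in reality
    return state

def build_graph(state):
    state["graph"] = {"nodes": state["entities"], "edges": state["triples"]}
    return state

def run_pipeline(pdf_path):
    state = ingest(pdf_path)
    for stage in (preprocess, extract, build_graph):
        state = stage(state)
    return state

result = run_pipeline("paper.pdf")
```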

A canonical (informal) graph representation in these systems is:

G = (V, E, ℓ_V, ℓ_E, A)

where V is the set of nodes (documents, paragraphs, sentences, entities), E ⊂ V × V is the set of directed edges (containment and semantic relations), ℓ_V and ℓ_E map nodes and edges to their labels, and A assigns attributes such as provenance.
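A toy realization of this structure, assuming nothing beyond the definition itself (field and label names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class PropertyGraph:
    """Toy realization of G = (V, E, l_V, l_E, A)."""
    node_labels: dict = field(default_factory=dict)   # l_V: node id -> label
    edge_labels: dict = field(default_factory=dict)   # l_E: (src, dst) -> label
    attrs: dict = field(default_factory=dict)         # A: node or edge -> attributes

    def add_node(self, nid, label, **attributes):
        self.node_labels[nid] = label
        self.attrs[nid] = attributes

    def add_edge(self, src, dst, label, **attributes):
        assert src in self.node_labels and dst in self.node_labels
        self.edge_labels[(src, dst)] = label
        self.attrs[(src, dst)] = attributes

g = PropertyGraph()
g.add_node("d1", "Document", source="paper.pdf")
g.add_node("s1", "Sentence")
g.add_node("e1", "Entity", type="Material")
g.add_edge("d1", "s1", "contains")                    # containment relation
g.add_edge("s1", "e1", "mentions", annotator="alice") # semantic relation + provenance
```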

2. Ontology, Annotation, and Curation Framework

A defining feature of PKHs is their support for flexible, user-extensible ontologies governing the semantic types and relations extractable from text. Users may import external OWL ontologies (e.g., EMMO, BattINFO), select subsets thereof, or construct bespoke schemas by declaring entity and relation types, optionally with role or cardinality constraints (Tanaka et al., 2024). Ontology configurations are serialized (YAML/JSON) and drive both the annotation UI and downstream extraction models.
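A serialized ontology configuration of the kind described might look like the following. The entity and relation type names are hypothetical (not drawn from EMMO or BattINFO), and JSON is used here for the round-trip although the text also mentions YAML:

```python
import json

# Illustrative ontology configuration; all type names are hypothetical.
ontology = {
    "entity_types": ["Material", "Property", "Value"],
    "relation_types": [
        {"name": "hasProperty", "domain": "Material", "range": "Property"},
        {"name": "hasValue", "domain": "Property", "range": "Value"},
    ],
}

# Serialize and round-trip, as a config driving the UI and models would be.
serialized = json.dumps(ontology, indent=2)
restored = json.loads(serialized)
```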

Annotation is conducted via browser-based tools—integrating BRAT for span and relation tagging—which output standoff JSON with entity/relation spans and rich provenance (annotator, timestamp) metadata. These annotations drive the training of machine learning IE models, populate the KG, and support calculation of inter-annotator agreement and type distribution statistics.
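One simple form of the inter-annotator agreement statistics mentioned above is exact-match F1 over typed spans; the sketch below assumes standoff records with hypothetical `start`/`end`/`type` fields:

```python
def span_f1(spans_a, spans_b):
    """Exact-match F1 over (start, end, type) tuples from two annotators."""
    a = {(s["start"], s["end"], s["type"]) for s in spans_a}
    b = {(s["start"], s["end"], s["type"]) for s in spans_b}
    tp = len(a & b)  # spans both annotators marked identically
    if tp == 0:
        return 0.0
    precision, recall = tp / len(b), tp / len(a)
    return 2 * precision * recall / (precision + recall)

alice = [{"start": 0, "end": 7, "type": "Material"},
         {"start": 19, "end": 31, "type": "Property"}]
bob   = [{"start": 0, "end": 7, "type": "Material"},
         {"start": 33, "end": 38, "type": "Value"}]
agreement = span_f1(alice, bob)
```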

PKHs enable iterative curation cycles: users annotate a seed set, models are trained on it and used for auto-annotation (regex- or ML-based), annotators correct the output, and the models are retrained, improving both scalability and annotation quality.

3. Information Extraction, Persistence, and Auto-Annotation

Machine-actionable structuring of knowledge in PKHs is achieved using supervised learning models for both named entity recognition (NER) and relation classification (RC). KnowledgeHub employs a span-based NER model following Yu et al. (2020): first, all candidate spans (i, j) are scored (biaffine), the top-k are retained, and these are then classified via a BERT encoder; RC is performed on pairs of predicted entities in each sentence using a BERT-based classifier. Both models are trained with cross-entropy loss over the appropriate label sets.
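The span-pruning step can be illustrated in isolation. The scores below are toy numbers standing in for biaffine scores over encoder features; only the enumerate-score-prune pattern comes from the text:

```python
def topk_spans(span_scores, k):
    """Keep the k highest-scoring candidate spans (i, j)."""
    return sorted(span_scores, key=span_scores.get, reverse=True)[:k]

# Score every candidate span (i, j), i <= j, of a 3-token sentence.
span_scores = {(0, 0): 0.1, (0, 1): 0.9, (0, 2): 0.2,
               (1, 1): 0.4, (1, 2): 0.7, (2, 2): 0.3}
kept = topk_spans(span_scores, k=2)  # only these reach the span classifier
```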

Auto-annotation operates in two modes: high-precision regex matching for known entities, and full ML inference for draft annotation of unlabelled data. The iterative workflow—annotation, model training, auto-annotation—enables continuous enrichment and refinement of the knowledge base (Tanaka et al., 2024).
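The high-precision regex mode can be sketched as lexicon matching; the entity names and output fields below are illustrative, not KnowledgeHub's actual schema:

```python
import re

# Lexicon of known entities (names illustrative).
lexicon = {"LiFePO4": "Material", "conductivity": "Property"}
pattern = re.compile("|".join(re.escape(term) for term in lexicon))

def auto_annotate(text):
    """Emit draft annotations for every lexicon match in the text."""
    return [{"start": m.start(), "end": m.end(),
             "text": m.group(), "type": lexicon[m.group()]}
            for m in pattern.finditer(text)]

drafts = auto_annotate("LiFePO4 shows high conductivity.")
```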

All extracted entities and relations are instantiated in a property knowledge graph. The formal entity-relation triple set is:

T = { (e_i, r, e_j) | e_i, e_j ∈ V, r = ℓ_E(e_i, e_j) ∈ userRelations }

Incremental propagation, versioning, and rollback of KG state are supported via backend APIs interfacing with Neo4j.
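A sketch of the kind of backend call that could propagate triples into Neo4j, rendered as Cypher MERGE statements (a common idempotent-upsert idiom; this is an assumption about the implementation, and real code would use parameterized queries rather than string interpolation):

```python
def triples_to_cypher(triples):
    """Render (e_i, r, e_j) triples as idempotent Cypher MERGE statements."""
    return [f"MERGE (a:Entity {{name: '{ei}'}}) "
            f"MERGE (b:Entity {{name: '{ej}'}}) "
            f"MERGE (a)-[:{r}]->(b)"
            for ei, r, ej in triples]

stmts = triples_to_cypher([("LiFePO4", "hasProperty", "conductivity")])
```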

4. Versioning, DOI Minting, and Provenance

PKHs enforce strict versioning and permanent identification of knowledge artifacts. In ORKG, each published Paper or Comparison is snapshotted, made immutable, and assigned a DOI via DataCite (DOI = 10.48366/R{internal-entity-ID}), with each update generating a new version (e.g., V0.1, V0.2) and potential new DOI (Haris et al., 2022). Metadata uploads include cross-referenced "relatedIdentifiers" such as ORIGINAL_ARTICLE_DOI and prior versions, supporting provenance chains.

Versioning architecture:

  • On publish, graph slices are exported to version tables (PostgreSQL+JSON).
  • Snapshots are immutable and linked to PIDs/DOIs.
  • Edits necessitate new version/DOI, with explicit provenance linking (<relatedIdentifier> of type IsNewVersionOf/IsPreviousVersionOf).
  • Full version/provenance chains are navigable both internally (previousVersionId) and externally (DataCite, PID Graph).
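The provenance chain navigation described above can be sketched over mocked metadata records. Only the IsNewVersionOf relation type comes from the text; the record shape and all identifiers are placeholders:

```python
# Hypothetical DataCite-style records; identifiers are placeholders.
records = {
    "10.48366/R100.v2": {"version": "V0.2", "relatedIdentifiers": [
        {"relationType": "IsNewVersionOf",
         "relatedIdentifier": "10.48366/R100.v1"}]},
    "10.48366/R100.v1": {"version": "V0.1", "relatedIdentifiers": []},
}

def provenance_chain(records, doi):
    """Follow IsNewVersionOf links backwards to recover the full chain."""
    chain = [doi]
    while True:
        prev = [ri["relatedIdentifier"]
                for ri in records[chain[-1]].get("relatedIdentifiers", [])
                if ri["relationType"] == "IsNewVersionOf"]
        if not prev:
            return chain
        chain.append(prev[0])

chain = provenance_chain(records, "10.48366/R100.v2")
```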

This approach renders each knowledge object globally citable, permanently resolvable, and auditably versioned.

5. Interoperability and Integration with Scholarly Infrastructure

Persistent Knowledge Hubs systematically interlink machine-actionable knowledge with global open scholarly infrastructures. By leveraging the DataCite PID ecosystem, published ORKG Papers and Comparisons propagate their metadata (authors, DOIs, licensing, provenance, citations) to Crossref, OpenAIRE (via OAI-PMH), ORCID, and the PID Graph, ensuring discoverability and alignment with community standards such as schema.org, DataCite metadata, and PROV-O ontologies.

Typical flows:

  • Author creates a structured Paper in ORKG, assigns metadata and references, and publishes to mint a DOI.
  • DataCite disseminates the object’s metadata (example in the DataCite Kernel-4 schema) to all integrated infrastructures.
  • GraphQL queries to PID Graph recover all machine-actionable descriptions, citations, or comparisons of a given scholarly object (e.g., by "citations.nodes.id").
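The citation-recovery flow might look like the following. The query string follows the "citations.nodes.id" path mentioned above but is not a verified PID Graph schema, and the response is mocked with placeholder identifiers:

```python
# Illustrative PID Graph-style GraphQL query (schema not verified).
query = """
{
  work(id: "https://doi.org/10.48366/R12345") {
    citations { nodes { id } }
  }
}
"""

def citation_ids(response):
    """Extract citing-object identifiers from a (mocked) GraphQL response."""
    return [n["id"] for n in response["data"]["work"]["citations"]["nodes"]]

mock_response = {"data": {"work": {"citations": {"nodes": [
    {"id": "https://doi.org/10.1000/example.1"}]}}}}
ids = citation_ids(mock_response)
```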

All metadata and knowledge graph data are available in machine- and human-consumable formats (JSON-LD, XML), supporting programmatic queries, dashboard integration, and seamless scholarly communication (Haris et al., 2022).

6. Retrieval, Question Answering, and Graph-Based Insights

PKHs such as KnowledgeHub extend beyond static curation, embedding document passages for semantic retrieval and enabling LLM-grounded question answering. Each paragraph is embedded (e.g., using all-mpnet-base-v2), and user queries are embedded in the same latent space. Semantic similarity (cosine) ranks passages; a fixed top-k are retrieved and provided as context to LLMs (e.g., Llama-2, Mistral) via prompt templates.
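The rank-and-truncate retrieval step can be sketched with toy 2-d vectors standing in for all-mpnet-base-v2 embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_vec, passage_vecs, k=2):
    """Rank passages by cosine similarity and keep a fixed top-k."""
    ranked = sorted(passage_vecs,
                    key=lambda pid: cosine(query_vec, passage_vecs[pid]),
                    reverse=True)
    return ranked[:k]

passages = {"p1": [1.0, 0.0], "p2": [0.7, 0.7], "p3": [0.0, 1.0]}
context = retrieve([1.0, 0.1], passages, k=2)  # passed to the LLM prompt
```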

LLM output comprises:

  • Unified natural language answers summarizing the retrieved contexts.
  • Per-paragraph provenance-rooted answers.
  • Extraction and visualization of KG subgraphs for all discovered entities and relations.

This architecture grounds QA in retrieved evidence (RAG), reduces hallucination, exposes structured provenance, and enables the user to iteratively refine the underlying semantic pipeline (edit ontology → re-annotate → re-train → re-query) (Tanaka et al., 2024).

7. Evaluation, FAIR Principles, and Usage Workflows

PKHs are explicitly designed to fulfill the FAIR principles:

  • Findable: Globally unique, resolvable DOIs with rich metadata in DataCite, OpenAIRE, PID Graph.
  • Accessible: Metadata and full machine-actionable descriptions are available via open REST, GraphQL, and landing pages.
  • Interoperable: Ontology alignment (schema.org, PROV-O), formal graph models, and metadata payloads in standard formats.
  • Reusable: Transparent versioning with explicit provenance, unambiguous creator identification (ORCID), CC-BY licensing, and snapshot citation.

Workflows supported include authoring and publishing new structured "ORKG Papers," updating existing descriptions (triggering versioning and new DOIs), discovery/citation via PID queries, exporting and comparing structured knowledge, and automated or manual curation workflows (Haris et al., 2022).

A plausible implication is that PKHs, with their persistent, interoperable, versioned, and machine-actionable semantic infrastructure, lay foundations for scalable meta-analyses, reproducible science, and the integration of open scientific assertions into computational pipelines.
