
Scholarly Knowledge Graphs Overview

Updated 13 December 2025
  • Scholarly Knowledge Graphs are structured, machine-actionable representations capturing entities and relationships in scientific literature for enhanced discovery.
  • Construction methodologies integrate heterogeneous data sources using ETL pipelines, NLP, and ontology mapping to enrich and standardize research artifacts.
  • Applications include faceted search, federated queries, recommender systems, and bibliometric analyses, driving advanced knowledge discovery and analytics.

Scholarly Knowledge Graphs (SKGs) are structured, machine-actionable representations of entities and relationships present in scientific literature, encompassing research articles, authors, institutions, datasets, software, methods, and diverse domain-specific entities. By semantically modeling research artifacts, SKGs underpin a new generation of knowledge-driven services for retrieval, comparison, enrichment, recommendation, and knowledge discovery in digital scholarly communication.

1. Formal Definition and Data Models

SKGs are typically modeled as edge-labeled, multi-relational graphs or as RDF(-like) triple stores. A canonical representation is a labeled directed graph G = (V, E, λ_N, λ_E) where:

  • V is the set of nodes (entities), e.g., Paper, Author, Institution, Method.
  • E ⊆ V × V is the set of directed edges.
  • λ_N labels nodes (URIs for resources, literals for data values).
  • λ_E labels edges with predicates (properties).

Advanced models introduce qualified edges (n-ary relations with annotations) as in RDF* or attribute-augmented triple models:

  • Main edges: binary relations (h, r, t).
  • Qualifier edges: (r_q, o_q) annotate main edges to capture provenance, temporal, or quality information (e.g., citation year, author position).
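A minimal Python sketch of this qualified-edge model (entity names, relations, and fields here are illustrative, not drawn from any cited system):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualifiedEdge:
    """A main edge (h, r, t) annotated with qualifier pairs (r_q, o_q)."""
    head: str
    relation: str
    tail: str
    qualifiers: tuple = ()  # pairs of (qualifier_relation, qualifier_object)


def qualifier(edge, r_q):
    """Look up a qualifier value on an edge, or None if absent."""
    return dict(edge.qualifiers).get(r_q)


# A citation edge qualified with citation year and author position.
edge = QualifiedEdge(
    head="paper:123",
    relation="cites",
    tail="paper:456",
    qualifiers=(("citation_year", 2021), ("author_position", 1)),
)
```

Storing qualifiers alongside the main triple mirrors how RDF-star annotates a quoted triple without reifying it into extra nodes.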

Major public SKGs, such as PubGraph (Ahrabian et al., 2023), ORKG (Jaradeh et al., 2020, Oelen et al., 2020, Hussein et al., 2022), and EMAKG (Pollacci, 2022), feature tens to hundreds of millions of entities and billions of edges, mapped onto ontologies like Wikidata, QUDT, PIDINST, or custom schemas.

Entity types and relationships span:

  • Works (papers), authors, institutions, venues, concepts, datasets, software, research instruments, results, and methods.
  • Edges: authorship, publication, citation, affiliation, field-of-paper, coauthor, method-used, reported-result, produced-by-instrument, and many others.

2. Construction Methodologies and Pipelines

SKG construction integrates heterogeneous data sources (e.g., OpenAlex, MAG, Wikidata, DataCite, ORCID, PANGAEA, Zenodo, software repositories) using ETL pipelines, ontology mapping, static and dynamic code analysis, NLP, and expert curation.

Data Integration & Ontology Alignment

  • Ingestion from source-specific APIs and dumps (JSON, RDF, CSV), harmonization to unified ontologies (e.g., mapping OpenAlex fields to Wikidata P-numbers (Ahrabian et al., 2023)).
  • Entity resolution via persistent identifiers (DOI, ORCID, ROR, DBpedia URI), matching on metadata, and secondary matching (title, affiliation, author name disambiguation) (Pollacci, 2022, Ahrabian et al., 2023).
  • Alignment of domain-specific attributes (e.g., instrument metadata in PIDINST (Haris et al., 17 Jul 2025), quantities and units in QUDT (Heidari et al., 6 Dec 2025)).
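The identifier-first resolution strategy above can be sketched as follows; this is a simplified illustration, not the cited pipelines, which add affiliation matching and author-name disambiguation:

```python
import re


def normalize_doi(raw):
    """Strip common DOI URL prefixes and lowercase for comparison."""
    doi = raw.strip().lower()
    return re.sub(r"^https?://(dx\.)?doi\.org/", "", doi)


def same_entity(rec_a, rec_b):
    """Resolve two metadata records: exact persistent-identifier match
    first, then a crude normalized-title fallback."""
    a, b = rec_a.get("doi"), rec_b.get("doi")
    if a and b:
        return normalize_doi(a) == normalize_doi(b)
    ta = rec_a.get("title", "").casefold().strip()
    tb = rec_b.get("title", "").casefold().strip()
    return bool(ta) and ta == tb
```

Preferring persistent identifiers (DOI, ORCID, ROR) keeps resolution deterministic; metadata matching only fills in when identifiers are missing.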

Semantic Enrichment and Extraction

  • Static code analysis and AST parsing of published software to extract data artifacts, computational techniques, and result structures, linked back to scholarly articles (Haris et al., 2023).
  • Table recognition from survey articles and comparison tables: extraction with tools like Tabula and GROBID, normalization and mapping to knowledge graph nodes (Oelen et al., 2020).
  • NLP techniques: named entity recognition (BERT-CRF), relation extraction, and fine-tuned LLMs for concept and property labeling (Haris et al., 17 Jul 2025, Rabby et al., 10 Sep 2024).
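A minimal sketch of AST-based artifact extraction for Python sources; the heuristics are illustrative stand-ins for the richer static analysis in the cited work:

```python
import ast


def extract_artifacts(source):
    """Collect imported modules and method-call names from Python
    source code, a crude proxy for the computational techniques used."""
    tree = ast.parse(source)
    imports, calls = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.add(node.module)
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            calls.add(node.func.attr)
    return sorted(imports), sorted(calls)


code = """
import numpy
from sklearn.cluster import KMeans
model = KMeans(3)
model.fit(data)
"""
```

Because the code is parsed rather than executed, artifacts can be extracted safely from arbitrary published repositories.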

Human-in-the-Loop and Curation

  • Manual validation, alignment of property labels, and quality-control pipelines.
  • Peer-review forms and maturity models (e.g., KGMM's five-stage model (Hussein et al., 2022)) using Essential/Important/Useful quality measures for completeness, correctness, provenance, and linkability.

3. Querying, Search, and Contextual Enrichment

SKGs enable advanced retrieval scenarios beyond classical metadata search:

  • Faceted search: dynamic computation of property-based facets (e.g., method, dataset, result, location, date) in response to user queries (Heidari et al., 2021, Heidari et al., 6 Dec 2025). Facets may be generated on the fly and support granularity adjustment, taxonomic faceting via external KGs (GeoNames), and integration of units (QUDT, UCUM).
  • Federated query services: GraphQL-based virtual integration over distributed scholarly infrastructures (DataCite, OpenAIRE, Semantic Scholar, Wikidata, Altmetric), enabling contextual widgets in paper profiles, contributor pages, and comparison views (Haris et al., 2022).
  • Question answering: neural table QA systems (e.g., JarvisQA) leverage triple-to-text conversions and transformer models to answer cell-level, aggregation, and conditional questions over SKG-derived comparison tables (Jaradeh et al., 2020).
  • Programmatic access: SPARQL, GraphQL, REST endpoints, custom APIs for accessing and traversing the full spectrum of scholarly graph content (Liu et al., 2022, Haris et al., 2022).
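The dynamic facet computation described above can be sketched minimally: given a result set and a triple store, group property values over the matching entities and count them (entity and property names here are illustrative):

```python
from collections import defaultdict


def compute_facets(triples, result_entities):
    """Derive property-based facets on the fly: for each property used
    by an entity in the result set, count its distinct values."""
    facets = defaultdict(lambda: defaultdict(int))
    for s, p, o in triples:
        if s in result_entities:
            facets[p][o] += 1
    return {p: dict(vals) for p, vals in facets.items()}


triples = [
    ("paper:1", "method", "CNN"),
    ("paper:2", "method", "CNN"),
    ("paper:2", "dataset", "MNIST"),
    ("paper:3", "method", "SVM"),  # outside the result set below
]
facets = compute_facets(triples, {"paper:1", "paper:2"})
```

Computing facets from the query's own result set, rather than from a fixed schema, is what allows them to adapt to heterogeneous scholarly properties.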

4. Representation, FAIR Principles, and Quality Models

The organization and publication of SKGs are governed by interoperability, accessibility, and data quality principles:

  • FAIR Compliance: Each comparison or artifact is assigned a persistent identifier, with rich interoperable metadata (RDF, JSON-LD, DataCube, Dublin Core), open licenses, and provenance (Oelen et al., 2020).
  • Maturity Models: KGMM's staged model enforces the progression from basic publication to completeness, representation, stability, and linkability. Pass/fail regimes on 20 quality measures—responsiveness, correctness, provenance, completeness, reusability, identifier stability—structure continuous improvement (Hussein et al., 2022).
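A toy sketch of a staged pass/fail regime; the stage names and measures below are hypothetical, loosely modeled on the idea of KGMM's staged progression rather than its actual 20 measures:

```python
def maturity_stage(results, stages):
    """Return the highest stage whose required measures all pass.
    `stages` is an ordered list of (stage_name, required_measures),
    from most basic to most advanced; `results` maps measure -> bool."""
    reached = None
    for stage, measures in stages:
        if all(results.get(m, False) for m in measures):
            reached = stage
        else:
            break  # stages are cumulative; a failure caps the level
    return reached


# Hypothetical staged quality measures.
stages = [
    ("published", ["has_identifier"]),
    ("complete", ["has_identifier", "metadata_complete"]),
    ("linkable", ["has_identifier", "metadata_complete", "external_links"]),
]
```

The cumulative pass/fail structure makes a graph's maturity level auditable: each stage is reached only when every lower-stage measure also holds.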

5. Knowledge Graph Completion

SKGs are inherently incomplete due to the rapid expansion and heterogeneity of scholarly output. Completion methods include:

  • Embedding models:
    • TransE, ComplEx, TransH, TransR, TransESM (Soft Marginal TransE) tailored to scholarly KGs, with margin-based or soft-margin losses and performance up to 99.9% Hit@10 on domain-specific graphs (Nayyeri et al., 2019).
    • Quaternion-based Trans4E for KGs with extreme N-to-M relation imbalance (e.g., millions of articles to tens of topics), outperforming competing methods in low and high dimensions on scholarly and general KGs (Nayyeri et al., 2021).
    • Community-based negative sampling and adversarial splits for large-scale benchmarks (transductive/inductive KGC, zero-shot) in PubGraph (Ahrabian et al., 2023).
  • Transformer-based models:
    • exBERT uses SciBERT-based triple sequence classification, leveraging scholarly context via type and synonym augmentation. Achieves superior link, relation, and triple classification accuracy over KG-BERT and KGE baselines; e.g., 97.1% accuracy on ORKG21, 96.0% on PWC21 (Jaradeh et al., 2021).
    • LLMs for predicate and property recommendation, with methods for knowledge injection via CKGs, domain-adapted prompting/fine-tuning, and empirical gains up to +9 pp MAS on research field and predicate tasks (Rabby et al., 10 Sep 2024).
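The translational scoring function underlying TransE and its variants can be sketched directly; the embeddings below are toy, untrained values for illustration only:

```python
import math


def transe_score(h, r, t):
    """TransE plausibility score: negative L2 distance ||h + r - t||.
    Scores closer to zero indicate a more plausible triple."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


# Toy 3-dimensional embeddings (illustrative, not trained values).
paper = [0.1, 0.2, 0.0]
cites = [0.3, -0.1, 0.5]
cited = [0.4, 0.1, 0.5]

good = transe_score(paper, cites, cited)     # h + r lands near t
bad = transe_score(paper, cites, [9, 9, 9])  # far from t
```

Training pushes observed triples toward h + r ≈ t while margin-based (or soft-margin, as in TransESM) losses push corrupted triples away.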

6. Applications: Comparison, Discovery, and Analytics

SKGs facilitate:

  • Research contribution comparison and surveys: Alignment algorithms group semantically equivalent properties across papers, supporting interactive tabular comparison, property clustering, and export (CSV, RDF, LaTeX) (Oelen et al., 2020).
  • Instrument and method analytics: KGs link instruments (PIDINST) to datasets, calibrations, and publications, enabling queries over their role, usage, and impact (usage/network-centrality scores) (Haris et al., 17 Jul 2025).
  • Unit harmonization: Semantic normalization of measured data across articles using QUDT/UCUM ontologies and external conversion services enables structured search and comparison of scientific measurements across studies (Heidari et al., 6 Dec 2025).
  • Recommender systems: Collaborator, advisor, and related-scholars recommendation via random walk (MVCWalker), feature matching, and node embedding similarity (Liu et al., 2022).
  • Global analytics: EMAKG introduces author mobility and migration flows, ego network analysis, h-index computation, and linguistic feature enrichment at scale for science-of-science and bibliometric studies (Pollacci, 2022).
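The h-index computation mentioned for EMAKG follows the standard definition; a minimal sketch:

```python
def h_index(citation_counts):
    """Largest h such that at least h papers have >= h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h
```

Over an SKG, the input list is simply the citation in-degree of each Work node authored by a given Author node, so the metric reduces to a degree query plus this fold.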

7. Evaluation, Benchmarks, and Limitations

Evaluation in SKG research is multifaceted, combining link-prediction and classification benchmarks for completion models (e.g., the transductive, inductive, and zero-shot KGC splits of PubGraph), answer accuracy for QA systems such as JarvisQA, and pass/fail quality measures from curation maturity models such as KGMM. Persistent limitations include the inherent incompleteness of SKGs and the manual effort still required for curation and validation.

Scholarly Knowledge Graphs have established themselves as foundational infrastructure for computational science, digital libraries, and meta-research. Their ongoing development integrates advances in database systems, distributed querying, semantic web, NLP, representation learning, and human-machine collaboration, continually extending the range, depth, and FAIRness of academic knowledge management (Oelen et al., 2020, Hussein et al., 2022, Ahrabian et al., 2023, Heidari et al., 6 Dec 2025, Rabby et al., 10 Sep 2024).
