Knowledge Graph Integration

Updated 12 September 2025
  • Knowledge graph integration is the process of consolidating diverse datasets into a unified semantic structure using nodes, edges, and global ontologies.
  • It employs paradigms such as ontology-driven integration, entity and schema alignment, and automated semantic enrichment to ensure efficient mapping and quality control.
  • The approach supports advanced reasoning and analytics across domains like biomedicine, geospatial analytics, and recommender systems, enabling scalable and interoperable data applications.

Knowledge graph integration is the process of consolidating heterogeneous, often semantically misaligned datasets—structured, semi-structured, and unstructured—into a unified, semantically meaningful representation based on nodes (entities), edges (relations), and, typically, a global ontology. This integration is central to generating KGs that support advanced reasoning, data analytics, and scientific discovery in domains ranging from biomedicine and geospatial analytics to recommender systems, robotics, and industrial applications.

1. Integration Paradigms

The foundational integration paradigms in knowledge graph construction involve various transformation and alignment strategies:

  • Ontology-Driven Integration: Approaches such as ConMap (Jozashoori et al., 2018) define a global ontology that provides the schema and integration logic. Mapping rules are created at the class (rather than attribute) level, enabling simultaneous semantification, curation, normalization, and integration of input data. This paradigm is akin to the Global-As-View (GAV) model, with the ontology acting as the semantic integration backbone.
  • Entity and Schema Alignment: Integration systems including RecKG (Kwon et al., 7 Jan 2025), OntoMerger (Geleta et al., 2022), and KnowWhereGraph (Zhu et al., 19 Feb 2025) standardize attribute names, resolve duplicates, and apply mapping functions for consistent node representation. Techniques span rule-based string normalization, preference-based merging (canonical targeting), and shortest-path-based connectivity of disconnected concepts.
  • Automated Extraction and Semantic Enrichment: Integration pipelines frequently combine NER, relationship extraction (e.g., using LLMs or REBEL models (Mohamed et al., 17 Dec 2024)), and linking to established identifiers (UMLS, Wikidata, S2 cell IDs (Zhu et al., 19 Feb 2025)), producing unified KG representations from disparate sources such as biomedical literature (Nentidis et al., 2019), textual corporate assets (Mihindukulasooriya et al., 2022), or image/video metadata.

These paradigms underpin robust KG integration and enable effective downstream querying and analysis.
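
The attribute-standardization step of entity and schema alignment can be sketched as follows. This is a minimal illustration, not the actual RecKG implementation; the attribute names, `ATTRIBUTE_MAP`, and record shapes are hypothetical.

```python
# Hypothetical per-source mapping of local attribute names to a unified schema.
ATTRIBUTE_MAP = {
    "movielens": {"movieTitle": "title", "year": "release_date"},
    "imdb": {"name": "title", "released": "release_date"},
}

def standardize(source: str, record: dict) -> dict:
    """Apply f(a) = a_std: rename source-specific attributes to the shared schema."""
    mapping = ATTRIBUTE_MAP[source]
    return {mapping.get(k, k): v for k, v in record.items()}

def merge_nodes(records: list) -> dict:
    """Union records that agree on the key identifier (title, release_date)."""
    merged = {}
    for rec in records:
        key = (rec["title"], rec["release_date"])
        merged.setdefault(key, {}).update(rec)
    return merged

a = standardize("movielens", {"movieTitle": "Heat", "year": "1995"})
b = standardize("imdb", {"name": "Heat", "released": "1995", "director": "Mann"})
nodes = merge_nodes([a, b])
# Both source records collapse into one node keyed by (title, release_date).
```

After standardization, node union reduces to a dictionary merge on the key identifier, which is why consistent attribute naming is a precondition for deduplication.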

2. Technical Methods and Mapping Workflows

Practical integration consists of concrete, multi-stage workflows that process and harmonize data:

  • Class-Based Mapping: ConMap (Jozashoori et al., 2018) maps all attributes of a given class in a data record to a single RDF node, as opposed to generating disjoint triples for each attribute. This consolidated mapping rule can be represented as:

\text{MappingRule}(C): \forall r \in C,\ \text{emit}\ \{(a_1, r.a_1), \ldots, (a_n, r.a_n)\},\ r\ \text{rdf:type}\ C

  • Schema Harmonization via Standardized Attribute Mapping: RecKG (Kwon et al., 7 Jan 2025) implements a mapping function f(a) = a_std, ensuring that varying source data attributes are transformed to a unified schema prior to node merging. This enables seamless node union based on key identifiers (e.g., movie title and release date across datasets).
  • Ontology and Hierarchy Merging: OntoMerger (Geleta et al., 2022) deduplicates by producing canonical merge sets from mapping triples [c, ⇒, c'], then reconstructs a single connected hierarchy via shortest-path operations within the discrete hierarchy edge set.
  • Virtual Knowledge Graphs and Declarative Mappings: For structured data (e.g., 3DCityDB in CityGML integration (Ding et al., 2023)), declarative SQL–RDF mapping frameworks map relational tuples into ontological triples, exposing a "virtual" knowledge graph that is queryable via GeoSPARQL.

A representative mapping formula in this context is:

M: (\text{table\_row}, \text{concept}) \mapsto \langle \text{subject}, \text{predicate}, \text{object} \rangle

enabling relational data to inhabit a semantically enriched view.
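
A class-based mapping rule of the kind the formulas above describe can be sketched in a few lines. This is an illustrative simplification in the spirit of ConMap, not its actual code; the `ex:` prefix, `id_attr` parameter, and record layout are assumptions.

```python
# Sketch of a class-based mapping rule: all attributes of a record are
# emitted against a single subject node (plus one rdf:type assertion),
# rather than as disjoint per-attribute mappings.

def mapping_rule(cls: str, records: list, id_attr: str) -> list:
    """For each record r in class C, emit (subject, predicate, object)
    triples sharing one subject node, following MappingRule(C)."""
    triples = []
    for r in records:
        subject = f"ex:{cls}/{r[id_attr]}"
        triples.append((subject, "rdf:type", f"ex:{cls}"))
        for attr, value in r.items():
            if attr != id_attr:
                triples.append((subject, f"ex:{attr}", value))
    return triples

rows = [{"id": "42", "title": "Heat", "release_date": "1995"}]
triples = mapping_rule("Movie", rows, "id")
# One subject node carries the type assertion and both attribute triples.
```

Because every attribute of a record flows through one rule invocation, semantification, curation, and normalization can be applied in a single pass over the class rather than once per attribute.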

3. Scalability, Efficiency, and Data Quality

Large-scale integration poses challenges in computation, deduplication, connectivity, and data validation:

  • Scalable Processing: Class-based (ConMap) and graph-database implementations (e.g., Neo4j as in iASiS (Nentidis et al., 2019), RecKG) mitigate linear growth in processing time as data complexity scales. Experiments show class-based workflows reduce RDFization time by up to 70% over attribute-based methods (Jozashoori et al., 2018).
  • Data Quality Control: Integration frameworks such as OntoMerger (Geleta et al., 2022) incorporate analytic tools (Pandas Profiling, Great Expectations) to automate the detection of duplicates, validating both schema alignment and instance-level merges. Control loops and SHACL/ShEx validation are employed for recursive refinement in deployment pipelines (Meckler, 20 Sep 2024).
  • Incremental and Real-time Integration: Automated pipelines like iASiS’ incremental harvester (Nentidis et al., 2019) and streaming ingestion in KnowWhereGraph (Zhu et al., 19 Feb 2025) maintain KG currency and support flexible querying by periodically integrating new literature, structured resources, or sensor data.
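
A duplicate-detection pass of the kind such frameworks automate can be sketched as follows. This is a deliberately minimal illustration (key-based bucketing after normalization); the real OntoMerger pipeline works over mapping triples and richer heuristics, and the node fields here are hypothetical.

```python
# Illustrative duplicate detection: group node IDs that agree on all
# key attributes after simple string normalization.
from collections import defaultdict

def find_duplicates(nodes: list, key_attrs: tuple) -> list:
    """Return groups of node IDs whose normalized key attributes collide."""
    buckets = defaultdict(list)
    for n in nodes:
        key = tuple(str(n[a]).strip().lower() for a in key_attrs)
        buckets[key].append(n["id"])
    return [ids for ids in buckets.values() if len(ids) > 1]

nodes = [
    {"id": "n1", "label": "Aspirin "},
    {"id": "n2", "label": "aspirin"},
    {"id": "n3", "label": "Ibuprofen"},
]
dupes = find_duplicates(nodes, ("label",))  # [['n1', 'n2']]
```

In practice this kind of check is wired into validation loops (e.g., alongside SHACL/ShEx constraints) so that merge candidates are surfaced for review rather than silently collapsed.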

4. Challenges, Limitations, and Human–Machine Collaboration

Several fundamental challenges are prevalent in KG integration:

  • Semantic Heterogeneity and Mismatches: Problems include inconsistent attribute naming, overlapping or contradictory ontologies, or ambiguity in concept boundaries. Solutions involve standardization (RecKG, OntoMerger), canonical naming, and coordinated mapping schemes.
  • Deduplication and Connectivity: Disconnected subgraphs and duplicate entities/relations are resolved through merging algorithms and connectivity augmentation (e.g., directing unmapped nodes into the main hierarchy via shortest-path heuristics (Geleta et al., 2022)).
  • Handling Dynamic and Heterogeneous Data: Highly dynamic domains (industry, robotics, live data fusion) require continuous, often expert-in-the-loop, adjustment of mappings, curatorial decisions, and quality checks — e.g., Kyurem's interactive programming widgets for notebook-based integration (Rahman et al., 5 Feb 2024).
  • Limitations in Automated Methods: Automation can yield erroneous or semantically misleading triples, necessitating human expert review, especially during relation extraction, NER, and in high-stakes domains (Mohamed et al., 17 Dec 2024, Rahman et al., 5 Feb 2024).

Advanced integration workflows recognize the value of balancing automated extraction and curation with domain-expert oversight, enabling iterative improvement and correction.
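
The shortest-path connectivity augmentation mentioned above can be sketched with a plain BFS. This is a simplified illustration under assumed data structures (an adjacency dict of candidate mapping edges and a set of in-hierarchy nodes); OntoMerger's actual heuristics are more involved.

```python
# Sketch: reattach a node left disconnected after merging by finding the
# shortest path over candidate mapping edges to any node already in the
# main hierarchy; the edges along that path are then added to the graph.
from collections import deque

def shortest_path(edges: dict, start: str, targets: set) -> list:
    """BFS from a disconnected node to the nearest in-hierarchy node."""
    seen, queue = {start}, deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] in targets:
            return path
        for nxt in edges.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return []  # no connecting path found

candidate_edges = {"orphan": ["x"], "x": ["root"]}
hierarchy_nodes = {"root"}
path = shortest_path(candidate_edges, "orphan", hierarchy_nodes)
# ['orphan', 'x', 'root'] — these edges reconnect the orphan subgraph.
```

Using the shortest available path keeps the augmentation conservative: it adds the fewest new edges needed to restore a single connected hierarchy.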

5. Applications Across Domains

Knowledge graph integration supports a wide variety of domain-specific and cross-domain applications:

  • Biomedicine: Automated harvesting and integration of literature and database resources, often mapped to standard vocabularies (UMLS in iASiS (Nentidis et al., 2019); ontology-driven curation in ConMap (Jozashoori et al., 2018)), yield KGs supporting biomarker discovery, pharmacovigilance, and precision medicine.
  • Recommender Systems: Unified attribute mapping and graph database systems (RecKG (Kwon et al., 7 Jan 2025)) enable the merger of user, item, and interaction data for higher-order semantic discovery, explainable recommendation, and interoperability between datasets.
  • Geospatial Analytics: KnowWhereGraph (Zhu et al., 19 Feb 2025) provides a modular, spatially indexed schema (using S2 DGGS and GeoSPARQL) for integrating datasets on disasters, environment, and public health, enabling queries across multiple spatial and temporal scales.
  • Industry/Enterprise KGs: Procedure models (Meckler, 20 Sep 2024) and induction frameworks (Mihindukulasooriya et al., 2022) formalize iterative, competency question-driven approaches and combine structured and textual data for business analytics, risk assessment, and process optimization.
  • Robotics and 3D Mapping: Integration of sensor data, language-model embeddings, and scene semantics (using visual–textual fusion via deep learning (Igelbrink et al., 27 Nov 2024)) realizes online, adaptive knowledge integration directly into semantic navigation and scene understanding.
  • LLM-guided Information Seeking: Systems such as KNOWNET (Yan et al., 18 Jul 2024) and GMeLLo (Chen et al., 28 Aug 2024) leverage integration between LLMs (extracting and annotating triples from text) and validated KGs for tasks such as trustworthy health information seeking and precise multi-hop QA with up-to-date facts.

6. Evaluation, Metrics, and Future Directions

  • Integration Evaluation: Typical metrics include process time reduction (e.g., 70%+ speedup over baseline (Jozashoori et al., 2018)), reduction in duplicates (e.g., ~30% reduction in OntoMerger (Geleta et al., 2022)), attribute coverage and interoperability rating (RecKG), data property completeness (e.g., BDAKG (Nath et al., 18 Mar 2024)), and end-to-end improvement in downstream tasks (e.g., F1 gain in MT-CNN with retrofitted embeddings (Alawad et al., 2021)).
  • Quality Assurance Mechanisms: Adoption of SHACL, provenance modeling (PROV-O), and faceted search (for user feedback and trend analysis) ensures operational quality and facilitates iterative improvements.
  • Open Challenges: Dynamic, cross-domain integration remains complex, with outstanding issues in node resolution, semantic enrichment, complex mapping rule design, and efficient handling of scaling and update cycles (Geleta et al., 2022, Ilievski et al., 2020, Mihindukulasooriya et al., 2022, Zhu et al., 19 Feb 2025).

Anticipated advances include deeper fusion of symbolic and neural models, enhancement of auxiliary semantic layers (e.g., via LLM-driven extraction and validation), and broader adoption of human-centered, programmable integration interfaces.


Knowledge graph integration is a multi-layered, technically sophisticated process that marshals ontological modeling, mapping, standardization, deduplication, and pattern-driven reasoning to solve the core challenge of unifying complex heterogeneous data into coherent, queryable, and semantically rich structures. The body of literature underscores the need for comprehensive, scalable, and evolving integration workflows spanning diverse scientific and industrial domains, tightly coupled with mechanisms for quality control, human oversight, and expansion into new modalities and reasoning frameworks.
