Knowledge graphs have emerged as a powerful tool for integrating, managing, and extracting value from diverse, dynamic, large-scale data. They represent real-world entities and their relations as a graph, offering flexibility compared to traditional relational models, particularly for incomplete or evolving data. This paper provides a comprehensive overview of knowledge graphs, covering their foundational data models, query languages, schema and identity management, deductive and inductive reasoning techniques, creation and refinement methods, publication strategies, and real-world applications.
At the core of any knowledge graph is a graph-based data model. Common models include directed edge-labelled graphs (like RDF), where nodes represent entities and directed labelled edges represent binary relations; heterogeneous graphs, where nodes and edges also have types; and property graphs, which allow arbitrary key-value pairs on nodes and edges. Graph datasets allow managing multiple graphs (named graphs) alongside a default graph, useful for provenance or integrating data from different sources. Efficient storage and indexing techniques are crucial for handling large graphs, often involving relational mappings (triple tables, vertical/property partitioning) or custom graph-native structures, sometimes distributed across machines.
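As a minimal illustration of these models (all identifiers and values below are invented), a directed edge-labelled graph can be held as a set of subject-predicate-object triples, while a property graph additionally attaches key-value pairs to its nodes and edges:

```python
# A directed edge-labelled graph as a set of (subject, predicate, object) triples.
del_graph = {
    ("Santiago", "capital_of", "Chile"),
    ("Santiago", "type", "City"),
}

# A property graph: nodes and edges carry identifiers, a type/label,
# and arbitrary key-value pairs.
property_graph = {
    "nodes": {
        "n1": {"label": "City", "properties": {"name": "Santiago"}},
        "n2": {"label": "Country", "properties": {"name": "Chile"}},
    },
    "edges": {
        "e1": {"from": "n1", "to": "n2", "label": "capital_of",
               "properties": {"since": 1541}},
    },
}
```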
Querying knowledge graphs relies on primitives like graph patterns (basic matching of graph structures allowing variables), which can be combined using relational operators (projection, selection, join, union, difference, anti-join) to form complex graph patterns. A distinguishing feature is support for navigational graph patterns using path expressions (regular expressions over edge labels) to match arbitrary-length paths. Query languages like SPARQL for RDF and Cypher/Gremlin for property graphs implement these primitives, sometimes including additional features like aggregation, federation, or support for semantic entailment.
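A small sketch of these primitives, assuming the rdflib Python package and an invented example.org vocabulary, combines a basic graph pattern with a property path to match arbitrary-length paths:

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Santiago, EX.capital_of, EX.Chile))
g.add((EX.Chile, EX.part_of, EX.SouthAmerica))

# A navigational graph pattern: the path ex:capital_of/ex:part_of+ matches
# a capital_of edge followed by one or more part_of edges.
query = """
PREFIX ex: <http://example.org/>
SELECT ?city ?region WHERE {
  ?city ex:capital_of/ex:part_of+ ?region .
}
"""
for row in g.query(query):
    print(row.city, row.region)
```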
Managing the diversity and scale of knowledge graphs often involves explicit representations of schema, identity, and context. Schema defines the high-level structure and semantics of the data. A semantic schema (e.g., expressed in RDFS or OWL) defines the meaning of terms such as classes and properties, enabling reasoning. A validating schema (e.g., expressed in SHACL or ShEx) defines constraints on the graph structure, useful for ensuring data quality in parts of the graph that are expected to be complete. Emergent schema extraction methods automatically discover patterns and structures in the graph, often using quotient graphs or graph summarization techniques, aiding human understanding and system optimization.
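As a rough sketch of validation, assuming the rdflib and pyshacl Python packages and an invented example.org vocabulary, a SHACL shape can require that every city node carries a population value:

```python
from rdflib import Graph
from pyshacl import validate

# Data graph: a city node that is missing the required population edge.
data = Graph().parse(data="""
@prefix ex: <http://example.org/> .
ex:Santiago a ex:City .
""", format="turtle")

# Shapes graph: every ex:City must have at least one ex:population value.
shapes = Graph().parse(data="""
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .
ex:CityShape a sh:NodeShape ;
    sh:targetClass ex:City ;
    sh:property [ sh:path ex:population ; sh:minCount 1 ] .
""", format="turtle")

conforms, _, report = validate(data, shacl_graph=shapes)
print(conforms)  # False: the missing edge violates the shape
print(report)
```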
Identity management is crucial for linking equivalent entities within or across graphs. Persistent identifiers (like IRIs) help avoid naming clashes, particularly when integrating data from multiple sources. External identity links (like owl:sameAs) explicitly state that different identifiers refer to the same real-world entity, enabling data merging and consolidation. Datatype values (like numbers or dates) are typically represented using specific formats (e.g., XSD literals in RDF) recognized by machines for querying and manipulation. Lexicalisation involves adding human-readable labels and descriptions to nodes for better understandability. Existential nodes (like RDF blank nodes) represent entities whose existence is known but not their specific identity, useful for incomplete data or complex structures like lists.
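A minimal sketch of how such identity links might be consolidated, using invented identifiers: treating owl:sameAs as an equivalence relation, a union-find pass groups co-referent identifiers into clusters that can then be merged:

```python
from collections import defaultdict

def consolidate(same_as_links):
    """Group identifiers connected by sameAs links into co-reference clusters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in same_as_links:
        union(a, b)

    clusters = defaultdict(set)
    for x in parent:
        clusters[find(x)].add(x)
    return list(clusters.values())

# Invented identifiers for the same city in three datasets.
links = [("ex:Santiago1", "ex:Santiago2"), ("ex:Santiago2", "ex:SantiagoDeChile")]
print(consolidate(links))  # one cluster containing all three identifiers
```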
Context refers to the conditions under which a piece of knowledge holds (e.g., temporal, geographical, or provenance-related conditions). Context can be represented directly as data (e.g., dates as property values), through reification techniques (making statements about edges), through higher-arity relations (linking edges to context nodes), or through annotated graphs, where edges are associated with values from a specific domain (e.g., time intervals, fuzzy values) and queries can reason over these annotations.
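For instance, standard RDF reification can attach context to an individual edge; the sketch below, assuming the rdflib Python package and invented values, annotates a population edge with the year it refers to:

```python
from rdflib import Graph, Namespace, BNode, Literal, RDF

EX = Namespace("http://example.org/")
g = Graph()

# A statement resource stands for the edge (ex:Santiago, ex:population, 7000000),
# so that context (here, the year the invented figure refers to) can be attached.
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.Santiago))
g.add((stmt, RDF.predicate, EX.population))
g.add((stmt, RDF.object, Literal(7000000)))
g.add((stmt, EX.year, Literal(2020)))

print(g.serialize(format="turtle"))
```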
Deductive knowledge enables inferring new facts based on explicitly stated rules and logical axioms. Ontologies (like OWL) formally capture the meaning of terms used in a graph, defining semantic conditions that constrain interpretations of the graph and give rise to entailments. Reasoning engines can apply these semantics to derive new knowledge through logical deduction. Rules (like Datalog, SWRL, SPIN) provide an alternative, often more scalable, way to encode if-then style inferences. Rule application can be performed via materialization (adding entailed facts to the graph) or query rewriting (transforming queries to retrieve inferred facts from the original data). Description Logics provide a formal foundation for OWL, offering various fragments with different trade-offs between expressivity and computational complexity of reasoning tasks like satisfiability and entailment checking.
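A toy sketch of materialization by forward chaining, with two RDFS-style rules hard-coded over invented data, illustrates the idea (production reasoners are far more sophisticated):

```python
def materialise(triples):
    """Apply two rules until a fixpoint: subclass transitivity and type propagation."""
    triples = set(triples)
    while True:
        new = set()
        sub = {(s, o) for s, p, o in triples if p == "subclass_of"}
        typ = {(s, o) for s, p, o in triples if p == "type"}
        # Rule 1: (A subclass_of B), (B subclass_of C) => (A subclass_of C)
        for a, b in sub:
            for b2, c in sub:
                if b == b2:
                    new.add((a, "subclass_of", c))
        # Rule 2: (x type A), (A subclass_of B) => (x type B)
        for x, a in typ:
            for a2, b in sub:
                if a == a2:
                    new.add((x, "type", b))
        if new <= triples:      # nothing new entailed: fixpoint reached
            return triples
        triples |= new

kg = {("Santiago", "type", "City"), ("City", "subclass_of", "Place")}
print(materialise(kg))  # now also contains ("Santiago", "type", "Place")
```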
Inductive knowledge is acquired by generalizing patterns from observed data, leading to potentially fallible predictions. Graph analytics applies algorithms from graph theory and network analysis (e.g., centrality, community detection, connectivity, path finding) to extract insights from the graph topology. Frameworks like GraphX, GraphLab, and Pregel support large-scale graph analytics, often using iterative, message-passing paradigms. Knowledge graph embeddings learn low-dimensional numeric representations (vectors) of nodes and edges, typically through self-supervised training. These embeddings capture latent features and enable tasks like link prediction (predicting missing edges) and similarity computation. Translational models (TransE, TransH, TransR, etc.) interpret relations as transformations in the embedding space. Tensor decomposition models (RESCAL, DistMult, HolE, ComplEx, TuckER) represent the graph as a tensor and decompose it to find latent factors. Neural models (SME, NTN, MLP, ConvE, HypER) use neural networks to learn complex scoring functions for plausibility. Language models (RDF2Vec, KGloVe) adapt word embedding techniques such as word2vec and GloVe to graphs. Graph Neural Networks (GNNs) build neural network architectures based on the graph structure itself, learning functions to aggregate information from neighbors for tasks like node classification (type prediction). Symbolic learning techniques (rule mining, axiom mining) aim to learn interpretable logical formulae (rules, DL axioms) from the graph, often using techniques from Inductive Logic Programming or differentiable methods.
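As a rough sketch of the translational intuition, the TransE score of an edge (h, r, t) is the negated distance between h + r and t; here the vectors are random placeholders standing in for trained embeddings, and the entity and relation names are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50

# Placeholder embeddings; a real system would learn these by minimising the
# distance for observed edges while pushing negative samples apart.
entities = {e: rng.normal(size=dim) for e in ["Santiago", "Chile", "Arica"]}
relations = {r: rng.normal(size=dim) for r in ["capital_of"]}

def transe_score(h, r, t):
    # Higher (less negative) score = more plausible edge under TransE.
    return -np.linalg.norm(entities[h] + relations[r] - entities[t])

# Link prediction: rank candidate tails for (Santiago, capital_of, ?).
candidates = ["Chile", "Arica"]
ranking = sorted(candidates,
                 key=lambda t: transe_score("Santiago", "capital_of", t),
                 reverse=True)
print(ranking)  # ranking is arbitrary here because the vectors are untrained
```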
Creating and enriching knowledge graphs involves integrating data from diverse sources. Human collaboration is valuable for manual curation, schema definition, and data verification, though costly. Text sources require information extraction techniques like Named Entity Recognition, Entity Linking (disambiguating mentions to existing entities), and Relation Extraction (binary or n-ary), often with joint learning across tasks. Markup sources (like HTML) can be processed using wrapper-based methods (learning extraction patterns), specialized web table extraction techniques, or Deep Web crawling to access data behind web forms. Structured sources (relational databases, JSON, XML) can be mapped to graph models using direct mappings (automatic transformations) or custom mappings (defined with languages like R2RML for mapping relational databases to RDF), with data either materialized (stored explicitly) or virtualized (queried on demand). Existing knowledge graphs can also be used as sources, requiring graph querying and link discovery (finding links across graphs). Schema and ontology creation can follow structured methodologies (ontology engineering) or use automated techniques (ontology learning) to extract terminology and axioms from text or existing KGs.
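A toy sketch of a direct mapping, loosely following the idea rather than the W3C specification, turns each row of a table into a node and each non-key column into a labelled edge (all identifiers are invented):

```python
rows = [
    {"id": 1, "name": "Santiago", "country": "Chile"},
    {"id": 2, "name": "Arica", "country": "Chile"},
]

def direct_map(table_name, rows, key="id"):
    """Automatically transform relational rows into (subject, predicate, object) triples."""
    triples = []
    for row in rows:
        subject = f"ex:{table_name}/{row[key]}"        # row node from the primary key
        for column, value in row.items():
            if column != key:
                triples.append((subject, f"ex:{table_name}#{column}", value))
    return triples

for t in direct_map("city", rows):
    print(t)
```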
Quality assessment is essential for understanding the reliability of a knowledge graph. Key quality dimensions include accuracy (syntactic, semantic, timeliness), coverage (completeness, representativeness), and coherency (consistency with schema, validity against constraints), each with specific metrics for quantitative evaluation.
Refinement techniques aim to automatically or semi-automatically improve the quality of a knowledge graph. Completion addresses missing information through link prediction, which may be general-purpose or specialized for type links or identity links (entity matching/resolution). Correction identifies and removes incorrect information through fact validation (assigning plausibility scores using external sources or embeddings) or inconsistency repairs (resolving logical contradictions based on ontological axioms).
Publishing knowledge graphs makes them accessible for reuse. Principles like FAIR (Findable, Accessible, Interoperable, Reusable) and Linked Data (using HTTP IRIs for identification and access, providing data in standard formats, linking to related entities) provide guidelines. Access protocols range from simple dumps (downloading the whole graph) to node lookups (retrieving descriptions for individual entities), edge patterns (querying single edges), or complex graph patterns (executing complex queries on the server), each with trade-offs in bandwidth and server computation. Usage control mechanisms are important for licensing (defining terms of use with languages like ODRL), usage policies (specifying allowed processing/purposes), encryption (controlling access to sensitive data), and anonymization (protecting privacy through techniques like k-anonymity or differential privacy).
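As an illustration of a node lookup following Linked Data principles, an entity IRI can be dereferenced with content negotiation to request a machine-readable description (assuming the requests Python package; the DBpedia IRI is only one example of a server that supports this):

```python
import requests

# Dereference an entity IRI, asking for Turtle rather than an HTML page.
iri = "https://dbpedia.org/resource/Santiago"
response = requests.get(iri, headers={"Accept": "text/turtle"}, timeout=30)

print(response.status_code)
print(response.text[:500])  # the start of a Turtle description of the entity
```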
Knowledge graphs are widely used in practice in various sectors. Prominent open knowledge graphs like DBpedia, YAGO, Freebase, and Wikidata cover broad domains, often extracted from Wikipedia and other open data sources, serving as central hubs of structured knowledge and sources for downstream applications. Domain-specific open KGs exist in areas like media, government, publications, geography, life sciences, and tourism. Enterprise knowledge graphs are proprietary assets used by companies across industries (web search, commerce, social networks, finance, etc.) for applications ranging from semantic search and recommendations to business analytics, risk assessment, and automation. These often integrate diverse data sources, employ lightweight schemata, leverage machine learning and sometimes deductive reasoning, and prioritize scalability and keeping data up-to-date.
Future research directions for knowledge graphs lie at the intersection of various fields, addressing challenges such as scalability for reasoning and learning, improving data and model quality, handling diverse and dynamic data, and enhancing usability to foster broader adoption. Specific areas of interest include formal semantics for modern graph models, reasoning over contextual data, developing entailment-aware learning models, and improving the expressiveness and interpretability of graph neural networks and symbolic learning methods.
The paper concludes that knowledge graphs offer a valuable approach to managing and leveraging knowledge by combining graph-based data abstraction with techniques from various AI and data management fields, with increasing adoption in both open data communities and industry.