Hierarchical Directed Knowledge Graph
- HDKG is a graph-based structure that organizes entities into multi-level, directed hierarchies using group assertions from SPO triples.
- It leverages Jaccard Similarity and Hub Promoted Index to measure group overlap and determine near-subset relationships for scalable hierarchy construction.
- The approach facilitates unsupervised type discovery and schema augmentation, supporting exploratory analysis and downstream learning tasks.
A Hierarchical Directed Knowledge Graph (HDKG) is a graph-based data structure in which entities are organized into multi-level, directed hierarchies, with group membership and subset relationships explicitly defined based on their connections in a source knowledge graph. HDKGs are designed to automatically extract, cluster, and hierarchically arrange entities without reliance on pre-labeled data, enabling robust discovery of latent entity groupings, subtype relationships, and hierarchical schemata even in noisy or incomplete knowledge graphs. HDKGs provide a mechanism to generate, in an unsupervised fashion, new entity types and categorical structure, substantially improving both the interpretability and completeness of real-world knowledge bases.
1. Formulation and Core Principles
An HDKG is constructed from a source knowledge graph (KG) $\mathcal{K}$ composed of subject–predicate–object (SPO) triples. The key methodological reformulation is to treat each triple $(s, p, o)$ as a “group assertion” in which the subject $s$ is deemed to belong to a group determined by the predicate–object pair $(p, o)$. Each group $g_{p,o}$, therefore, is defined by its member set:
$$M(g_{p,o}) = \{\, s \mid (s, p, o) \in \mathcal{K} \,\}$$
The group label is formed by the concatenation of $p$ and $o$ (e.g., “LiveIn_Dublin”), and each subject can belong to multiple such groups. Critically, the set of all groups establishes the foundational level of the hierarchy.
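The group-assertion mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `build_groups`, the underscore separator in the label, and the toy triples are all assumptions made for the example.

```python
from collections import defaultdict

def build_groups(triples):
    """Map (subject, predicate, object) triples to groups keyed by 'predicate_object'."""
    groups = defaultdict(set)
    for s, p, o in triples:
        # The subject joins the group named by its predicate-object pair.
        groups[f"{p}_{o}"].add(s)
    return dict(groups)

# Hypothetical toy triples, for illustration only.
triples = [
    ("alice", "LiveIn", "Dublin"),
    ("bob", "LiveIn", "Dublin"),
    ("alice", "WorksAt", "UCD"),
    ("carol", "LiveIn", "Cork"),
]
groups = build_groups(triples)
print(sorted(groups["LiveIn_Dublin"]))
```

Note that `alice` ends up in two groups, which reflects the point that a subject can belong to several groups at once.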
To construct hierarchical links among these groups, pairwise group similarity is computed. Two primary similarity metrics are used:
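- Jaccard Similarity: $J(g_i, g_j) = \dfrac{|M(g_i) \cap M(g_j)|}{|M(g_i) \cup M(g_j)|}$
- Hub Promoted Index (HPI): $\mathrm{HPI}(g_i, g_j) = \dfrac{|M(g_i) \cap M(g_j)|}{\min(|M(g_i)|,\, |M(g_j)|)}$
Here, $M(g)$ denotes the set of member entities in group $g$ and $|M(g)|$ its cardinality. HPI is especially leveraged to determine near-subset relationships: if $\mathrm{HPI}(g_i, g_j) \geq \tau$ for a configurable threshold $\tau$ (default 0.9), then the smaller group $g_i$ is considered a subgroup of $g_j$.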
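Both metrics reduce to basic set operations. The sketch below uses the standard definitions (intersection size over union size for Jaccard, intersection size over the smaller set's size for HPI). The function names and the toy member sets are assumptions for illustration, and returning 0.0 for empty sets is a convention chosen here, not something the source specifies.

```python
def jaccard(a: set, b: set) -> float:
    """Intersection size divided by union size."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def hpi(a: set, b: set) -> float:
    """Intersection size divided by the size of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0

# Hypothetical member sets: the first is a strict subset of the second.
small = {"a", "b", "c"}
large = {"a", "b", "c", "d", "e", "f"}

print(jaccard(small, large))  # 0.5
print(hpi(small, large))      # 1.0
```

Because HPI normalizes by the smaller set, a subset scores 1.0 regardless of how large the containing group is, whereas Jaccard shrinks as the size gap grows. This is why HPI is the natural choice for detecting subgroup relationships.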
2. Algorithmic Workflow
The HDKG construction pipeline follows a three-stage approach:
- Stage 1: Group Generation.
  - All SPO triples are mapped to group assertions. Implementation involves parallel processing: triples are batched and processed independently, with local dictionaries of group memberships merged into a global dictionary.
  - Rare or one-to-one relationship-induced groups are filtered out using a configurable minimum group size parameter $s_{\min}$, reducing the influence of noisy or idiosyncratic facts.
- Stage 2: Similarity Computation.
  - For each group pair, both Jaccard and HPI similarities are computed. This computation is also parallelized for scalability.
  - Only pairs surpassing the threshold $\tau$ for HPI are considered for hierarchical linkage, supporting robustness against missing or noisy data.
- Stage 3: Hierarchy Construction.
  - Groups are added to the hierarchy as nodes. For each pair $(g_i, g_j)$ where $\mathrm{HPI}(g_i, g_j) \geq \tau$, a directed edge connecting $g_i$ and $g_j$ is added, indicating a near-subset (i.e., hierarchical) relationship between the smaller group and the larger one.
  - The resulting directed acyclic graph encodes hierarchy; root nodes represent broad categories, internal nodes mid-level groups, and leaves highly specific groupings.
The following table summarizes these stages:
| Stage | Operation | Scalability Measures |
|---|---|---|
| 1. Grouping | Map triples to groups; filter rare | Parallel batch processing |
| 2. Similarity | Compute Jaccard & HPI for pairs | Parallel pairwise comparison |
| 3. Hierarchy | Link groups via HPI threshold | Subgraph construction |
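The filtering and linking stages can be sketched as follows. This is a minimal, sequential illustration under stated assumptions: the names `build_hierarchy`, `min_size`, and `tau` are hypothetical, the default `min_size=2` is chosen only for the example, and orienting edges from the larger group (parent) to the smaller one (child) is an assumption consistent with root nodes being the broadest categories. Equal-sized groups are ordered by name so that the result stays acyclic.

```python
from itertools import combinations

def hpi(a, b):
    """Intersection size divided by the size of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0

def build_hierarchy(groups, min_size=2, tau=0.9):
    """Return (nodes, edges), where each edge is a (parent, child) pair."""
    # Drop groups below the minimum size.
    kept = {g: m for g, m in groups.items() if len(m) >= min_size}
    edges = []
    for a, b in combinations(sorted(kept), 2):
        if hpi(kept[a], kept[b]) >= tau:
            # Smaller group becomes the child; ties are broken by name (assumption).
            child, parent = sorted((a, b), key=lambda g: (len(kept[g]), g))
            edges.append((parent, child))
    return set(kept), edges

def roots(nodes, edges):
    """Nodes that are never a child, i.e. the broadest groups."""
    return sorted(nodes - {child for _, child in edges})

# Hypothetical groups, for illustration only.
groups = {
    "LiveIn_Europe": {"e1", "e2", "e3", "e4", "e5", "e6"},
    "LiveIn_Ireland": {"e1", "e2", "e3", "e4"},
    "LiveIn_Dublin": {"e1", "e2"},
    "BornIn_Oslo": {"e9"},
}
nodes, edges = build_hierarchy(groups)
print(roots(nodes, edges))
```

The sketch keeps every qualifying pair, so it also emits the transitive edge from `LiveIn_Europe` to `LiveIn_Dublin`; whether such edges are pruned in the original pipeline is not stated in the text.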
3. Robustness, Scalability, and Parameterization
The HDKG methodology is explicitly engineered for noisy, sparse, and large-scale KGs. Its main provisions include:
- Minimum Size Filtering ($s_{\min}$): By excluding groups with fewer than $s_{\min}$ members, the method avoids proliferation of spurious or non-generalizable groupings.
- Thresholded HPI ($\tau$): Setting $\tau < 1$ (e.g., 0.9) allows the system to recognize groups as subgroups even with incomplete overlap, thus addressing incompleteness and data corruption typical in real-world KGs.
- Full Parallelization: Both group construction and pairwise similarity calculations can be efficiently distributed across many processors or machines, making it viable for multi-million node graphs.
- No Schema or Label Dependency: The approach does not require schema or training labels, contrasting sharply with prior ontology-based or supervised methods.
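As a hedged worked example of why an HPI threshold below 1 tolerates missing facts (the numbers are illustrative and not taken from the evaluated datasets; $M(\cdot)$ for member sets and $\tau$ for the threshold are notational assumptions): suppose a group $g_a$ has 20 members that all truly belong to a broader group $g_b$, but the source KG lacks the relevant triple for 2 of them, so only 18 appear in $g_b$, which has 198 observed members in total.

$$
\mathrm{HPI}(g_a, g_b) = \frac{|M(g_a) \cap M(g_b)|}{\min(|M(g_a)|,\, |M(g_b)|)} = \frac{18}{\min(20,\, 198)} = 0.9
$$

With $\tau = 0.9$ the pair still qualifies as a near-subset, whereas an exact subset test ($M(g_a) \subseteq M(g_b)$) would reject it because of the two missing assertions.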
4. Hierarchy Characteristics and Empirical Results
Empirical validation spans diverse benchmark KGs, including WN18, WN18RR (WordNet), FB13k (Freebase subset), YAGO10, and NELL239.
Salient characteristics of the extracted hierarchies include:
- Multi-level Structure: Root nodes (broad concepts such as “LiveIn_Europe”) recursively partition into finer subgroups (e.g., “LiveIn_Ireland”, “LiveIn_Dublin”), mirroring real-world type hierarchies.
- Semantic Consistency: Qualitative analysis confirms that the induced hierarchies correspond to true semantic and ontological distinctions (e.g., academic or geographic groupings, gender-based educational clusters).
- Novelty Discovery: The process uncovers hierarchical types not represented in existing ontologies, providing fresh candidate types and suggesting structure where none was curated.
A concrete example: in the FB13k dataset, the system formed a gender-based entity hierarchy for female entities, with nested subgroups corresponding to particular women’s colleges—a latent structure not hard-coded into the KG schema.
5. Comparison to Prior Approaches
HDKGs produced by the described unsupervised pipeline differ from both schema-driven and supervised/statistical entity typing frameworks:
- Schema-Based Methods: Require predefined, often brittle, ontologies; suffer when faced with schema drift or incomplete assertion coverage; provide no mechanism for novel type induction.
- Supervised Statistical Methods: Model type prediction as a multi-label classification problem over existing type sets, necessitating labeled data that may be scarce for rare or novel types; struggle with extreme class imbalance and large graph sizes.
- HDKG (Unsupervised): Automatically proposes and organizes new types, is inherently noise-tolerant, and maintains linear to sub-quadratic runtime through judicious filtering and parallelization.
6. Implementation Considerations and Scalability
All algorithmic stages—triple-to-group mapping, group size filtering, group similarity computation, hierarchical linkage—are amenable to distributed implementation. The use of basic set operations (intersections, unions), combined with hash-based group dictionaries, allows for in-memory or out-of-core processing. Empirical datasets ranged from tens of thousands to millions of entities, supporting the claim of scalability.
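The batch-and-merge pattern for group generation might look like the sketch below. It is an illustration under assumptions, not the original implementation: the function names, the batch size, and the worker count are hypothetical. A thread pool is used only to keep the example self-contained; for CPU-bound work in CPython, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def group_batch(batch):
    """Build a local group dictionary for one batch of triples."""
    local = defaultdict(set)
    for s, p, o in batch:
        local[f"{p}_{o}"].add(s)
    return local

def merge(local_dicts):
    """Merge local dictionaries into one global dictionary via set union."""
    merged = defaultdict(set)
    for local in local_dicts:
        for label, members in local.items():
            merged[label] |= members
    return dict(merged)

def parallel_groups(triples, batch_size=2, workers=4):
    """Split triples into batches, process them independently, then merge."""
    batches = [triples[i:i + batch_size] for i in range(0, len(triples), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return merge(pool.map(group_batch, batches))

# Hypothetical toy triples, for illustration only.
triples = [
    ("alice", "LiveIn", "Dublin"),
    ("bob", "LiveIn", "Dublin"),
    ("carol", "LiveIn", "Dublin"),
    ("alice", "WorksAt", "UCD"),
    ("dave", "LiveIn", "Cork"),
]
result = parallel_groups(triples)
print(sorted(result["LiveIn_Dublin"]))
```

Because set union is commutative and associative, the merged result does not depend on the order in which batches finish, which is what makes the stage easy to distribute.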
Robustness to noise is governed by the interplay of the minimum group size $s_{\min}$ and the HPI threshold $\tau$. Lower $s_{\min}$ increases sensitivity but may introduce noisy groups; $\tau$ controls the tolerance of inexact subgroup relationships. These parameters are stable across datasets in the tested ranges, reducing hyperparameter tuning effort.
7. Applications and Further Implications
HDKGs provide immediate practical benefits:
- Type Discovery and Schema Augmentation: The induced hierarchies supply new, data-driven type assertions for entities, supplementing or correcting hand-curated schema.
- Exploratory Analysis: Researchers may traverse the multi-level groupings to investigate concept drift, semantic relatedness, or cluster purity.
- Downstream Integration: Such hierarchies can inform subsequent learning tasks, including type-aware embedding, clustering, or rule mining; and provide filtering/augmentation for semi-supervised learning.
- Public Resource: The published group hierarchies serve as empirical baselines for future research on hierarchy induction and KG organization.
The method is particularly suited for domains where labeled data is scarce, schemas are incomplete, or the type system is evolving rapidly—settings common in enterprise ontologies, open knowledge bases, and scientific data integration.
In conclusion, the Hierarchical Directed Knowledge Graph (HDKG) approach constitutes a robust, scalable, and unsupervised method for extracting multi-level entity groupings from KGs. By re-interpreting SPO triples as flexible group assertions, quantifying group overlap with set-based similarity measures, and constructing hierarchical relationships via thresholded inclusion, the method delivers semantically rich, novel, and data-driven hierarchies. This unsupervised pipeline avoids the limitations of training data requirements and schema rigidity, making it a practical alternative for entity typing and hierarchical organization of evolving knowledge graphs (Mohamed, 2019).