Dependency-based Knowledge Graphs
- Dependency-based knowledge graph construction is a family of methods that leverage explicit and latent dependency structures to derive graph nodes and edges across diverse domains.
- It employs advanced dependency parsing pipelines and heuristic post-processing to extract accurate KG triples at near state-of-the-art quality and a fraction of the computational cost.
- The approach extends to unsupervised schema induction and software vulnerability tracking, enabling robust transfer learning and scalable graph modeling.
Dependency-based knowledge graph (KG) construction comprises a class of techniques in which explicit or latent dependency structures—linguistic, knowledge, or software–library relationships—underpin the extraction or learning process for graph nodes and edges. Dependency-based approaches are central to scalable KG construction in knowledge-intensive domains, schema induction for complex documents, transfer learning from system logs or event streams, and the large-scale modeling of software package relationships and vulnerabilities.
1. Formal Foundations and Definitions
In dependency-based KG construction, the “dependency graph” is typically defined as a (possibly heterogeneous) graph in which nodes represent atomic entities (e.g., tokens, system processes, documents, software packages) and edges encode explicit dependency relations. These relations are domain-specific:
- Linguistic (syntactic) dependencies: Sentences parsed into subject–verb–object and prepositional constructs, forming semantic triples (head, relation, tail) (Min et al., 4 Jul 2025).
- Event or log dependencies: Entities (e.g., processes, files) with weighted edges indicating the strength or intensity of observed dependencies (e.g., "process reads file") (Luo et al., 2017).
- Knowledge dependencies: Documents or chunks arranged in a directed, weighted graph $G = (V, E, w)$, where nodes $V$ are chunks and weighted edges in $E$ encode prerequisite relations and conceptual priority among chunks (Sun et al., 30 May 2025).
- Software/package dependencies: Libraries, versions, and CVEs as typed nodes, with directed edges indicating semantic relationships (e.g., “depends,” “has,” “verAffects”) (Jia et al., 2022).
This diversity of dependency semantics underpins the methodology for accurate, domain-adaptive, and efficient graph construction; a minimal typed-graph representation is sketched below.
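All of these variants fit a single typed-graph abstraction: typed nodes joined by typed, weighted edges. The Python sketch below is illustrative only; the class, its field names, and the example identifiers (including the placeholder CVE ID) are assumptions rather than code from any cited system.

```python
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class TypedDependencyGraph:
    """Heterogeneous dependency graph: typed nodes, typed weighted edges."""
    node_types: dict = field(default_factory=dict)  # node_id -> node type
    edges: dict = field(default_factory=lambda: defaultdict(list))  # head -> [(tail, relation, weight)]

    def add_node(self, node_id: str, node_type: str) -> None:
        self.node_types[node_id] = node_type

    def add_edge(self, head: str, tail: str, relation: str, weight: float = 1.0) -> None:
        assert head in self.node_types and tail in self.node_types  # endpoints must be typed
        self.edges[head].append((tail, relation, weight))

# Example in the style of the software-ecosystem KG of (Jia et al., 2022).
g = TypedDependencyGraph()
g.add_node("serde", "Library")
g.add_node("serde@1.0.0", "Version")
g.add_node("CVE-XXXX-YYYY", "CVE")  # placeholder identifier
g.add_edge("serde", "serde@1.0.0", "has")
g.add_edge("CVE-XXXX-YYYY", "serde@1.0.0", "verAffects")
```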
2. Dependency Parsing and Linguistic Extraction Pipelines
Dependency parsing is a principal method for extracting relational triples from unstructured text. Modern pipelines deploy parsers such as spaCy’s Universal Dependencies framework to generate directed graphs of tokens and labeled dependency arcs. These are transformed into KG triples via pattern extraction and heuristic post-processing (Min et al., 4 Jul 2025).
Key pipeline steps (a minimal extraction sketch in Python follows the list):
- Preprocessing and chunking: Hierarchical chunking of documents, sentence segmentation, and noise reduction (e.g., dropping verbless sentences).
- Dependency parsing: Application of Universal Dependencies-style parsers, with passive-voice normalization and phrasal merging to normalize relation extraction.
- Triple extraction: For each parsed dependency tree $T$, extract all matching subject–verb–object or verb–preposition–object patterns. Confidence scores are computed from features such as token distance and dependency-label reliability.
- Heuristics: Coreference resolution and rule-based regex patterns recover relations the parser overlooks.
- Normalization/materialization: Entity filtering (e.g., minimum length), node/edge schema assignment, and population of graph databases (e.g., iGraph) and vector indices (e.g., Milvus).
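As a concrete illustration of the parsing and triple-extraction steps, the sketch below uses spaCy to pull subject–verb–object and verb–preposition–object patterns from each sentence, with a toy distance-based confidence score. The heuristics of the cited pipeline (passive-voice normalization, phrasal merging, coreference resolution) are omitted, and the scoring function is an assumption for illustration.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any UD-style English pipeline with a parser

def extract_triples(text, max_token_distance=10):
    """Extract (subject, relation, object, confidence) tuples from dependency parses."""
    triples = []
    for sent in nlp(text).sents:
        for tok in sent:
            if tok.pos_ != "VERB":
                continue
            subjects = [c for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
            # Direct objects, plus objects reached through a preposition.
            objects = [c for c in tok.children if c.dep_ in ("dobj", "obj")]
            for prep in (c for c in tok.children if c.dep_ == "prep"):
                objects += [c for c in prep.children if c.dep_ == "pobj"]
            for s in subjects:
                for o in objects:
                    # Crude confidence heuristic: penalize long-range attachments.
                    conf = max(0.0, 1.0 - abs(s.i - o.i) / max_token_distance)
                    triples.append((s.lemma_, tok.lemma_, o.lemma_, round(conf, 2)))
    return triples

print(extract_triples("The parser extracts triples from unstructured text."))
```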
Dependency-based pipelines achieve 94% of state-of-the-art LLM-extraction performance on multi-hop reasoning KGs at a fraction of the computational cost (Min et al., 4 Jul 2025). Empirical results using standard IR metrics and semantic alignment scores show viability for enterprise-scale deployment; the main limitation is missed highly contextual or implicit relations.
3. Knowledge Dependency Graphs in Unsupervised Schema Induction
The construction of domain-specific KGs from unstructured corpora with complex inter-document dependencies is addressed via the modeling of a “knowledge dependency graph” (Sun et al., 30 May 2025). The central process operates as follows:
- Document-level dependency graph formation: Given a corpus of chunks $D = \{d_1, \dots, d_n\}$, instantiate a directed, weighted graph $G = (D, E, w)$ in which an edge $(d_i, d_j) \in E$ with weight $w_{ij}$ encodes a directed prerequisite relationship among chunks.
- LLM-based dependency inference: For each directory-level set of siblings, LLMs are prompted to produce pairwise or list-based rankings of prerequisite order, yielding a Kemeny-optimal total order via
$$\sigma^* = \arg\min_{\sigma \in S_n} \sum_{k} d_{\mathrm{KT}}(\sigma, \pi_k),$$
where the $\pi_k$ are the elicited rankings and $d_{\mathrm{KT}}$ is the Kendall tau (pairwise-disagreement) distance; a brute-force implementation sketch follows this list.
- Context-aware schema induction: The global read order derived from the dependency graph is used to autoregressively generate context-enhanced chunk summaries and extract entity type candidates. Embedding-based clustering deduplicates and canonicalizes type labels, with the LLM providing concise definitions.
- Schema-guided entity/relation extraction: Entities and triples are extracted with reference to the induced schema, eliminating reliance on predefined label sets or public reference knowledge.
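Kemeny aggregation is NP-hard in general, but the sibling sets ordered here are small enough for exact search. The sketch below is a generic implementation of the Kemeny rule, not code from the cited paper; the chunk names are invented.

```python
from itertools import permutations, combinations

def kendall_tau(order_a, order_b):
    """Count item pairs that the two rankings place in opposite order."""
    pos_a = {x: i for i, x in enumerate(order_a)}
    pos_b = {x: i for i, x in enumerate(order_b)}
    return sum(
        (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
        for x, y in combinations(order_a, 2)
    )

def kemeny_optimal(rankings):
    """Exact Kemeny consensus by exhaustive search (feasible for small n)."""
    items = rankings[0]
    return min(
        permutations(items),
        key=lambda sigma: sum(kendall_tau(sigma, r) for r in rankings),
    )

# Three (possibly conflicting) LLM-elicited prerequisite orders over four chunks.
rankings = [("intro", "setup", "api", "advanced"),
            ("intro", "api", "setup", "advanced"),
            ("setup", "intro", "api", "advanced")]
print(kemeny_optimal(rankings))  # -> ('intro', 'setup', 'api', 'advanced')
```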
LKD-KGC demonstrates substantial gains in triple-level precision and recall (10–20 percentage points) over unsupervised LLM and schema-agnostic baselines in domain-specific settings (Sun et al., 30 May 2025).
4. Dependency Graph Transfer and Heterogeneous Event Streams
In system monitoring and security domains, dependency-based graph learning is concerned with the transfer of rich dependency graphs between source and target domains (Luo et al., 2017). The ACRET framework exemplifies this:
- Entity Estimation Model (EEM): Categorical entities are embedded in a shared latent space so that their distances reflect weighted sums of meta-path similarities (e.g., Process–File, File–Socket). Embeddings and meta-path weights are alternately optimized to maximize correlation between related entities; relevance is then scored and filtered via hypothesis testing.
- Dependency Construction Model (DCM): For the merged candidate entity set, edges are inferred by solving an objective of the form
$$\min_{W}\; \mathcal{L}_{\mathrm{fit}}(W; G_T) + \lambda\,\Omega_{\mathrm{smooth}}(W) + \beta\,\mathcal{D}(W, G_S),$$
enforcing graph smoothness and cross-domain consistency, where $\mathcal{L}_{\mathrm{fit}}$ measures fit to the observed target subgraph $G_T$, $\Omega_{\mathrm{smooth}}$ is a smoothness regularizer, and $\mathcal{D}$ penalizes discrepancy from the source graph $G_S$. Stochastic gradient descent is used for optimization, balancing fit to the observed target subgraph against the inter-domain discrepancy; a toy numerical sketch follows this list.
- Workflow: The process consists of meta-path selection, alternating EEM optimization, entity merging, DCM optimization, and recovery of the full target dependency graph.
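The numpy sketch below illustrates the flavor of this optimization under concrete assumed choices for each term: squared error on observed target edges, a Laplacian quadratic form for smoothness, and a Frobenius penalty toward the source graph. These specific forms, the step size, and the nonnegativity projection are illustrative assumptions, not ACRET's exact objective.

```python
import numpy as np

def dcm_grad(W, W_target, mask, L, W_source, lam=0.1, beta=0.5):
    """Gradient of a toy DCM-style objective:
    ||mask * (W - W_target)||_F^2 + lam * tr(W^T L W) + beta * ||W - W_source||_F^2
    `mask` marks observed target edges; L is a graph Laplacian (assumed symmetric).
    """
    fit = 2 * mask * (W - W_target)       # fit to the observed target subgraph
    smooth = 2 * lam * (L @ W)            # smoothness over the graph
    transfer = 2 * beta * (W - W_source)  # stay close to the source-domain graph
    return fit + smooth + transfer

rng = np.random.default_rng(0)
n = 8
W_source = rng.random((n, n))
W_target = rng.random((n, n))
mask = (rng.random((n, n)) < 0.3).astype(float)  # 30% of target edges observed
A = (W_source + W_source.T) / 2                  # similarity matrix for the Laplacian
L = np.diag(A.sum(axis=1)) - A

W = W_source.copy()                              # warm-start from the source graph
for _ in range(500):                             # plain full-batch gradient descent
    W -= 0.01 * dcm_grad(W, W_target, mask, L, W_source)
    W = np.clip(W, 0.0, None)                    # keep edge weights nonnegative
```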
ACRET accelerates dependency graph completion, recovering target graphs whose F₁ against ground truth is within 2–3% while using 4× less target data; it outperforms transfer-learning and matrix-factorization baselines by 10–20 F₁ points (Luo et al., 2017).
5. Software Ecosystem Dependency-Vulnerability KGs
Dependency-based construction extends to modeling software package dependencies and vulnerability propagation, as in the Cargo ecosystem (Jia et al., 2022). The formal KG construction is as follows:
- Graph definition: $G = (V, E)$ with explicit node types (Library, Version, CVE) and edge types ("has", "depends", "libAffects", "verAffects").
- Parsing algorithm: A custom traversal computes the full transitive closure of all dependencies or, in reverse, the pervasiveness of a CVE across reachable packages. The algorithm strictly tracks semver constraints and respects Cargo’s development/optional flags; a simplified traversal sketch follows this list.
- Propagation analysis: Metrics such as propagation depth, affected-library ratio, and affected-version ratio quantify vulnerability spread; for instance,
$$\text{affected-library ratio} = \frac{|\{\ell \in \text{Libraries} : \ell \text{ transitively affected by some CVE}\}|}{|\text{Libraries}|},$$
with the affected-version ratio defined analogously over Version nodes and propagation depth as the length of the longest affecting dependency chain.
- Empirical results: In the Cargo KG, 28.6% of libraries and 19.78% of versions are affected by propagating vulnerabilities; memory bugs dominate. Only 1.7% of latest vulnerable package versions have been yanked, and 18% of affected libraries retain vulnerabilities in the latest release, pointing to slow patch adoption and insufficient remediation (Jia et al., 2022).
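The following sketch shows the core traversal in simplified form: breadth-first search over already-resolved dependency edges, returning the transitive closure and a reach depth. The adjacency data is invented, and the traversal in the cited work additionally resolves semver constraints and dependency flags, which are omitted here.

```python
from collections import deque

# Hypothetical resolved dependency edges: version -> direct dependency versions.
deps = {
    "app@1.0.0":  ["libA@2.1.0", "libB@0.3.0"],
    "libA@2.1.0": ["libC@1.0.0"],
    "libB@0.3.0": ["libC@1.0.0"],
    "libC@1.0.0": [],
}

def transitive_deps_with_depth(root):
    """BFS over resolved dependencies: transitive closure plus reach depth."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        cur = queue.popleft()
        for nxt in deps.get(cur, []):
            if nxt not in depth:
                depth[nxt] = depth[cur] + 1
                queue.append(nxt)
    return set(depth) - {root}, max(depth.values())

closure, reach = transitive_deps_with_depth("app@1.0.0")
print(closure, reach)  # {'libA@2.1.0', 'libB@0.3.0', 'libC@1.0.0'} 2

# CVE pervasiveness is the same traversal over reversed edges: starting from an
# affected version and walking to every package that (transitively) depends on it.
```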
6. Retrieval, Embedding, and Hybrid Pipelines
Modern dependency-based KG construction is coupled to hybrid retrieval systems supporting multi-granular graph traversal and dense embedding search (Min et al., 4 Jul 2025). Key architectural elements:
- Triple-store and vector DB integration: Entities, chunks, and relations are embedded (e.g., 1,536-dim OpenAI text embeddings) and indexed separately in vector DBs such as Milvus.
- Hybrid retrieval: At query time, seed entities are extracted using both linguistic analysis (noun-phrase extraction) and vector search, driving 1-hop graph DB traversal to sample contextually relevant relations/chunks.
- Rank fusion: Reciprocal Rank Fusion (RRF) aggregates the rankings from vector search and graph traversal,
$$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)},$$
where $R$ is the set of rankers and $k$ a smoothing constant (commonly 60), increasing retrieval precision and recall (see the sketch after this list).
- Empirical comparison: Dependency-parsing-based KG construction retains 94% of LLM pipeline performance (Semantic Alignment: 61.87% vs. 65.83%) at an order of magnitude lower cost, while hybrid retrieval approaches outperform pure vector baselines by up to 15 percentage points (Min et al., 4 Jul 2025).
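RRF is straightforward to implement; the sketch below fuses hypothetical result lists from the two retrieval channels described above (the chunk identifiers are invented).

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over ranked lists, each ordered best-first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["chunk_12", "chunk_7", "chunk_3"]  # dense-embedding search
graph_hits = ["chunk_7", "chunk_21", "chunk_12"]  # 1-hop KG traversal
print(rrf_fuse([vector_hits, graph_hits]))
# -> ['chunk_7', 'chunk_12', 'chunk_21', 'chunk_3']; items ranked by both channels win
```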
7. Limitations, Guidelines, and Future Research
Dependency-based KG construction methods are domain-adaptable and computationally efficient, but they present several practical limitations:
- Limitations: Contextual, implicit, or multi-hop dependencies may not be fully captured in first-order dependency parsers or local graph traversals (Min et al., 4 Jul 2025). Very deep transitive chains, especially in software ecosystems, may necessitate truncation (Jia et al., 2022).
- Parameterization: Regularization parameters (e.g., $\lambda$ for smoothness), mixing weights (e.g., $\beta$ for transfer reliance), and embedding dimensionalities must be tuned via small-scale validation (Luo et al., 2017).
- Ecosystem-adaptation: Domain transfer requires matching entity types or adapting version-constraint engines to the target software system (Jia et al., 2022).
Ongoing research aims to integrate multi-hop or learned traversal algorithms, subgraph caching, and generalization evaluations on diverse corpora (e.g., HotpotQA). Recent advances in context-aware, dependency-driven schema induction via LLMs further suggest robust, high-coverage KG construction is feasible without manual schema design, especially when reading order is dynamically optimized for inter-document dependencies (Sun et al., 30 May 2025).