
LKD-KGC: Unsupervised Domain KG Extraction

Updated 2 February 2026
  • LKD-KGC framework is an unsupervised methodology that constructs domain-specific knowledge graphs by leveraging LLM-inferred inter-document dependencies.
  • It integrates a three-stage process—dependency evaluation, schema definition, and triple extraction—to autonomously extract entities and relations without predefined ontologies.
  • Empirical evaluations show that LKD-KGC achieves 10–20% improvements in precision and recall over existing methods on various technical corpora.

The LKD-KGC framework is an unsupervised methodology for constructing domain-specific knowledge graphs (KGs) by leveraging LLMs to infer and utilize knowledge dependencies across a document repository. Distinct from prior schema-guided and reference-based approaches, LKD-KGC autonomously analyzes corpus structure, prioritizes processing order, autogenerates entity schemas from hierarchical inter-document context, and extracts entities and relations without requiring predefined ontologies or external KGs. This sequence of modules enables high-fidelity KG extraction tailored to specialized domains lacking existing schema or public knowledge resources, and achieves measurable improvements in precision and recall over contemporaneous baselines (Sun et al., 30 May 2025).

1. Motivation and Distinctions from Prior Work

Schema-guided KGC techniques such as CoT-Ontology, RAP, AdaKGC, and EDC presuppose the availability of a predefined schema or focus exclusively on intra-document entity-relation patterns. In technical or proprietary corpora, such as internal system documentation, these assumptions break down: schemas are unavailable and cross-document conceptual dependencies are essential for accurate KG modeling. Reference-based methods like AutoKG, SAC-KG, and KBTE, which leverage external KGs (e.g., DBpedia, Wikipedia taxonomies) to augment LLM prompts, similarly fail in restricted domains where reference resources do not exist or are inappropriate, and can result in factual hallucination or irrelevant entity extraction.

A further limitation of both paradigms is their neglect of the hierarchical progression of technical corpora—from foundational overviews to advanced topics and case studies—which encodes essential knowledge dependencies. LKD-KGC explicitly models such hierarchical and dependency structures to enable context-aware and semantically coherent schema induction and triple extraction. This allows for construction of KGs reflective of the deep interconnectivity and specificity encountered in real-world domain corpora.

2. High-Level Architecture and Workflow

LKD-KGC consists of three major modules executed sequentially:

  1. Dependency Evaluation: Analyzes the document repository (represented as a directory tree or flat collection) to infer a knowledge dependency graph. Documents are modeled as nodes, and a directed edge $(d_i \rightarrow d_j)$ indicates that understanding $d_i$ is foundational for $d_j$, with edge weights $D(i,j)$ determined by LLM-derived probabilities.
  2. Schema Definition: Establishes entity types via an autoregressive process. The procedure begins by extracting candidate entity types from individual documents and their context-enriched summaries. All types are clustered using K-means (with $k$ chosen by silhouette score), and type definitions are consolidated by the LLM into a coherent schema $\mathcal{S}$.
  3. Triple Extraction: Extracts entities and relations in an unsupervised manner, guided solely by the derived schema $\mathcal{S}$. Entities are drawn from document text using schema-driven prompts, while relations (triples) are formed by examining candidate entity pairs for meaningful relationships, conditioned on the schema.

The schematic workflow is as follows:

| Step | Input | Output / Operation |
|---|---|---|
| Dependency Evaluation | Document repository $D$ | Ordered docs, context-enhanced summaries |
| Schema Definition | Ordered docs, summaries | Entity type candidates, clusters, cleaned schema $\mathcal{S}$ |
| Triple Extraction | $\mathcal{S}$, documents | Sets of extracted entities and triples |
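The data flow between the three stages can be sketched as a simple composition. In the sketch below, the three callables stand in for the LLM-backed modules; their names and signatures are illustrative, not the paper's actual interface:

```python
def build_kg(repo, evaluate_dependencies, define_schema, extract_from):
    """Compose the three LKD-KGC stages over a document repository.

    The three callables are stand-ins for the LLM-backed modules; this
    sketch only fixes the data flow between them.
    """
    ordered_docs, summaries = evaluate_dependencies(repo)  # stage 1
    schema = define_schema(ordered_docs, summaries)        # stage 2
    triples = []
    for doc in ordered_docs:                               # stage 3
        triples.extend(extract_from(schema, doc))
    return schema, triples
```

Note that stage 3 iterates documents in the order produced by stage 1, reflecting the dependency-aware processing sequence described above.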

3. Knowledge Dependency Inference

The dependency evaluation module formalizes inter-document dependencies as a directed weighted graph $G = (V, E)$, where $V = D$ (the set of documents) and $(d_i \rightarrow d_j) \in E$ if $d_i$ provides foundational knowledge for $d_j$. Dependency scores are computed as:

$$D(i,j) = f_{\text{dep}}(d_i, d_j) \approx p(\text{order}(i \prec j) \mid \text{Summ}_i, \text{Summ}_j)$$

where $f_{\text{dep}}$ is estimated by the LLM via prompt completion.

The evaluation follows a two-phase traversal:

Bottom-Up Summarization: A post-order traversal produces summary representations $\text{Summ}_d$ for each document, with parent directories aggregating their children's summaries for higher-level abstraction.

Top-Down Prioritization: At each directory node, sibling summaries are ranked by the LLM to determine optimal reading/processing sequence, using the prompt:

Given two document summaries A and B, decide which should be read first to build correct understanding. Output “A” or “B” and a confidence score in [0,1].

Leaf nodes are then augmented with the top-$k$ most similar prior summaries (retrieved via embedding-based retrieval, $k=10$) to generate context-enhanced summaries $\text{Summ}'_d$.
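The top-$k$ augmentation step reduces to nearest-neighbor lookup over summary embeddings. A minimal sketch, assuming the embedding vectors have already been computed (the paper uses paraphrase-multilingual-MiniLM-L12-v2 for this):

```python
import numpy as np

def top_k_similar(query_vec, prior_vecs, k=10):
    """Return indices of the k prior summaries most similar to the
    query summary, ranked by cosine similarity of embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    P = prior_vecs / np.linalg.norm(prior_vecs, axis=1, keepdims=True)
    sims = P @ q  # cosine similarity of each prior summary to the query
    return np.argsort(-sims)[:k].tolist()
```

The retrieved summaries are then concatenated into the prompt that produces the context-enhanced summary for the leaf document.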

The ordering determined in this module guides schema induction and subsequently structures the triple extraction phase, ensuring that knowledge buildup mimics the latent pedagogical order of the corpus.
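Given the pairwise scores $D(i,j)$, a global processing order can be recovered by thresholding them into directed edges and topologically sorting. A minimal sketch, where the `dep_score` callable is a hypothetical stand-in for the LLM judgment:

```python
from collections import defaultdict, deque

def order_documents(docs, dep_score, threshold=0.5):
    """Topologically order documents from pairwise dependency scores.

    dep_score(i, j) approximates p(i should be read before j), i.e. the
    D(i, j) weight; edges above the threshold are kept.
    """
    indeg = {d: 0 for d in docs}
    succ = defaultdict(list)
    for i in docs:
        for j in docs:
            if i != j and dep_score(i, j) > threshold:
                succ[i].append(j)
                indeg[j] += 1
    queue = deque(sorted(d for d in docs if indeg[d] == 0))
    order = []
    while queue:  # Kahn's algorithm
        d = queue.popleft()
        order.append(d)
        for n in succ[d]:
            indeg[n] -= 1
            if indeg[n] == 0:
                queue.append(n)
    # Cycles (inconsistent pairwise judgments) fall back to input order.
    order += [d for d in docs if d not in set(order)]
    return order
```

Inconsistent LLM judgments can produce cycles, so a fallback (here, input order for the cyclic remainder) is needed in practice.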

4. Autoregressive Entity Schema Generation

Schema generation is modeled as an autoregressive process over the context-enriched summary sequence $C = \{\text{Summ}'_{d_1}, \ldots, \text{Summ}'_{d_N}\}$. Formally, the schema $\mathcal{S}$ is generated as a sequence of tokens $s_1, \ldots, s_T$ with joint probability:

$$p(\mathcal{S} \mid C) = \prod_{t=1}^T p(s_t \mid s_{<t}, C)$$

For each document $d$, candidate entity types $E_d$ are extracted using LLM prompts conditioned on $d.\text{text}$ and $\text{Summ}'_d$. All types from the corpus are clustered by K-means (embedding-based, with $k$ chosen to maximize the mean silhouette coefficient) to merge redundancies. Each cluster's members are consolidated and defined by the LLM, using retrieved contextual summaries to resolve ambiguity and maintain semantic granularity.

This schema $\mathcal{S} = \{(\text{Type}_m, \text{Def}_m)\}_{m=1}^M$ serves as the sole structural prior for the KG extraction step.
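The clustering step maps directly onto standard tooling. A sketch using scikit-learn, selecting $k$ by mean silhouette coefficient as described above (the embedding of the type candidates is assumed already done):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_entity_types(embeddings, k_max=10):
    """Cluster candidate entity-type embeddings with K-means, choosing
    the number of clusters k that maximizes the mean silhouette score."""
    best_k, best_score, best_labels = None, -1.0, None
    # silhouette_score requires 2 <= k <= n_samples - 1
    for k in range(2, min(k_max, len(embeddings) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```

Each resulting cluster is then passed to the LLM for consolidation into a single named, defined entity type.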

5. Schema-Guided Unsupervised Extraction of Entities and Relations

With the schema $\mathcal{S}$ established, entities are extracted from each document by prompting the LLM:

Given schema $\mathcal{S}$, which valid entities of each type appear in this text? List them.

A candidate entity $e$ of type $T$ from document $d$ is retained if the LLM assigns:

$$\text{ValidEntity}(e, T; d) = 1 \quad \text{if} \quad p(e \text{ has type } T \mid d.\text{text}, \text{Def}_T) > \tau_e$$

where $\tau_e$ is a threshold (0.5 by default).

Relation extraction proceeds by enumerating pairs of extracted entities and prompting for meaningful relationships:

Among these entities, which pairs share a meaningful relationship? Provide triples (subject, predicate, object).

The LLM emits a set of triples $R_d = \{(e_i, r, e_j) \mid p(r \mid \text{mention\_context}(e_i, e_j), \mathcal{S}) > \tau_r\}$, thresholded analogously.
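The pairwise enumeration reduces to scoring ordered entity pairs and filtering against $\tau_r$. A minimal sketch, where `relation_prob` is a hypothetical stand-in for the schema-conditioned LLM call:

```python
from itertools import permutations

def extract_triples(entities, relation_prob, tau_r=0.5):
    """Enumerate ordered entity pairs and keep (subject, predicate,
    object) triples whose relation probability exceeds tau_r."""
    triples = []
    for e_i, e_j in permutations(entities, 2):
        predicate, p = relation_prob(e_i, e_j)  # LLM call in the paper
        if predicate is not None and p > tau_r:
            triples.append((e_i, predicate, e_j))
    return triples
```

Enumerating all ordered pairs is quadratic in the number of extracted entities per document, which contributes to the LLM-invocation cost discussed in Section 7.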

6. Implementation Details and Empirical Evaluation

The primary LLMs used are Meta-Llama-3.1-70B-Instruct and DeepSeek-R1-Distill-Qwen-14B, with embedding retrieval performed via paraphrase-multilingual-MiniLM-L12-v2. Essential hyperparameters include context size $k=10$ for top-$k$ retrieval, LLM temperature 0.1 for stable completions, and entity/relation thresholds $\tau_e = 0.5$, $\tau_r = 0.5$. Clustering for schema definition employs K-means with $k$ set by silhouette maximization.
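For reference, the reported settings can be gathered into a single configuration object; the values are those stated above, while the field names themselves are illustrative:

```python
from dataclasses import dataclass

@dataclass
class LKDKGCConfig:
    """Hyperparameters reported for the LKD-KGC experiments."""
    llm: str = "Meta-Llama-3.1-70B-Instruct"
    embedder: str = "paraphrase-multilingual-MiniLM-L12-v2"
    top_k: int = 10           # context summaries retrieved per leaf document
    temperature: float = 0.1  # low temperature for stable completions
    tau_e: float = 0.5        # entity validity threshold
    tau_r: float = 0.5        # relation validity threshold
```
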

Evaluation is conducted on:

  • Prometheus documentation (62 public API/tutorial pages)
  • Re-DocRED (15 Windows-centric documents)
  • IMS internal documentation (46 technical specs)

Baselines include EDC, AutoKG, and KBTE. Key metrics are precision of extracted triples (via LLM-as-judge), recall (total true triples per document), and F1 score on Re-DocRED using LLM equivalence judgments. Representative results (Meta-Llama-3.1-70B-Instruct):

| Dataset | Metric | EDC | AutoKG | KBTE | LKD-KGC |
|---|---|---|---|---|---|
| Prometheus | Precision (%) | 65.4 | 65.7 | 75.6 | 83.4 |
| | Total triples | 3,784 | 2,678 | 2,216 | 4,561 |
| | Avg. triples/doc | 61.0 | 43.2 | 35.7 | 73.6 |
| Re-DocRED | Precision (%) | 17.7 | 22.8 | 32.3 | 36.3 |
| | Recall (%) | 14.6 | 20.9 | 18.8 | 25.1 |
| | F1 (%) | 16.0 | 20.9 | 23.8 | 29.7 |
| IMS (Private) | Precision (%) | 73.2 | 56.7 | 80.4 | 89.6 |
| | Total triples | 1,852 | 2,296 | 1,637 | 2,672 |
| | Avg. triples/doc | 40.3 | 49.9 | 35.6 | 58.1 |

LKD-KGC demonstrates consistent gains of 10–20% in both precision and recall compared to baselines.

7. Strengths, Limitations, and Contextual Significance

Notable advantages of LKD-KGC include its autonomy (removal of manual schema and external KG dependencies), capacity to model global dependency structures for improved context integration, and applicability to both public and private domains.

Key limitations are the computational cost due to multiple LLM invocations (summarization, ranking, extraction), possible context truncation or omission in vector retrieval, need for domain-sensitive hyperparameter tuning, and susceptibility to LLM hallucinations during both summarization and triple extraction, particularly in large repositories.

This approach suggests a trajectory where knowledge graph construction becomes increasingly autonomous, adaptable, and domain-specialized, provided advances in efficient LLM deployment and context management continue (Sun et al., 30 May 2025).
