LKD-KGC: Unsupervised Domain KG Extraction
- The LKD-KGC framework is an unsupervised methodology that constructs domain-specific knowledge graphs by leveraging LLM-inferred inter-document dependencies.
- It integrates a three-stage process—dependency evaluation, schema definition, and triple extraction—to autonomously extract entities and relations without predefined ontologies.
- Empirical evaluations show that LKD-KGC achieves 10–20% improvements in precision and recall over existing methods on various technical corpora.
The LKD-KGC framework is an unsupervised methodology for constructing domain-specific knowledge graphs (KGs) by leveraging LLMs to infer and utilize knowledge dependencies across a document repository. Distinct from prior schema-guided and reference-based approaches, LKD-KGC autonomously analyzes corpus structure, prioritizes processing order, autogenerates entity schemas from hierarchical inter-document context, and extracts entities and relations without requiring predefined ontologies or external KGs. This sequence of modules enables high-fidelity KG extraction tailored to specialized domains lacking existing schema or public knowledge resources, and achieves measurable improvements in precision and recall over contemporaneous baselines (Sun et al., 30 May 2025).
1. Motivation and Distinctions from Prior Work
Schema-guided KGC techniques such as CoT-Ontology, RAP, AdaKGC, and EDC presuppose the availability of a predefined schema or focus exclusively on intra-document entity-relation patterns. In technical or proprietary corpora, such as internal system documentation, these assumptions break down: schemas are unavailable and cross-document conceptual dependencies are essential for accurate KG modeling. Reference-based methods like AutoKG, SAC-KG, and KBTE, which leverage external KGs (e.g., DBpedia, Wikipedia taxonomies) to augment LLM prompts, similarly fail in restricted domains where reference resources do not exist or are inappropriate, and can result in factual hallucination or irrelevant entity extraction.
A further limitation of both paradigms is their neglect of the hierarchical progression of technical corpora—from foundational overviews to advanced topics and case studies—which encodes essential knowledge dependencies. LKD-KGC explicitly models such hierarchical and dependency structures to enable context-aware and semantically coherent schema induction and triple extraction. This allows for construction of KGs reflective of the deep interconnectivity and specificity encountered in real-world domain corpora.
2. High-Level Architecture and Workflow
LKD-KGC consists of three major modules executed sequentially:
- Dependency Evaluation: Analyzes the document repository (represented as a directory tree or flat collection) to infer a knowledge dependency graph. Documents are modeled as nodes, and a directed edge $(d_i, d_j)$ indicates that understanding $d_i$ is foundational for understanding $d_j$, with edge weights determined by LLM-derived probabilities.
- Schema Definition: Establishes entity types via an autoregressive process. The procedure begins by extracting candidate entity types from individual documents and their context-enriched summaries. All types are clustered using K-means (with the number of clusters $K$ chosen by silhouette score), and type definitions are consolidated by the LLM into a coherent schema $S$.
- Triple Extraction: Extracts entities and relations in an unsupervised manner, guided solely by the derived schema $S$. Entities are drawn from document text using schema-driven prompts, while relations (triples) are formed by examining candidate entity pairs for meaningful relationships, conditioned on the schema.
The schematic workflow is as follows:
| Step | Input | Output / Operation |
|---|---|---|
| Dependency Evaluation | Document repository | Ordered docs, context-enhanced summaries |
| Schema Definition | Ordered docs, summaries | Entity type candidates, clusters, cleaned schema $S$ |
| Triple Extraction | Schema $S$, documents | Sets of extracted entities and triples |
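Read end to end, the pipeline reduces to a short driver over the per-module helpers sketched in Sections 3–5. All names here are hypothetical illustrations, not the authors' code:

```python
def lkd_kgc(documents, llm_helpers):
    """End-to-end sketch: dependency ordering -> schema -> triples.

    `documents` maps ids to raw text; `llm_helpers` bundles the hypothetical
    prompt wrappers sketched in the following sections.
    """
    summaries = {doc_id: llm_helpers.summarize(text)
                 for doc_id, text in documents.items()}
    ordered_ids = llm_helpers.prioritize(summaries)             # Section 3
    schema = llm_helpers.induce_schema(ordered_ids, summaries)  # Section 4
    triples = []
    for doc_id in ordered_ids:                                  # Section 5
        entities = llm_helpers.extract_entities(documents[doc_id], schema)
        triples += llm_helpers.extract_relations(entities, documents[doc_id], schema)
    return schema, triples
```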
3. Knowledge Dependency Inference
The dependency evaluation module formalizes inter-document dependencies as a directed weighted graph $G = (V, E, w)$, where $V = \{d_1, \dots, d_n\}$ is the set of documents and $(d_i, d_j) \in E$ if $d_i$ provides foundational knowledge for $d_j$. Dependency scores are computed as:

$$w(d_i, d_j) = P_{\mathrm{LLM}}(d_i \prec d_j),$$

where $P_{\mathrm{LLM}}(d_i \prec d_j)$, the probability that $d_i$ should be understood before $d_j$, is inferred from the LLM via a comparison prompt.
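A minimal sketch of this scoring step, assuming a hypothetical `llm_dependency_prob` helper that wraps the comparison prompt and returns a probability in [0, 1]:

```python
from itertools import permutations

def build_dependency_graph(summaries, llm_dependency_prob):
    """Infer a weighted dependency graph over documents.

    `summaries` maps document ids to their summaries; `llm_dependency_prob(a, b)`
    is a hypothetical helper that prompts the LLM and returns the probability
    that document a should be understood before document b.
    """
    edges = {}
    for i, j in permutations(summaries, 2):
        p = llm_dependency_prob(summaries[i], summaries[j])
        if p > 0.5:  # keep only the direction the LLM favors
            edges[(i, j)] = p
    return edges
```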
The evaluation follows a two-phase traversal:
- Bottom-Up Summarization: A post-order traversal produces summary representations $s_i$ for each document, with parent directories aggregating their children's summaries for higher-level abstraction.
- Top-Down Prioritization: At each directory node, sibling summaries are ranked by the LLM to determine the optimal reading/processing sequence, using the prompt:
Given two document summaries A and B, decide which should be read first to build correct understanding. Output “A” or “B” and a confidence score in [0,1].
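One way to realize this ranking, sketched with a hypothetical `llm_read_first` helper that wraps the prompt above:

```python
from functools import cmp_to_key

def order_siblings(summaries, llm_read_first):
    """Order sibling summaries via pairwise LLM judgments.

    `llm_read_first(a, b)` is a hypothetical helper returning ("A" or "B",
    confidence). Pairwise LLM preferences may be intransitive, so the
    resulting order is a best-effort linearization.
    """
    def read_first_cmp(a, b):
        choice, _confidence = llm_read_first(a, b)
        return -1 if choice == "A" else 1

    return sorted(summaries, key=cmp_to_key(read_first_cmp))
```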
Leaf nodes are then augmented with the top-$k$ most similar prior summaries (retrieved via embedding-based similarity) to generate context-enhanced summaries $\tilde{s}_i$.
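A sketch of this retrieval, using the sentence-transformers model named in the implementation details below; the value of $k$ here is illustrative, not the paper's setting:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def topk_prior_summaries(current_summary, prior_summaries, k=5):
    """Return the k prior summaries most similar to the current one."""
    if not prior_summaries:
        return []
    # Normalized embeddings make the dot product a cosine similarity.
    embeddings = encoder.encode([current_summary] + prior_summaries,
                                normalize_embeddings=True)
    similarities = embeddings[1:] @ embeddings[0]
    top = np.argsort(-similarities)[:k]
    return [prior_summaries[i] for i in top]
```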
The ordering determined in this module guides schema induction and subsequently structures the triple extraction phase, ensuring that knowledge buildup mimics the latent pedagogical order of the corpus.
4. Autoregressive Entity Schema Generation
Schema generation is modeled as an autoregressive process over the context-enriched summary sequence $\tilde{s}_1, \dots, \tilde{s}_n$. Formally, the schema $S$ is generated as a sequence of tokens $y_1, \dots, y_T$ with joint probability:

$$P(S \mid \tilde{s}_1, \dots, \tilde{s}_n) = \prod_{t=1}^{T} P\left(y_t \mid y_{<t},\, \tilde{s}_1, \dots, \tilde{s}_n\right).$$
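At the level of documents rather than tokens, the autoregressive accumulation can be sketched as follows, with `llm_propose_types` a hypothetical helper that proposes new entity types given a summary and the types gathered so far:

```python
def collect_candidate_types(context_summaries, llm_propose_types):
    """Accumulate candidate entity types document by document.

    Each call conditions on the types proposed so far, mirroring the
    autoregressive formulation above.
    """
    candidate_types = []
    for summary in context_summaries:
        candidate_types.extend(llm_propose_types(summary, candidate_types))
    return candidate_types
```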
For each document $d_i$, candidate entity types are extracted using LLM prompts conditioned on $d_i$ and $\tilde{s}_i$. All types from the corpus are clustered by K-means (embedding-based, with the number of clusters $K$ determined by maximizing the mean silhouette coefficient) to merge redundancies. Each cluster's members are consolidated and defined by the LLM, using retrieved contextual summaries to resolve ambiguity and maintain semantic granularity.
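The cluster-count selection can be sketched with scikit-learn; the search range for $K$ below is an assumption, as the paper's range is not restated here:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_entity_types(type_embeddings, k_range=range(2, 20)):
    """Cluster entity-type embeddings, choosing K by mean silhouette score."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:
        if k >= len(type_embeddings):
            break
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(type_embeddings)
        score = silhouette_score(type_embeddings, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```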
This schema serves as the sole structural prior for the KG extraction step.
5. Schema-Guided Unsupervised Extraction of Entities and Relations
With the schema established, entities are extracted from each document by prompting the LLM:
Given schema $S$, which valid entities of each type appear in this text? List them.
A candidate entity $e$ of type $t \in S$ from document $d$ is retained if the LLM assigns:

$$P_{\mathrm{LLM}}(e \text{ is a valid instance of } t \mid d, S) \geq \theta,$$

where $\theta$ is a confidence threshold ($0.5$ by default).
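A minimal sketch of this filtering, assuming a hypothetical `llm_classify` helper that issues the schema-driven prompt and returns typed candidates with confidences:

```python
def extract_entities(document, schema, llm_classify, theta=0.5):
    """Keep candidates whose LLM-assigned type confidence meets the threshold.

    `llm_classify(document, schema)` is a hypothetical helper that issues the
    schema-driven prompt above and yields (entity, entity_type, confidence).
    """
    return [(entity, entity_type)
            for entity, entity_type, confidence in llm_classify(document, schema)
            if confidence >= theta]
```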
Relation extraction proceeds by enumerating pairs of extracted entities and prompting for meaningful relationships:
Among these entities, which pairs share a meaningful relationship? Provide triples (subject, predicate, object).
The LLM emits a set of triples $(e_s, r, e_o)$, thresholded analogously.
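A corresponding sketch, with `llm_relate` a hypothetical helper; the relation threshold here simply mirrors the entity default, as the paper's exact value is not restated:

```python
from itertools import combinations

def extract_triples(entities, document, schema, llm_relate, theta=0.5):
    """Enumerate entity pairs and keep confidently related triples.

    `llm_relate(e1, e2, document, schema)` is a hypothetical helper that asks
    the LLM for a predicate linking e1 and e2 (or None) plus a confidence.
    """
    triples = []
    for e1, e2 in combinations(entities, 2):
        predicate, confidence = llm_relate(e1, e2, document, schema)
        if predicate is not None and confidence >= theta:
            triples.append((e1, predicate, e2))
    return triples
```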
6. Implementation Details and Empirical Evaluation
The primary LLMs used are Meta-Llama-3.1-70B-Instruct and DeepSeek-R1-Distill-Qwen-14B, with embedding retrieval performed via paraphrase-multilingual-MiniLM-L12-v2. Essential hyperparameters include the context size $k$ for top-$k$ retrieval, an LLM temperature of $0.1$ for stable completions, and the entity/relation confidence thresholds ($\theta = 0.5$ by default for entities, as noted above). Clustering for schema definition employs K-means with the number of clusters $K$ set by silhouette maximization.
Evaluation is conducted on:
- Prometheus documentation (62 public API/tutorial pages)
- Re-DocRED (15 Windows-centric documents)
- IMS internal documentation (46 technical specs)
Baselines include EDC, AutoKG, and KBTE. Key metrics are precision of extracted triples (judged via LLM-as-judge), extraction volume (total and per-document counts of true triples, serving as a recall proxy where gold annotations are unavailable), and precision/recall/F1 on Re-DocRED using LLM equivalence judgments. Representative results (Meta-Llama-3.1-70B-Instruct):
| Dataset | Metric | EDC | AutoKG | KBTE | LKD-KGC |
|---|---|---|---|---|---|
| Prometheus | Precision (%) | 65.4 | 65.7 | 75.6 | 83.4 |
| | Total triples | 3,784 | 2,678 | 2,216 | 4,561 |
| | Avg. triples/doc | 61.0 | 43.2 | 35.7 | 73.6 |
| Re-DocRED | Precision (%) | 17.7 | 22.8 | 32.3 | 36.3 |
| | Recall (%) | 14.6 | 20.9 | 18.8 | 25.1 |
| | F1 (%) | 16.0 | 20.9 | 23.8 | 29.7 |
| IMS (Private) | Precision (%) | 73.2 | 56.7 | 80.4 | 89.6 |
| | Total triples | 1,852 | 2,296 | 1,637 | 2,672 |
| | Avg. triples/doc | 40.3 | 49.9 | 35.6 | 58.1 |
LKD-KGC demonstrates consistent gains of 10–20% in both precision and recall compared to baselines.
7. Strengths, Limitations, and Contextual Significance
Notable advantages of LKD-KGC include its autonomy (removal of manual schema and external KG dependencies), capacity to model global dependency structures for improved context integration, and applicability to both public and private domains.
Key limitations are the computational cost of multiple LLM invocations (summarization, ranking, extraction), possible context truncation or omission in vector retrieval, the need for domain-sensitive hyperparameter tuning, and susceptibility to LLM hallucination during both summarization and triple extraction, particularly in large repositories.
This approach suggests a trajectory where knowledge graph construction becomes increasingly autonomous, adaptable, and domain-specialized, provided advances in efficient LLM deployment and context management continue (Sun et al., 30 May 2025).