
KGGen: Open-Source KG Extraction Tool

Updated 7 December 2025
  • KGGen is an open-source toolkit that uses LLM-based extraction and iterative clustering to create dense, high-quality knowledge graphs from unstructured text.
  • It employs a three-stage process—generation, aggregation, and clustering—to merge semantically equivalent entities and relations efficiently.
  • On the MINE benchmark, the tool outperforms traditional extraction methods and supports downstream tasks like KG embedding and retrieval-augmented generation.

KGGen is an open-source Python toolkit designed to automate the extraction of high-quality knowledge graphs (KGs) from unstructured text using LLMs and an iterative clustering approach. It aims to address the data scarcity and low density issues characteristic of conventional KGs and extraction pipelines, providing dense, well-connected graphs for downstream graph-based AI applications. KGGen integrates novel clustering techniques to collapse semantically equivalent entities and relations, outputting KGs suitable for tasks such as KG embedding and retrieval-augmented generation (RAG). The tool is complemented by the MINE benchmark, a metric for evaluating text-to-KG extractors at the level of factual coverage in graph outputs (Mo et al., 14 Feb 2025).

1. Motivation and Context

The development of large-scale, high-quality KGs is hindered by the scarcity of comprehensive, machine-readable graph data. Existing KGs such as Wikidata, DBpedia, and YAGO are predominantly constructed through manual annotation or rule-based extraction, resulting in limited coverage, domain restriction, and susceptibility to noise when automated. These limitations are particularly detrimental in domains with emerging or specialized content, where annotated KGs do not exist.

Downstream AI systems, including graph embedding models (e.g., TransE) and graph-based RAG architectures (e.g., GraphRAG), require dense, highly interconnected KGs for effective representation learning, link prediction, and reliable contextualization in generative tasks. Sparse or incomplete graphs degrade model performance and induce hallucinations. KGGen was engineered to facilitate large-scale, arbitrary text-to-KG conversion; its core objectives include democratizing KG generation, leveraging LMs for extraction, and performing dense clustering to ameliorate graph sparsity.

2. Architecture and Workflow

KGGen’s architecture is modular, comprising three principal stages: generation, aggregation, and clustering.

2.1 Entity and Relation Extraction (“generate”)

The extraction stage ingests raw text and utilizes an LLM (e.g., GPT-4o via DSPy) for a two-pass process involving:

  • Entity extraction: The model lists all significant entities—nouns, verbs, adjectives—in JSON format with instruction-driven prompting.
  • Relation extraction: The model, given the enumerated entities and original text, outputs triples (subject, predicate, object) in JSON, constrained to brief (1–3 word) predicates.

This bifurcated approach ensures explicit enumeration and accurate mapping of relations, minimizing omission and misalignment.
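The two-pass flow can be sketched as follows. This is a minimal illustration, not the KGGen implementation: `llm` is a hypothetical stand-in for the DSPy-backed model call, taking a prompt string and returning the raw completion.

```python
import json

def extract_graph(text, llm):
    """Two-pass extraction sketch: `llm` is any callable mapping a
    prompt string to the model's text completion (a hypothetical
    stand-in for a DSPy-backed GPT-4o call)."""
    # Pass 1: enumerate significant entities as a JSON array.
    entities = json.loads(llm(
        "List all significant entities (nouns, verbs, adjectives) "
        f"in the following text as a JSON array of strings:\n{text}"))
    # Pass 2: given the enumerated entities and the original text,
    # extract [subject, predicate, object] triples with short predicates.
    triples = json.loads(llm(
        f"Given the entities {json.dumps(entities)}, extract "
        "[subject, predicate, object] triples with 1-3 word predicates "
        f"as a JSON array, from:\n{text}"))
    return entities, [tuple(t) for t in triples]

# Example with a canned stub in place of a real model:
def stub_llm(prompt):
    if prompt.startswith("List"):
        return '["KGGen", "knowledge graphs"]'
    return '[["KGGen", "extracts", "knowledge graphs"]]'

entities, triples = extract_graph("KGGen extracts knowledge graphs.", stub_llm)
```

Passing the explicit entity list into the second prompt is what anchors relation extraction to the enumerated entities, reducing misalignment.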

2.2 Aggregation (“aggregate”)

Aggregation collates triples across multiple documents, normalizes casing, and merges duplicate entities and exact-duplicate relations, yielding a raw KG $G_0 = (V, E)$ with node set $V$ and edge set $E$. This step is deterministic and does not invoke the LLM.
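Because aggregation is purely deterministic, it reduces to normalization and exact-duplicate merging. A minimal sketch (not the KGGen source):

```python
def aggregate(per_document_triples):
    """Deterministic aggregation sketch (no LLM calls): lowercase
    normalization plus exact-duplicate merging across documents."""
    nodes, edges = set(), []
    seen = set()
    for triples in per_document_triples:
        for s, p, o in triples:
            key = (s.lower(), p.lower(), o.lower())
            if key in seen:          # exact duplicate after normalization
                continue
            seen.add(key)
            edges.append(key)
            nodes.update({key[0], key[2]})
    return nodes, edges              # raw KG G_0 = (V, E)

V, E = aggregate([
    [("NYC", "is in", "USA")],
    [("nyc", "Is In", "usa"), ("USA", "has city", "NYC")],
])
# The case variant ("nyc", "Is In", "usa") is merged away.
```

Note that at this stage "NYC" and "New York City" would still be distinct nodes; resolving such semantic equivalence is deferred to the clustering stage.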

2.3 Entity and Relation Clustering (“cluster”)

Clustering is KGGen’s central innovation: an iterative, LLM-guided loop identifies and merges semantically equivalent entities and relation types. This process collapses variant forms (“labors” vs. “labor”), synonyms, acronyms (“NYC” vs. “New York City”), and paraphrases, distilling redundant nodes and predicates into dense clusters. Clustering applies to both entities and predicates using a pseudocode-guided loop and repeated LLM validation/proposal calls, producing a low-redundancy graph appropriate for modern embedding and retrieval models.
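The shape of that iterative loop can be sketched as below. This is an illustrative reconstruction, not KGGen's pseudocode: `same(a, b)` stands in for the repeated LLM validation/proposal call, and the toy predicate replaces it with case folding, crude plural stripping, and a small acronym table.

```python
def cluster(items, same, max_rounds=3):
    """Iterative clustering sketch: each round merges items into the
    first cluster whose representative the `same` predicate accepts,
    until no round produces a merge."""
    clusters = [[x] for x in items]
    for _ in range(max_rounds):
        merged, changed = [], False
        for c in clusters:
            for m in merged:
                if same(m[0], c[0]):   # compare cluster representatives
                    m.extend(c)
                    changed = True
                    break
            else:
                merged.append(c)
        clusters = merged
        if not changed:
            break
    return clusters

# Toy stand-in for the LLM equivalence judgment:
ACRONYMS = {"nyc": "new york city"}
def toy_same(a, b):
    def norm(x):
        x = ACRONYMS.get(x.strip().lower(), x.strip().lower())
        return x.rstrip("s")           # crude plural stripping
    return norm(a) == norm(b)

groups = cluster(["NYC", "New York City", "labor", "labors"], toy_same)
```

In KGGen itself the equivalence judgment comes from the LLM, which is what lets the loop collapse paraphrases and domain synonyms that no string normalization would catch.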

3. Package Installation and API Usage

KGGen is distributed via PyPI and GitHub.

3.1 Installation

pip install kg-gen

3.2 Basic API Usage

from kg_gen import KGGen

kg_tool = KGGen(
    model="gpt-4o",             # DSPy-supported LM
    temperature=0.0,            # deterministic decoding
    max_tokens=1024
)

document = """Your unstructured text here..."""
kg_tool.generate(document)      # Extraction
kg_tool.aggregate()             # Merges duplicates
kg_tool.cluster()               # Clusters entities & edges

triples = kg_tool.triples
clusters = kg_tool.entity_clusters

3.3 Visualization

import networkx as nx
import matplotlib.pyplot as plt

G = nx.DiGraph()
for s, p, o in triples:
    G.add_edge(s, o, label=p)

pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, node_size=500, font_size=8)
edge_labels = nx.get_edge_attributes(G, 'label')
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_size=6)
plt.show()

4. MINE Benchmark: Evaluation of Extraction Fidelity

MINE (Measure of Information in Nodes and Edges) is introduced to quantify the factual coverage of automatically extracted KGs.

4.1 Metric Definition

For $N = 100$ articles with 15 ground-truth facts $\{f_{i1}, \dots, f_{i15}\}$ per article:

  • Encode facts and KG nodes via sentence transformer embeddings.
  • Retrieve the top-$k$ most similar nodes by cosine similarity.
  • Expand to all nodes within two hops.
  • Employ an LLM judge to determine if a fact is inferable from the subgraph.

The MINE score is computed as:

$$\text{MINE} = \frac{1}{15N} \sum_{i=1}^{N} \sum_{j=1}^{15} m_{ij} \times 100\%$$

where $m_{ij} = 1$ if fact $f_{ij}$ is captured and $0$ otherwise.
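Given the judge's binary decisions, the score itself is a simple average. A minimal sketch (the variable names here are illustrative, not from the benchmark code):

```python
def mine_score(judgements):
    """MINE score sketch: `judgements` is an N x 15 matrix of 0/1
    values, where entry m_ij records whether the LLM judge deemed
    fact f_ij inferable from the retrieved two-hop subgraph."""
    n = len(judgements)
    hits = sum(sum(row) for row in judgements)
    return hits / (15 * n) * 100   # percentage of facts captured

# Two toy articles with 12/15 and 9/15 facts recovered:
m = [[1] * 12 + [0] * 3,
     [1] * 9 + [0] * 6]
score = mine_score(m)              # (12 + 9) / 30 * 100 = 70.0
```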

4.2 Benchmark Protocol and Results

Extractor   Average MINE (%)
OpenIE      29.84
GraphRAG    47.80
KGGen       66.07

KGGen surpasses GraphRAG and OpenIE by 18.3 and 36.2 percentage points, respectively, highlighting its enhanced capacity for salient fact extraction and reduction of graph sparsity.

5. Practical Considerations and Extensions

Practical usage of KGGen involves several domain- and LM-configuration-specific parameters:

  • Extraction prompts should be concise; few-shot examples are beneficial for specialized terminology.
  • Clustering instructions can be tailored for domain-specific synonymy (e.g., chemical or medical terms).
  • Deterministic LM decoding (low temperature, top_p $\approx 1.0$) is recommended for consistency.
  • Partition input corpora to <1,000 tokens per chunk for long-form documents.
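The chunking recommendation in the last point can be sketched as follows. This is an illustrative helper, not part of the KGGen API; it approximates the <1,000-token guideline by counting whitespace-separated words, whereas a real pipeline would use the model's own tokenizer.

```python
def chunk_words(text, max_units=1000):
    """Partition a long document into chunks of at most `max_units`
    whitespace-separated words (a rough proxy for tokens)."""
    words = text.split()
    return [" ".join(words[i:i + max_units])
            for i in range(0, len(words), max_units)]

chunks = chunk_words("one two three four five", max_units=2)
```

Each chunk is then fed through generate/aggregate independently, with clustering reconciling entities across chunks.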

Foreseeable extensions include:

  • Incorporation of embedding-similarity thresholds (e.g., cosine similarity > 0.9) alongside LM clustering proposals to mitigate erroneous merges.
  • Expansion of the MINE benchmark for larger corpora, such as books or multi-document datasets.
  • Reduction of LLM invocation costs via specialized, distilled models for extraction and clustering.
  • Integration of human-in-the-loop verification for ambiguous clusters to further enhance precision.
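The first of these extensions, an embedding-similarity guard on LM merge proposals, can be sketched as below. This is a hypothetical design, not an existing KGGen feature: a merge proposed by the LM is accepted only if the entities' embeddings also clear a cosine-similarity threshold.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def guarded_merge(emb_a, emb_b, llm_agrees, threshold=0.9):
    """Accept an LLM-proposed merge only when embedding similarity
    also clears the threshold, filtering spurious proposals."""
    return llm_agrees and cosine(emb_a, emb_b) > threshold

ok = guarded_merge([1.0, 0.0], [0.98, 0.05], llm_agrees=True)     # near-parallel
blocked = guarded_merge([1.0, 0.0], [0.0, 1.0], llm_agrees=True)  # orthogonal
```

Such a guard trades a little recall (some true paraphrase pairs have low embedding similarity) for higher merge precision.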

A plausible implication is that KGGen’s LM-driven clustering methodology, when coupled with active learning or domain adaptation, may further accelerate KG construction for highly specialized scientific domains.

6. Impact and Limitations

KGGen provides an accessible, robust platform for the on-demand construction of dense KGs from plain text, directly addressing traditional pitfalls of manual curation, rule-based extraction, and graph sparsity. The tool’s superior benchmark performance validates its efficacy as a source of graph data for foundation models and RAG systems. Limitations include potential over- or under-clustering by LLMs, computational inefficiency at scale, and benchmark focus on mid-sized articles. Future research may investigate hybrid methods integrating embedding-based decision rules and human curation to optimize KG accuracy and scaling efficiency (Mo et al., 14 Feb 2025).
