- The paper introduces RAKG, a framework integrating LLMs and retrieval augmentation to construct accurate, document-level knowledge graphs.
- It addresses the schema rigidity of traditional KGC methods and the context length limits and hallucinations of LLM-based extraction through retrieval augmentation, LLM-based entity disambiguation, and an LLM-as-judge filtering step.
- The approach enhances RAG systems and automated knowledge base population by leveraging effective text chunking, vectorization, and context retrieval strategies.
This paper introduces RAKG (Document-level Retrieval Augmented Knowledge Graph Construction), a framework designed to automatically build knowledge graphs (KGs) from individual documents (RAKG: Document-level Retrieval Augmented Knowledge Graph Construction, 14 Apr 2025). It aims to overcome limitations of traditional KGC methods (such as complex entity disambiguation and schema rigidity) and LLM-based approaches (such as context length limits and hallucinations), particularly for applications that enhance RAG systems, like GraphRAG.
Core Problem Addressed:
Existing methods struggle to create comprehensive and accurate KGs from documents. Traditional methods are inflexible, while LLMs face issues with long contexts (forgetting information) and can generate inaccurate facts (hallucinations). This impacts the quality of KGs used downstream, for example, in RAG systems.
RAKG Framework Overview:
RAKG proposes an end-to-end pipeline that uses LLMs and incorporates retrieval augmentation within the KGC process itself. The key idea is to identify preliminary entities ("pre-entities") first and then use these as queries to retrieve relevant context from the document and potentially an existing KG before generating the final relationships.
Implementation Steps:
- Document Processing:
- Chunking: The input document D is split into semantically coherent chunks (e.g., sentences) T. This avoids arbitrary fixed-length splits and provides manageable units for the LLM.

```
T = DocSplit(D)  // Split D into text chunks text_i
```
- Vectorization: Each text chunk text_i is vectorized using an embedding model (e.g., BGE-M3) to create a vector store V_T. An optional initial KG (KG') can also be vectorized (V_kg).

```python
# Example using a sentence transformer library
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-m3')
text_chunks = split_document_into_sentences(document_text)
V_T = model.encode(text_chunks)

# If using an initial KG:
# initial_kg_nodes = get_node_names_and_types(initial_kg)
# V_kg = model.encode(initial_kg_nodes)
```
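The helper `split_document_into_sentences` above is a placeholder. A minimal stand-in, assuming plain sentence-boundary splitting is acceptable (the paper's chunking aims for semantic coherence, so a smarter splitter may be preferable):

```python
import re

def split_document_into_sentences(document_text: str) -> list[str]:
    # Naive splitter on ., ! and ? boundaries; a stand-in for the paper's
    # semantically coherent chunking.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document_text) if s.strip()]
```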
- Pre-Entity Construction:
- NER per Chunk: An LLM processes each text_i to perform Named Entity Recognition (NER), identifying "pre-entities". The LLM also assigns a type and description to each pre-entity. A chunk-id links the pre-entity to its source chunk.
- Vectorization: Pre-entities (e.g., combining name and type) are vectorized into V_Pre_entity. A minimal sketch of both steps follows.
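A minimal sketch of pre-entity construction, assuming a generic `call_llm(prompt) -> str` chat helper (the paper uses Qwen2.5-72B); the prompt and JSON format here are illustrative, not the paper's exact prompt:

```python
import json

# Illustrative prompt; the paper's actual NER prompt differs.
NER_PROMPT = (
    "Extract the named entities from the text below. Return a JSON list of "
    'objects with keys "name", "type", and "description".\n\nText: {chunk}'
)

def extract_pre_entities(text_chunks, call_llm):
    pre_entities = []
    for chunk_id, chunk in enumerate(text_chunks):
        response = call_llm(NER_PROMPT.format(chunk=chunk))
        for ent in json.loads(response):
            ent["chunk_ids"] = [chunk_id]  # link each pre-entity to its source chunk
            pre_entities.append(ent)
    return pre_entities

# Vectorize pre-entities (name + type) for disambiguation and retrieval,
# reusing the embedding model defined earlier:
# pre_entities = extract_pre_entities(text_chunks, call_llm)
# V_Pre_entity = model.encode([f"{e['name']} ({e['type']})" for e in pre_entities])
```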
- Entity Disambiguation:
- Vector Similarity: Potential duplicate entities are identified by comparing vectors in V_Pre_entity (cosine similarity above a threshold).
- LLM Refinement: The LLM reviews pairs of potentially similar entities to make a final judgment (SameJudge) on whether they refer to the same real-world entity.
- Merging: Identical entities are merged, combining their associated chunk-ids. A sketch of this loop follows.
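A sketch of the disambiguation loop, assuming the `pre_entities` and `V_Pre_entity` structures from the previous sketch; the 0.85 threshold and the yes/no prompt are illustrative stand-ins for the paper's SameJudge step:

```python
import numpy as np

def disambiguate(pre_entities, V_Pre_entity, call_llm, threshold=0.85):
    # Normalize vectors so dot products equal cosine similarities
    V = V_Pre_entity / np.linalg.norm(V_Pre_entity, axis=1, keepdims=True)
    sims = V @ V.T
    merged, assigned = [], set()
    for i, ent in enumerate(pre_entities):
        if i in assigned:
            continue
        group = [ent]
        for j in range(i + 1, len(pre_entities)):
            if j in assigned or sims[i, j] < threshold:
                continue
            # SameJudge: ask the LLM whether the two candidates are the same entity
            verdict = call_llm(
                f"Do '{ent['name']}' ({ent['description']}) and "
                f"'{pre_entities[j]['name']}' ({pre_entities[j]['description']}) "
                "refer to the same real-world entity? Answer yes or no."
            )
            if verdict.strip().lower().startswith("yes"):
                group.append(pre_entities[j])
                assigned.add(j)
        # Merge duplicates: keep one entity, union their chunk-ids
        ent["chunk_ids"] = sorted({cid for e in group for cid in e["chunk_ids"]})
        merged.append(ent)
    return merged
```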
- Relationship Network Construction (Core RAG Step): This is done for each unique, disambiguated entity e.
- Corpus Retrospective Retrieval: Retrieve relevant context for e from the original document chunks T. This involves:
- Using the chunk-ids directly associated with e.
- Performing vector similarity search in V_T using e's vector to find other relevant chunks.
```python
# Pseudocode for corpus retrospective retrieval
def retrieve_context(entity, text_chunks, vector_index_T, model, k=5):
    # Start with the chunks the entity was extracted from
    context_chunks = [text_chunks[i] for i in entity["chunk_ids"]]
    # Add the top-k chunks most similar to the entity's embedding
    entity_vector = model.encode(entity["name"] + " " + entity["type"])
    similar_chunk_indices = vector_index_T.search(entity_vector, k=k)  # any vector index over V_T
    for idx in similar_chunk_indices:
        if text_chunks[idx] not in context_chunks:
            context_chunks.append(text_chunks[idx])
    return context_chunks
```
- Graph Structure Retrieval (Optional): If an initial KG (KG') exists, retrieve similar entities and their existing relationships from KG' using vector search on V_kg.
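A sketch of this optional retrieval, assuming `kg_initial` is a networkx-style graph over KG' nodes and `vector_index_kg` is a generic nearest-neighbour index over V_kg (both are assumptions, not the paper's exact data structures):

```python
def retrieve_kg_context(entity, kg_initial, kg_node_names, vector_index_kg, model, k=3):
    # kg_initial: e.g. a networkx graph of KG'; vector_index_kg: any index over
    # V_kg that returns node indices for a query vector.
    entity_vector = model.encode(entity["name"] + " " + entity["type"])
    related = []
    for idx in vector_index_kg.search(entity_vector, k=k):
        node = kg_node_names[idx]
        # Return each similar node together with its existing relationships
        related.append((node, list(kg_initial.edges(node, data=True))))
    return related
```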
- Relationship Generation: Feed the entity e, the retrieved text chunks, and (if applicable) the retrieved graph structures into an LLM prompted to generate relationship triples (e, relation, target_entity), as in the sketch below.
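A sketch of triple generation from the retrieved context; the prompt and JSON schema are illustrative, and `call_llm` is the same placeholder chat helper as above:

```python
import json

# Illustrative prompt, not the paper's exact wording.
REL_PROMPT = (
    'Given the entity "{name}" and the context below, list factual relationships '
    'as a JSON list of objects with keys "source", "relation", and "target".\n\n'
    "Context:\n{context}"
)

def generate_triples(entity, context_chunks, call_llm):
    prompt = REL_PROMPT.format(name=entity["name"], context="\n".join(context_chunks))
    return json.loads(call_llm(prompt))  # triples centred on the entity e
```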
- LLM as Judge (Filtering): Crucially, use an LLM to evaluate the generated triples (and entities). The LLM checks if the generated information is supported by the retrieved text context (and potentially the initial KG context). Triples or entities deemed low-fidelity (hallucinated) are filtered out. This directly applies RAG evaluation principles to improve KG quality. (See Figure 4 in the paper).
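A correspondingly simple sketch of the judging step: each candidate triple is checked against the retrieved evidence and dropped if unsupported (the paper's judge is richer than this yes/no check):

```python
def judge_triples(triples, context_chunks, call_llm):
    context = "\n".join(context_chunks)
    kept = []
    for t in triples:
        verdict = call_llm(
            f"Context:\n{context}\n\nIs the statement "
            f"'{t['source']} {t['relation']} {t['target']}' supported by the context? "
            "Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            kept.append(t)  # keep only triples grounded in the retrieved evidence
    return kept
```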
- Knowledge Graph Fusion:
- Merge the newly generated KG (KG) with the initial KG (KG'), performing entity disambiguation between the two graphs and integrating the relationship sets (see the sketch below).
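A sketch of the fusion step, assuming both graphs are held in networkx and that `same_entity` wraps the vector-similarity plus SameJudge check from the disambiguation step:

```python
import networkx as nx

def fuse_graphs(kg_new, kg_initial, same_entity):
    # Map nodes of the new KG onto matching nodes of the initial KG
    mapping = {}
    for n_new in kg_new.nodes:
        for n_old in kg_initial.nodes:
            if same_entity(n_new, n_old):  # cosine similarity + LLM SameJudge
                mapping[n_new] = n_old
                break
    kg_new = nx.relabel_nodes(kg_new, mapping)
    # compose() merges shared nodes and unions the relationship sets
    return nx.compose(kg_initial, kg_new)
```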
Practical Implications and Applications:
- Improved KG Quality for RAG: The primary application is creating better KGs to enhance domain-specific RAG systems. By grounding the KG construction in retrieved evidence and filtering hallucinations, RAKG produces KGs that are more faithful to the source documents.
- Automated Knowledge Base Population: Can be used to automatically extract structured knowledge from large document repositories (e.g., research papers, internal company documents, news articles).
- Overcoming LLM Limitations: The retrieval step helps LLMs focus on relevant context, mitigating issues with long documents where information might be missed or forgotten by the LLM processing the whole text at once.
- Reduced Hallucinations: The "LLM as Judge" step provides a practical mechanism to improve the factual accuracy of the generated KG.
Implementation Considerations:
- LLM Choice: Performance heavily depends on the LLM chosen for NER, disambiguation, relation generation, and judging (the paper uses Qwen2.5-72B); capable models are required.
- Embedding Model: The quality of the embedding model (e.g., BGE-M3) is critical for effective retrieval.
- Computational Cost: Multiple LLM calls (NER per chunk, disambiguation checks, relation generation per entity, judging per entity/relation) make it computationally intensive and potentially slow/expensive.
- Scalability: Requires efficient vector indexing and retrieval systems for large datasets. Parallel processing of chunks/entities might be necessary.
- Threshold Tuning: Similarity thresholds for vector retrieval and confidence scores for LLM judgments need careful tuning.
- Evaluation: While the paper uses metrics like Entity Coverage (EC) and Relation Network Similarity (RNS) against a "standard KG," creating such standard KGs for real-world data can be challenging and subjective. The MINE dataset's accuracy metric (QA-based) is a more practical evaluation approach.
- Schema Flexibility: The framework appears schema-flexible, relying on the LLM to define entities and relations. This offers adaptability but might require post-processing for schema alignment if a rigid target schema is needed.
Code:
The authors provide code, which is essential for practical adoption: https://github.com/LMMApplication/RAKG
Summary:
RAKG presents a practical, LLM-driven framework for document-level KGC that cleverly incorporates retrieval augmentation during construction to improve context handling and an LLM-based judging step to enhance factual accuracy. It achieves state-of-the-art results on the MINE benchmark by generating KGs that better capture document semantics, are denser, and show higher fidelity compared to baselines like KGGen and GraphRAG. For practitioners building KGs from text, RAKG offers a promising approach, particularly when the goal is to create high-fidelity KGs for downstream RAG applications, provided the computational resources are available.