
RAKG:Document-level Retrieval Augmented Knowledge Graph Construction (2504.09823v1)

Published 14 Apr 2025 in cs.IR

Abstract: With the rise of knowledge graph based retrieval-augmented generation (RAG) techniques such as GraphRAG and Pike-RAG, the role of knowledge graphs in enhancing the reasoning capabilities of LLMs has become increasingly prominent. However, traditional Knowledge Graph Construction (KGC) methods face challenges like complex entity disambiguation, rigid schema definition, and insufficient cross-document knowledge integration. This paper focuses on the task of automatic document-level knowledge graph construction. It proposes the Document-level Retrieval Augmented Knowledge Graph Construction (RAKG) framework. RAKG extracts pre-entities from text chunks and utilizes these pre-entities as queries for RAG, effectively addressing the issue of long-context forgetting in LLMs and reducing the complexity of Coreference Resolution. In contrast to conventional KGC methods, RAKG more effectively captures global information and the interconnections among disparate nodes, thereby enhancing the overall performance of the model. Additionally, we transfer the RAG evaluation framework to the KGC field and filter and evaluate the generated knowledge graphs, thereby avoiding incorrectly generated entities and relationships caused by hallucinations in LLMs. We further developed the MINE dataset by constructing standard knowledge graphs for each article and experimentally validated the performance of RAKG. The results show that RAKG achieves an accuracy of 95.91 % on the MINE dataset, a 6.2 % point improvement over the current best baseline, GraphRAG (89.71 %). The code is available at https://github.com/LMMApplication/RAKG.

Summary

  • The paper introduces RAKG, a framework integrating LLMs and retrieval augmentation to construct accurate, document-level knowledge graphs.
  • It addresses traditional KGC limitations by overcoming schema rigidity, context length constraints, and hallucinations through LLM-based entity disambiguation and judging.
  • The approach enhances RAG systems and automated knowledge base population by leveraging effective text chunking, vectorization, and context retrieval strategies.

This paper introduces RAKG (Document-level Retrieval Augmented Knowledge Graph Construction), a framework designed to automatically build knowledge graphs (KGs) from individual documents (arXiv:2504.09823, 14 Apr 2025). It aims to overcome limitations of traditional KGC methods (such as complex entity disambiguation and schema rigidity) and LLM-based approaches (such as context length limits and hallucinations), particularly for applications that enhance RAG systems like GraphRAG.

Core Problem Addressed:

Existing methods struggle to create comprehensive and accurate KGs from documents. Traditional methods are inflexible, while LLMs face issues with long contexts (forgetting information) and can generate inaccurate facts (hallucinations). This impacts the quality of KGs used downstream, for example, in RAG systems.

RAKG Framework Overview:

RAKG proposes an end-to-end pipeline that uses LLMs and incorporates retrieval augmentation within the KGC process itself. The key idea is to identify preliminary entities ("pre-entities") first and then use these as queries to retrieve relevant context from the document and potentially an existing KG before generating the final relationships.

Implementation Steps:

  1. Document Processing:
    • Chunking: The input document D is split into semantically coherent chunks (e.g., sentences) T. This avoids arbitrary fixed-length splits and provides manageable units for the LLM.
      T = DocSplit(D)  // split D into text chunks text_i
    • Vectorization: Each text chunk text_i is vectorized using an embedding model (e.g., BGE-M3) to create a vector store V_T. An optional initial KG (KG') can also be vectorized (V_kg).
      # Example using a sentence-transformers library (BGE-M3 embeddings)
      import re
      from sentence_transformers import SentenceTransformer

      def split_document_into_sentences(text):
          # naive sentence splitter; any semantic chunker can be swapped in
          return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

      model = SentenceTransformer('BAAI/bge-m3')
      text_chunks = split_document_into_sentences(document_text)
      V_T = model.encode(text_chunks)  # one embedding vector per chunk
      # If using an initial KG:
      # initial_kg_nodes = get_node_names_and_types(initial_kg)
      # V_kg = model.encode(initial_kg_nodes)
  2. Pre-Entity Construction:
    • NER per Chunk: An LLM processes each text_i to perform Named Entity Recognition (NER), identifying "pre-entities". The LLM also assigns a type and description to each pre-entity. A chunk-id links the pre-entity to its source chunk.
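      A minimal sketch of the per-chunk NER call, assuming an OpenAI-compatible client and JSON output; the prompt, model name, and response format are illustrative (the paper uses Qwen2.5-72B):

      import json
      from openai import OpenAI  # any OpenAI-compatible client; an assumption, not the paper's code

      client = OpenAI()

      def extract_pre_entities(chunk_id, chunk_text):
          prompt = ("Extract all named entities from the text below. Return a JSON list "
                    'of objects with keys "name", "type", and "description".\n\n' + chunk_text)
          resp = client.chat.completions.create(
              model="qwen2.5-72b-instruct",  # illustrative model name
              messages=[{"role": "user", "content": prompt}],
          )
          pre_entities = json.loads(resp.choices[0].message.content)  # assumes the LLM returns valid JSON
          for e in pre_entities:
              e["chunk_ids"] = {chunk_id}  # link each pre-entity to its source chunk
          return pre_entities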
    • Vectorization: Pre-entities (e.g., combining name and type) are vectorized into V_Pre_entity.
    • Entity Disambiguation:
      • Vector Similarity: Potential duplicate entities are identified by comparing vectors in V_Pre_entity (cosine similarity > threshold).
      • LLM Refinement: The LLM reviews pairs of potentially similar entities to make a final judgment (SameJudge) on whether they refer to the same real-world entity.
      • Merging: Identical entities are merged, combining their associated chunk-ids.
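      A minimal sketch of the disambiguation pass over V_Pre_entity; the 0.85 threshold and the same_judge call are assumptions standing in for the paper's SameJudge step:

      import numpy as np

      def disambiguate(pre_entities, vectors, threshold=0.85):  # threshold value is an assumption
          # normalize rows so that dot products below are cosine similarities
          v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
          merged, absorbed = [], set()
          for i, e in enumerate(pre_entities):
              if i in absorbed:
                  continue
              for j in range(i + 1, len(pre_entities)):
                  if j in absorbed or v[i] @ v[j] < threshold:
                      continue
                  if same_judge(e, pre_entities[j]):  # hypothetical LLM yes/no judgment
                      e["chunk_ids"] |= pre_entities[j]["chunk_ids"]  # merge source chunk-ids
                      absorbed.add(j)
              merged.append(e)
          return merged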
  3. Relationship Network Construction (Core RAG Step): This is done for each unique, disambiguated entity e.
    • Corpus Retrospective Retrieval: Retrieve relevant context for e from the original document chunks T. This involves:
      • Using the chunk-ids directly associated with e.
      • Performing vector similarity search in V_T using e's vector to find other relevant chunks.
        # Pseudocode for corpus retrospective retrieval
        def retrieve_context(entity, text_chunks, chunk_ids, vector_index_T, model, k=5):
            # start from the chunks the entity was extracted from
            context_chunks = [text_chunks[i] for i in chunk_ids[entity]]
            # then add the top-k chunks most similar to the entity itself
            entity_vector = model.encode(entity.name + " " + entity.type)
            similar_chunk_indices = vector_index_T.search(entity_vector, k=k)
            for idx in similar_chunk_indices:
                if text_chunks[idx] not in context_chunks:
                    context_chunks.append(text_chunks[idx])
            return context_chunks
    • Graph Structure Retrieval (Optional): If an initial KG (KG') exists, retrieve similar entities and their existing relationships from KG' using vector search on V_kg.
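      A minimal sketch of this optional step, assuming V_kg from step 1 behind a vector index and a hypothetical get_relations accessor on the initial KG:

      # Pseudocode for graph structure retrieval
      def retrieve_graph_context(entity, kg_nodes, vector_index_kg, model, k=5):
          # find the initial-KG nodes most similar to the entity
          entity_vector = model.encode(entity.name + " " + entity.type)
          similar_node_indices = vector_index_kg.search(entity_vector, k=k)
          # return those nodes' existing relationships as extra context
          return [rel for idx in similar_node_indices
                      for rel in get_relations(kg_nodes[idx])]  # get_relations: hypothetical accessor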
    • Relationship Generation: Feed the entity e, the retrieved text chunks, and (if applicable) the retrieved graph structures into an LLM prompted to generate relationship triples (e, relation, target_entity).
    • LLM as Judge (Filtering): Crucially, use an LLM to evaluate the generated triples (and entities). The LLM checks if the generated information is supported by the retrieved text context (and potentially the initial KG context). Triples or entities deemed low-fidelity (hallucinated) are filtered out. This directly applies RAG evaluation principles to improve KG quality. (See Figure 4 in the paper).
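      A minimal sketch of generation plus judging, reusing the client from the NER example above; the prompts and the yes/no verdict format are illustrative, not the paper's exact prompts:

      def generate_and_filter_triples(entity, context_chunks):
          context = "\n".join(context_chunks)
          prompt = (f"Given the entity '{entity['name']}' and the context below, list relationship "
                    'triples as JSON objects with keys "source", "relation", "target".\n\n' + context)
          resp = client.chat.completions.create(
              model="qwen2.5-72b-instruct",  # illustrative model name
              messages=[{"role": "user", "content": prompt}],
          )
          triples = json.loads(resp.choices[0].message.content)  # assumes valid JSON output
          # LLM-as-judge: keep only triples the retrieved context actually supports
          return [t for t in triples if judged_supported(t, context)]

      def judged_supported(triple, context):
          verdict = client.chat.completions.create(
              model="qwen2.5-72b-instruct",
              messages=[{"role": "user", "content":
                         f"Does the context support the triple {triple}? Answer yes or no.\n\n{context}"}],
          )
          return verdict.choices[0].message.content.strip().lower().startswith("yes")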
  4. Knowledge Graph Fusion:
    • Merge the newly generated graph KG with the initial graph KG', performing entity disambiguation between the two graphs and integrating their relationship sets.
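      A minimal sketch of this fusion step, reusing the disambiguate helper from step 2; the KnowledgeGraph container and its fields are hypothetical:

      def fuse(new_kg, initial_kg, model):
          if initial_kg is None:
              return new_kg
          # disambiguate entities across the two graphs, then integrate relationship sets
          all_entities = new_kg.entities + initial_kg.entities
          vectors = model.encode([e["name"] + " " + e["type"] for e in all_entities])
          return KnowledgeGraph(entities=disambiguate(all_entities, vectors),  # hypothetical container
                                relations=new_kg.relations + initial_kg.relations)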

Practical Implications and Applications:

  • Improved KG Quality for RAG: The primary application is creating better KGs to enhance domain-specific RAG systems. By grounding the KG construction in retrieved evidence and filtering hallucinations, RAKG produces KGs that are more faithful to the source documents.
  • Automated Knowledge Base Population: Can be used to automatically extract structured knowledge from large document repositories (e.g., research papers, internal company documents, news articles).
  • Overcoming LLM Limitations: The retrieval step helps LLMs focus on relevant context, mitigating issues with long documents where information might be missed or forgotten by the LLM processing the whole text at once.
  • Reduced Hallucinations: The "LLM as Judge" step provides a practical mechanism to improve the factual accuracy of the generated KG.

Implementation Considerations:

  • LLM Choice: Performance heavily depends on the chosen LLM (e.g., Qwen2.5-72B used in the paper) for NER, disambiguation, relation generation, and judging. Requires powerful models.
  • Embedding Model: The quality of the embedding model (e.g., BGE-M3) is critical for effective retrieval.
  • Computational Cost: Multiple LLM calls (NER per chunk, disambiguation checks, relation generation per entity, judging per entity/relation) make it computationally intensive and potentially slow/expensive.
  • Scalability: Requires efficient vector indexing and retrieval systems for large datasets. Parallel processing of chunks/entities might be necessary.
  • Threshold Tuning: Similarity thresholds for vector retrieval and confidence scores for LLM judgments need careful tuning.
  • Evaluation: While the paper uses metrics like Entity Coverage (EC) and Relation Network Similarity (RNS) against a "standard KG," creating such standard KGs for real-world data can be challenging and subjective. The MINE dataset's accuracy metric (QA-based) is a more practical evaluation approach.
  • Schema Flexibility: The framework appears schema-flexible, relying on the LLM to define entities and relations. This offers adaptability but might require post-processing for schema alignment if a rigid target schema is needed.

Code:

The authors provide code, which is essential for practical adoption: https://github.com/LMMApplication/RAKG

Summary:

RAKG presents a practical, LLM-driven framework for document-level KGC that cleverly incorporates retrieval augmentation during construction to improve context handling and an LLM-based judging step to enhance factual accuracy. It achieves state-of-the-art results on the MINE benchmark by generating KGs that better capture document semantics, are denser, and show higher fidelity compared to baselines like KGGen and GraphRAG. For practitioners building KGs from text, RAKG offers a promising approach, particularly when the goal is to create high-fidelity KGs for downstream RAG applications, provided the computational resources are available.
