Knowledge-Enhanced RAG
- Knowledge-Enhanced Retrieval Augmented Generation (KERAG) is a paradigm that integrates structured knowledge graphs with large language models to enable multi-hop, reasoning-centric question answering.
- It employs a three-stage pipeline—scope planning, subgraph retrieval, and chain-of-thought summarization—to systematically filter, retrieve, and aggregate relevant information.
- Experimental evaluations show that KERAG yields higher recall and truthfulness while significantly reducing hallucination compared to traditional RAG and KGQA approaches.
Knowledge-Enhanced Retrieval Augmented Generation (KERAG) is an advanced paradigm within Retrieval-Augmented Generation (RAG) that explicitly fuses structured knowledge representations—particularly knowledge graphs (KGs)—with learning-based generative models, chiefly LLMs. By moving beyond isolated passage retrieval, KERAG systems orchestrate broad knowledge subgraph retrieval, fine-grained filtering, and reasoning-centric summarization via LLMs tuned for chain-of-thought (CoT) inference over subgraphs. This integration addresses core limitations of both classical RAG and semantic-parsing-based Knowledge Graph Question Answering (KGQA), offering improved coverage, reduced hallucination, and heightened answer reliability across complex question settings.
1. Pipeline Foundations and System Architecture
At the core of KERAG is a three-stage pipeline that generalizes and structures the RAG-KG interaction. The pipeline can be formalized as follows:
- Scope Planning: Identify the central topic entity $E_0$ from query $Q$, and define a controlled expansion scope (up to $H_{\max}$ hops).
- Subgraph Retrieval: Expand $E_0$'s neighborhood in the KG, scoring candidate triples $t$ by semantic similarity $s(Q, t)$, and retaining those above a threshold $\tau$.
- Filtering and Summarization: Apply schema-aware and LLM-based filtering modules to prune irrelevant edges, then pass the refined subgraph into a CoT fine-tuned LLM for multi-step reasoning and answer generation.
This process is instantiated in KERAG (Sun et al., 5 Sep 2025) as the following pseudocode:
```
Input:  Question Q, Knowledge Graph K, max hops H_max, threshold τ
Output: Answer Ȃ
 1: (D, E0) ← ExtractEntityDomain(Q)
 2: h ← 1; R̄ ← ∅
 3: while h ≤ H_max do
 4:     Nh ← SchemaNeighbors(E0, h)
 5:     (R̄_h, cont) ← FilterPlan(Q, Nh)
 6:     R̄ ← R̄ ∪ R̄_h
 7:     if cont = STOP then break
 8:     h ← h + 1
 9: end while
10: S ← RetrieveSubgraph(E0, h, R̄)
11: Ȃ ← Summarize_CoT(Q, S)
12: return Ȃ
```
Key module functions:
- ExtractEntityDomain: LLM-prompted identification of the scope entity/domain.
- SchemaNeighbors: Hop-based schema expansion in the KG.
- FilterPlan: LLM/heuristic-driven predicate pruning and early stopping.
- RetrieveSubgraph: SPARQL/API-driven subgraph fetch, excluding pruned predicates.
- Summarize_CoT: CoT-optimized LLM reasoning over the subgraph.
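To make the control flow concrete, the following is a minimal, runnable Python sketch of the same loop over a toy in-memory KG. The toy graph, the keyword-based FilterPlan heuristic, and the string-returning summarizer are illustrative stand-ins for the LLM-driven modules above, not the paper's implementation.

```python
# Runnable toy sketch of the KERAG loop. TOY_KG, the keyword "filter", and
# the string "summarizer" stand in for the LLM-driven modules described above.

TOY_KG = {  # subject -> [(predicate, object), ...]
    "Ada Lovelace": [("collaborator", "Charles Babbage"), ("born_in", "1815")],
    "Charles Babbage": [("invented", "Analytical Engine"), ("born_in", "1791")],
}

def schema_neighbors(frontier):
    """Predicates reachable from the current entity frontier."""
    return {p for e in frontier for p, _ in TOY_KG.get(e, [])}

def filter_plan(question, predicates):
    """Toy FilterPlan: prune predicates whose name does not occur in Q."""
    q = question.lower()
    pruned = {p for p in predicates if p.replace("_", " ") not in q}
    return pruned, len(pruned) == len(predicates)  # STOP: nothing relevant left

def retrieve_subgraph(entities, hops, pruned):
    """BFS expansion for `hops` hops, skipping pruned predicates."""
    triples, frontier = [], set(entities)
    for _ in range(hops):
        nxt = set()
        for e in frontier:
            for p, o in TOY_KG.get(e, []):
                if p not in pruned:
                    triples.append((e, p, o))
                    nxt.add(o)
        frontier = nxt
    return triples

def kerag_answer(question, entities, h_max=2):
    pruned, h = set(), 1
    while h <= h_max:
        pruned_h, stop = filter_plan(question, schema_neighbors(entities))
        pruned |= pruned_h
        if stop:
            break
        h += 1
    subgraph = retrieve_subgraph(entities, min(h, h_max), pruned)
    # Summarize_CoT stand-in: KERAG feeds (Q, subgraph) to a CoT-tuned LLM here.
    return f"Evidence for {question!r}: {subgraph}"

print(kerag_answer("Who invented the Analytical Engine?", ["Charles Babbage"]))
```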
2. Knowledge Graph Retrieval: Broad Subgraph and Relevance Scoring
Unlike classical KGQA, which typically recovers the minimal path necessary for answer derivation, KERAG retrieves a broader multi-hop subgraph around $E_0$ (up to $H_{\max}$ hops), sharply boosting recall and coverage. The central mathematical formulation for triple relevance is:

$$s(Q, t) = \cos\big(\mathbf{e}_Q, \mathbf{e}_t\big),$$

where $\mathbf{e}_Q$ is the dense embedding of the query (via models such as DPR), and $\mathbf{e}_t$ embeds the candidate triple $t$.
All triples $t$ with $s(Q, t) \geq \tau$ are retained for up to $H_{\max}$ hops. This method yields significantly higher retrieval recall compared to path-extraction approaches: e.g., on the CRAG dataset, recall reaches $0.952$ versus $0.844$ for path-based ToG (Sun et al., 5 Sep 2025).
The retrieval loss is formalized as a DPR-style contrastive objective:

$$\mathcal{L}_{\mathrm{ret}} = -\log \frac{\exp\big(s(Q, t^{+})\big)}{\exp\big(s(Q, t^{+})\big) + \sum_{t^{-}} \exp\big(s(Q, t^{-})\big)},$$

where positives $t^{+}$ are triples from gold answer evidence and $t^{-}$ are sampled negatives.
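As an illustration of this scoring step, here is a small sketch using sentence-transformers as a convenient stand-in for a DPR-style dense encoder; the model name and threshold value are assumptions for illustration, not values from the paper.

```python
# Dense triple-relevance scoring, sketched with sentence-transformers as a
# stand-in for a DPR-style encoder. Model name and threshold are assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice

def score_triples(question, triples, tau=0.4):
    """Keep triples whose cosine similarity to the question clears tau."""
    texts = [f"{s} {p.replace('_', ' ')} {o}" for s, p, o in triples]  # verbalize
    q_emb = encoder.encode(question, convert_to_tensor=True)
    t_emb = encoder.encode(texts, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, t_emb)[0]  # one similarity per triple
    return [(t, float(sim)) for t, sim in zip(triples, sims) if sim >= tau]

kept = score_triples(
    "Who invented the Analytical Engine?",
    [("Charles Babbage", "invented", "Analytical Engine"),
     ("Charles Babbage", "born_in", "1791")],
)
print(kept)  # the biographical triple should fall below the threshold
```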
3. Graph-Aware Filtering and Chain-of-Thought Summarization
The retrieved subgraph is further refined by two complementary filtering modules:
- LLM-Based Filter: Prompted "skeleton" completions signal irrelevant predicates to prune (a prompt sketch follows below).
- Similarity Filter: Drops all triples with $s(Q, t) < \tau$ at the triple level.
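A possible shape for the LLM-based predicate filter is sketched below; the prompt wording and the `llm` completion callable are hypothetical, chosen only to illustrate the prune-by-prompting idea.

```python
# Sketch of the LLM-based predicate filter: prompt the model with the question
# and candidate predicates, and parse back the ones it marks as irrelevant.
# `llm` is an assumed text-completion callable, not an API from the paper.

def llm_filter_predicates(llm, question, predicates):
    prompt = (
        f"Question: {question}\n"
        f"Candidate KG predicates: {', '.join(predicates)}\n"
        "List the predicates that are IRRELEVANT to answering the question, "
        "comma-separated:"
    )
    reply = llm(prompt)
    # Keep only names that actually appear in the candidate set.
    return {p.strip() for p in reply.split(",") if p.strip() in set(predicates)}
```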
After filtering, the refined subgraph $S$ is passed to a summarization LLM trained for multi-fact, multi-hop aggregation:

$$\hat{A} = \mathrm{Summarize}_{\mathrm{CoT}}(Q, S).$$
Fine-tuning uses LoRA and Fully-Sharded Data Parallel (FSDP) techniques (e.g., on Llama-3.1-8B). The LLM is prompted to reason stepwise:
- Identify relevant triples
- Perform aggregation/comparison
- Produce the final answer
An example output for aggregation tasks: "(1) sum points for each game… (2) answer = 1952."
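The LoRA setup might look like the following sketch using Hugging Face transformers and peft; the rank, target modules, and other hyperparameters are illustrative assumptions rather than the paper's reported configuration.

```python
# Minimal LoRA-SFT setup with Hugging Face transformers + peft. Hyperparameters
# and target modules are illustrative assumptions, not the paper's settings.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=16,                                  # adapter rank (assumed)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only adapter weights train; base is frozen
# For multi-GPU training, the adapted model can then be wrapped with FSDP
# (e.g., via accelerate or torch.distributed.fsdp).
```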
4. Chain-of-Thought Reasoning: Data Generation and Fine-Tuning
KERAG employs automatic data generation for supervised CoT fine-tuning:
- Generate a CoT trace and answer with a vanilla LLM.
- Compare to gold answer using an LLM critic.
- Retain only examples the critic judges correct (i.e., where the generated answer matches the gold answer); discard incorrect traces (see the loop sketch below).
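Put together, this is rejection sampling over CoT traces, as in the following sketch; `generate_cot` and `critic_judges_correct` are assumed LLM wrappers, not APIs from the paper.

```python
# Sketch of the generate-critique-retain loop for building CoT SFT data.
# `generate_cot` and `critic_judges_correct` are assumed LLM wrappers.

def build_sft_examples(qa_pairs, subgraphs):
    examples = []
    for (question, gold), subgraph in zip(qa_pairs, subgraphs):
        trace, answer = generate_cot(question, subgraph)   # vanilla LLM CoT
        if critic_judges_correct(answer, gold):            # LLM critic vs. gold
            examples.append({
                "question": question,
                "subgraph": subgraph,
                "target": f"{trace}\nAnswer: {answer}",
            })
        # Incorrect traces are simply discarded (rejection sampling).
    return examples
```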
The supervised fine-tuning (SFT) loss function is the token-level negative log-likelihood over retained traces:

$$\mathcal{L}_{\mathrm{SFT}} = -\sum_{i=1}^{|y|} \log p_{\theta}\big(y_i \mid y_{<i}, Q, S\big),$$

where $y$ is the retained CoT trace and answer.
This process yields a CoT-robust LLM capable of stepwise subgraph reasoning, factoring in multiple supporting facts and aggregation/comparison operations.
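In code, this objective is the standard shifted token-level cross-entropy; here is a minimal PyTorch sketch, assuming prompt tokens (question and subgraph) are masked with the conventional `-100` ignore index so the loss covers only the trace and answer tokens.

```python
# Token-level SFT loss matching the equation above, sketched in PyTorch.
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len), -100 = ignore.
    # Shift so that position i predicts token i+1 (causal LM convention).
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # prompt tokens do not contribute to the loss
    )
```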
5. Experimental Evaluation Across QA Benchmarks
KERAG has been empirically validated on diverse QA datasets and compared to both standard LLMs and recent KGQA systems.
Key datasets:
- CRAG (578 test Qs, API-based QA)
- Head2Tail (1,125 Qs, SPARQL on DBpedia)
- QALD-10, WebQSP, AdvHotpotQA, CWQ
Metrics include:
- Accuracy (A): correct answer rate
- Hallucination rate (H): fraction of non-empty, incorrect answers
- Miss rate (M): fraction of “I don’t know” outputs
- Truthfulness (T = A − H)
- F1: $2PR/(P+R)$ for set-type answers
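These scalar metrics reduce to simple counting over model outputs, as the following sketch shows (treating `None` as an "I don't know" response):

```python
# Computing the scalar QA metrics from model outputs: a direct reading of the
# definitions above, with None standing for an "I don't know" response.

def qa_metrics(predictions, golds):
    n = len(predictions)
    correct = sum(p is not None and p == g for p, g in zip(predictions, golds))
    missed = sum(p is None for p in predictions)
    hallucinated = n - correct - missed        # answered, but wrong
    a, h, m = correct / n, hallucinated / n, missed / n
    return {"accuracy": a, "hallucination": h, "miss": m, "truthfulness": a - h}
```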
Performance Summary (CRAG):
| Model | Acc | Hall | Miss | Truth |
|---|---|---|---|---|
| GPT-4o | 0.341 | 0.090 | 0.569 | 0.251 |
| apex (KDD’24) | 0.652 | 0.194 | 0.154 | 0.458 |
| KERAG | 0.732 | 0.202 | 0.066 | 0.529 |
Head2Tail Benchmark:
| Model | Acc | Hall | Miss | Truth |
|---|---|---|---|---|
| WikiSP | 0.858 | 0.066 | 0.076 | 0.782 |
| StructGPT | 0.895 | 0.105 | 0.000 | 0.790 |
| KERAG | 0.908 | 0.049 | 0.043 | 0.860 |
Ablations highlight that omitting multi-hop expansion (–7.4pp), the filter (–3.9pp), CoT reasoning (–43.1pp), or SFT (–14.4pp) all substantially degrade truthfulness.
6. Wider Context: Related Knowledge-Enhanced RAG Paradigms
KERAG is representative of a class of methods advancing knowledge-enhanced RAG. Complementary designs include:
- KG²RAG (Zhu et al., 8 Feb 2025): KG-guided chunk expansion and fact-coherence enforcement in context selection.
- LightRAG (Guo et al., 8 Oct 2024): Dual-level (graph + vector) retrieval for improved retrieval efficiency and diversity.
- KiRAG (Fang et al., 25 Feb 2025): Iterative triple-level retrieval and reasoning chain construction, with dynamic triple selection per reasoning step.
- DO-RAG (Opoku et al., 17 May 2025): Agentic chain-of-thought KG construction, with multimodal graph fusion and grounded answer refinement.
- Know³-RAG (Liu et al., 19 May 2025): KG-driven reliability gating for answer verification and adaptive retrieval/generation/filtering.
- KG-Infused RAG (Wu et al., 11 Jun 2025): Cognitive spreading activation for KG traversal and summary-based query expansion.
- QMKGF (Wei et al., 7 Jul 2025): Multi-path subgraph construction (one-hop, multi-hop, PageRank) and subgraph fusion using attention-based reward modeling.
KERAG's broad subgraph approach and fine-tuned CoT summarization are directly aligned with these trends, but its empirical focus on multi-benchmark truthfulness, coverage, and ablation granularity is distinctive.
7. Limitations and Prospects
Key limitations:
- Evaluation is currently restricted to six QA benchmarks, with unknown transferability to other KGs or domains.
- Entity linking error propagation can degrade KG retrieval; more accurate pre-linkers are needed.
- The fixed hop count may not optimally balance noise vs. coverage; adaptive expansion could optimize information bandwidth.
- Despite substantial hallucination reduction, summarizer hallucination persists on rare complex queries (≈2% of QA tasks).
Future improvements may target:
- Enhanced entity linking and schema adaptation across KGs.
- Learning adaptive expansion strategies (dynamic $H_{\max}$) and/or RL-driven filter optimization.
- Tighter LLM-KG integration, potentially via graph-aware adapters or multi-modal attention.
- Expansion to further domains (e.g., biomedical KGQA, multi-modal KG-RAG) and richer knowledge representations (events, temporal graphs).
In summary, KERAG and the broader paradigm of knowledge-enhanced RAG establish that schema-aware, broad subgraph retrieval, coupled with LLMs fine-tuned for structured reasoning, yields consistently higher recall, truthfulness, and robustness in complex question answering. This approach systematically mitigates the bottlenecks of both unstructured passage retrieval and rigid semantic parsing, forging a scalable template for future knowledge-grounded generative systems (Sun et al., 5 Sep 2025).