- The paper presents a grounded knowledge graph framework that uses deterministic, interpretable semantic parses for explicit sentence-level grounding, reducing hallucinations.
- The approach leverages both AMR and SRL for structured graph construction and achieves significant ROUGE-L improvements over traditional RAG methods, especially on shorter texts.
- The work enables robust auditing by directly linking retrieval units to source text, addressing limitations of LLM-driven indices and advancing factual accuracy.
GROUNDEDKG-RAG: Grounded Knowledge Graph Index for Long-document Question Answering
Problem Setting and Limitations of Existing RAG Approaches
Retrieval-Augmented Generation (RAG) is widely adopted for long-document question answering to work around the limited context windows of LLMs. However, standard hierarchical or graph-based RAG approaches, such as RAPTOR and GraphRAG, have substantive drawbacks. These systems depend heavily on LLM-generated chunk descriptions, which are costly to produce, often redundant, and prone to hallucination because they are insufficiently grounded in the original text. Hierarchical summarization further propagates ungrounded errors through recursive abstraction and yields rigid indices that cannot adapt to the query.
The GroundedKG-RAG Framework
GroundedKG-RAG approaches RAG for long-document QA through explicit document grounding. Its knowledge graph construction pipeline is driven by deterministic, interpretable semantic parses that are linked (grounded) to sentence-level spans of the source text. Graph nodes represent entities and actions, while edges encode semantic (PropBank-style) and temporal (event sequencing) relations. Every node and edge retains pointers to the precise sentences from which it was extracted, resulting in a human-auditable index.
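The grounded index described above can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the class names (`Node`, `Edge`, `GroundedKG`), the field names, the node identifiers, and the two example sentences are all hypothetical and chosen only for illustration. It shows how each node and edge can carry the indices of the sentences that support it, so that any retrieved unit can be traced back to source text.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str
    kind: str  # "entity" or "action" (hypothetical labels)
    sentence_ids: set = field(default_factory=set)  # grounding pointers


@dataclass
class Edge:
    source: str
    target: str
    relation: str  # e.g. a PropBank-style role or a temporal relation
    sentence_ids: set = field(default_factory=set)  # grounding pointers


class GroundedKG:
    """Toy grounded graph: every node and edge points back to sentences."""

    def __init__(self, sentences):
        self.sentences = list(sentences)
        self.nodes = {}
        self.edges = []

    def add_node(self, node_id, label, kind, sentence_id):
        # Reuse an existing node so repeated mentions accumulate evidence.
        node = self.nodes.get(node_id)
        if node is None:
            node = Node(node_id, label, kind)
            self.nodes[node_id] = node
        node.sentence_ids.add(sentence_id)
        return node

    def add_edge(self, source, target, relation, sentence_id):
        # Merge duplicate edges instead of storing them twice.
        for edge in self.edges:
            if (edge.source, edge.target, edge.relation) == (source, target, relation):
                edge.sentence_ids.add(sentence_id)
                return edge
        edge = Edge(source, target, relation, {sentence_id})
        self.edges.append(edge)
        return edge

    def evidence(self, node_id):
        # Return the source sentences that ground a node, in document order.
        return [self.sentences[i] for i in sorted(self.nodes[node_id].sentence_ids)]


# Invented example sentences; in the real pipeline the nodes and edges
# would come from an SRL or AMR parser rather than being added by hand.
sentences = ["Peter squeezed under the gate.", "Then Peter ate some lettuces."]
kg = GroundedKG(sentences)
kg.add_node("peter", "Peter", "entity", 0)
kg.add_node("squeeze-01", "squeeze", "action", 0)
kg.add_edge("squeeze-01", "peter", "ARG0", 0)
kg.add_node("peter", "Peter", "entity", 1)
kg.add_node("eat-01", "eat", "action", 1)
kg.add_edge("eat-01", "peter", "ARG0", 1)
kg.add_edge("squeeze-01", "eat-01", "before", 1)

print(kg.evidence("peter"))
```

Storing sentence indices rather than copied text keeps the index compact and makes auditing a lookup rather than a search.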
Knowledge graph construction leverages both Semantic Role Labeling (SRL) and Abstract Meaning Representation (AMR) parses. SRL establishes predicate-argument structures, while AMR provides a semantically richer representation that supports coreference resolution and abstraction. AMR-based graphs outperform SRL-based ones in granularity and coreference handling, leading to more accurate retrieval and reduced duplication.
During querying, the framework transforms the user’s question into the same graph formalism, embeds query nodes, and retrieves the most similar graph nodes and their grounded sentences. Multiple node embedding strategies are implemented: basic node embeddings, local-average neighbor aggregation, and attention-weighted neighbor aggregation (using cosine similarity). Ablation experiments show that basic node embeddings perform best, as the context-sensitive variants introduce noise without substantial gains.
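The three embedding strategies can be contrasted with a short sketch. This is an illustration under stated assumptions rather than the paper's formulation: the source only says that the attention variant uses cosine similarity, so the softmax normalization of the weights and the equal mixing of the node with its neighbor context are assumptions, as are all function names and the toy two-dimensional vectors.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def basic_embedding(node, base):
    # Strategy 1: the node's own vector, with no graph context.
    return list(base[node])


def local_average_embedding(node, base, neighbors):
    # Strategy 2: unweighted mean of the node and its direct neighbors.
    vecs = [base[node]] + [base[n] for n in neighbors.get(node, [])]
    dim = len(base[node])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def attention_embedding(node, base, neighbors):
    # Strategy 3: neighbors weighted by cosine similarity to the node.
    # Assumption: weights are softmax-normalized and the neighbor context
    # is mixed 50/50 with the node's own vector.
    center = base[node]
    nbrs = neighbors.get(node, [])
    if not nbrs:
        return list(center)
    scores = [cosine(center, base[n]) for n in nbrs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(center)
    context = [sum(w * base[n][i] for w, n in zip(weights, nbrs)) for i in range(dim)]
    return [(c + x) / 2 for c, x in zip(center, context)]


# Toy vectors standing in for the output of a real text encoder.
base = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"a": ["b", "c"]}

print(basic_embedding("a", base))
print(local_average_embedding("a", base, neighbors))
print(attention_embedding("a", base, neighbors))
```

In this toy case both aggregated variants pull node "a" toward its neighbors, which illustrates how context can blur a node's identity. That is consistent with the ablation finding that the basic embedding performs best.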
Retrieval and Filtering Mechanisms
Three retrieval strategies are evaluated: (1) node-based retrieval, (2) text-level vector similarity filtering, and (3) salience-based filtering by retrieval count. Basic node-based retrieval achieves the best recall and precision, which is attributed to the explicit semantic indexing provided by the GroundedKG structure. Adding vector-level filtering consistently lowers both BERTScore and ROUGE-L F1, because gold sentences are occasionally pruned when their wording diverges from the query despite being semantically relevant. Salience-based filtering similarly reduces recall by excluding sentences that are attached to few retrieved nodes.
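A compact sketch of the three strategies follows. It is a toy illustration, not the evaluated system: the function names, the `top_k`, `threshold`, and `min_count` parameters, and the hand-written vectors are all hypothetical. It shows node-based retrieval collecting grounded sentences with a retrieval count, and the two optional filters removing sentences from that set.

```python
import math
from collections import Counter


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def node_based_retrieval(query_vecs, node_vecs, node_sentences, top_k=1):
    # For each query node, take the top_k most similar graph nodes and
    # collect their grounded sentences; the counter records how many
    # retrieved nodes point at each sentence.
    hits = Counter()
    for q in query_vecs:
        ranked = sorted(node_vecs, key=lambda n: cosine(q, node_vecs[n]), reverse=True)
        for node in ranked[:top_k]:
            hits.update(node_sentences[node])
    return hits


def filter_by_text_similarity(hits, sentence_vecs, question_vec, threshold):
    # Keep only sentences whose own embedding is close to the question.
    return {s for s in hits if cosine(sentence_vecs[s], question_vec) >= threshold}


def filter_by_salience(hits, min_count):
    # Keep only sentences reached from at least min_count retrieved nodes.
    return {s for s, count in hits.items() if count >= min_count}


# Toy data: three graph nodes grounded in sentences 0, 1, and 2.
node_vecs = {"peter": [1.0, 0.0, 0.0], "garden": [0.0, 1.0, 0.0], "eat": [0.0, 0.0, 1.0]}
node_sentences = {"peter": {0, 1}, "garden": {0}, "eat": {1, 2}}
query_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
sentence_vecs = {0: [1.0, 0.0, 0.0], 1: [0.5, 0.0, 0.5], 2: [0.0, 0.0, 1.0]}
question_vec = [0.7, 0.0, 0.7]

hits = node_based_retrieval(query_vecs, node_vecs, node_sentences)
print(sorted(hits))
print(sorted(filter_by_text_similarity(hits, sentence_vecs, question_vec, 0.8)))
print(sorted(filter_by_salience(hits, 2)))
```

In this example node-based retrieval returns all three sentences, while each filter keeps only sentence 1. If sentence 0 or 2 held the answer, the filter would have discarded it, which is the recall loss described above.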
Experimental Evaluation
GroundedKG-RAG is benchmarked against recent RAG systems on the NarrativeQA dataset, using full-length narratives of varying length (e.g., Peter Rabbit, Phantom of the Opera, Robinson Crusoe). Key findings include:
- GroundedKG-RAG matches or surpasses full-context LLM and GraphRAG performance in Exact Match, Sequence Match, and ROUGE-L, notably on shorter books (Peter Rabbit), while retaining computational efficiency.
- On longer documents, BERTScore marginally trails full-context baselines due to a combinatorially larger and sparser knowledge graph, though ROUGE-L remains competitive.
- Compared with GraphRAG, GroundedKG-RAG achieves higher ROUGE-L (34 vs. 8 on Peter Rabbit), indicating a substantive improvement in factual answer grounding. Hallucinations and redundancy are systematically minimized given the sentence-level grounding.
- Use of AMR parses (versus SRL) improves performance by virtue of superior coreference handling and semantic abstraction, enabling better node correspondence between query and knowledge graph.
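For readers less familiar with the ROUGE-L metric cited in these findings, the sketch below computes a ROUGE-L F1 score from the longest common subsequence (LCS) of tokens. It is a simplified illustration: lowercasing and whitespace tokenization are assumptions, and the paper's evaluation may use a different tokenizer or a standard library implementation. The function returns a value in [0, 1], whereas the figures above appear to be on a 0 to 100 scale.

```python
def lcs_length(a, b):
    # Classic dynamic program for the longest common subsequence,
    # keeping only the previous row to save memory.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    # Precision = LCS / candidate length, recall = LCS / reference length,
    # F1 = harmonic mean of the two.
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("a b c d", "a c"))
```

Because the LCS rewards in-order token overlap with the reference answer, verbose or off-topic generations lower precision and therefore the F1 score.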
Error Analysis
Errors are ascribed to (1) node/edge misconstruction in knowledge graph induction (especially coreference and proper noun abstraction, ameliorated by AMR), (2) retrieval mismatches or insufficient recall owing to limited expressive power of embeddings, and (3) failures in downstream generative answer construction, often when context contains distractor sentences or the LLM fails to prioritize the most relevant span.
Theoretical and Practical Implications
GroundedKG-RAG demonstrates that knowledge graph-based RAG with deterministic grounding avoids critical shortcomings of LLM-centered index construction, most notably hallucination and untraceability. By tying every retrieval unit directly to document sentences, the approach supports robust auditing and forensic error analysis, a necessity for domains requiring high factual precision and interpretability. It also decouples the indexing procedure from LLM generation, allowing efficient scaling to industrial corpora and compatibility with future LLM architectures without re-engineering the index.
Future advances may arise from improved node matching (e.g., leveraging newer AMR/SRL models), more expressive edge representations, and smarter context selection strategies that reduce context fragmentation for generation modules. Additionally, exploration of hybrid retrieval–generation modules that can reason over compositional grounded graphs may raise BERTScore in longer texts.
Conclusion
GroundedKG-RAG presents an interpretable, resource-efficient, and factually robust framework for long-document question answering. Its explicit sentence-level grounding mechanism corrects well-documented deficiencies of prior hierarchical, LLM-driven indices. Empirical results substantiate its competitiveness with much larger, costlier long-context LLMs and recent RAG architectures, especially in ROUGE-L and factual accuracy. Theoretical implications span both improved QA system auditability and a pathway toward more scalable, less hallucination-prone retrieval-augmented LLMs. This work substantiates the value of explicit semantic parses and human-readability as foundational components for trustworthy AI question answering over long-form texts.