
SentGraph: Hierarchical Sentence Graph for Multi-hop Retrieval-Augmented Question Answering

Published 6 Jan 2026 in cs.CL and cs.AI | (2601.03014v1)

Abstract: Traditional Retrieval-Augmented Generation (RAG) effectively supports single-hop question answering with LLMs but faces significant limitations in multi-hop question answering tasks, which require combining evidence from multiple documents. Existing chunk-based retrieval often provides irrelevant and logically incoherent context, leading to incomplete evidence chains and incorrect reasoning during answer generation. To address these challenges, we propose SentGraph, a sentence-level graph-based RAG framework that explicitly models fine-grained logical relationships between sentences for multi-hop question answering. Specifically, we construct a hierarchical sentence graph offline by first adapting Rhetorical Structure Theory to distinguish nucleus and satellite sentences, and then organizing them into topic-level subgraphs with cross-document entity bridges. During online retrieval, SentGraph performs graph-guided evidence selection and path expansion to retrieve fine-grained sentence-level evidence. Extensive experiments on four multi-hop question answering benchmarks demonstrate the effectiveness of SentGraph, validating the importance of explicitly modeling sentence-level logical dependencies for multi-hop reasoning.

Summary

  • The paper presents a hierarchical sentence graph that uses RST to model logical dependencies, outperforming traditional chunk-based retrieval methods.
  • It achieves state-of-the-art EM and F1 scores across multiple multi-hop QA datasets while reducing context redundancy.
  • The framework employs adaptive path expansion and LLM-based refinement to ensure precise, low-latency evidence retrieval during answer generation.

SentGraph: Hierarchical Sentence Graphs for Multi-hop Retrieval-Augmented Question Answering

Introduction

SentGraph (2601.03014) presents a retrieval-augmented generation (RAG) framework focused on multi-hop question answering (QA), where evidence must be aggregated across multiple sentences and documents. Unlike prevalent chunk-based RAG paradigms, SentGraph introduces a hierarchical, sentence-level graph structure grounded in Rhetorical Structure Theory (RST) to model granular logical dependencies between sentences. This enables more precise evidence retrieval and context construction, supporting robust multi-hop reasoning.

Motivation and Background

Chunk-level retrieval, the standard in RAG, fails to provide the fine-grained, logically coherent context required for multi-hop QA. Retrieved chunks tend to include both relevant and irrelevant sentences, which increases context redundancy, risks hallucination, and can lead to missed critical evidence not sufficiently similar to the query. Recent efforts to overcome these limitations rely on post-retrieval refinements or iterative multi-step retrieval, each carrying drawbacks such as excessive latency or over-dependence on initial retrieval quality. Graph-based RAG frameworks have emerged, but predominantly model relationships at the chunk or passage level and thus lack semantic and logical precision.

SentGraph addresses these deficits by reducing the atomic retrieval and reasoning unit to the sentence and organizing these into a structured, multi-layered graph that explicitly encodes intra- and cross-document logical and semantic relations.

Hierarchical Sentence Graph Construction

At the core of SentGraph is a hierarchical graph built offline via a two-stage process: intra-document logic modeling and cross-document bridging (Figure 1).

Figure 1: Comparison between traditional chunk-level graph construction and the sentence-level method used by SentGraph.

Sentences are identified as either "core" (key facts) or "supplementary," using an LLM guided by an adapted RST framework. RST relations are consolidated into a functional set suitable for QA, distinguishing between Nucleus-Nucleus (N-N) (conjunction, contrast, etc.) and Nucleus-Satellite (N-S) (cause, elaboration, etc.) relations. The graph is built as follows:

  • Topic Layer (V_t): Represents document-level semantic entities and serves as bridge nodes for cross-document integration.
  • Core Sentence Layer (V_c): Contains main factual assertions; nodes are densely linked via N-N relations.
  • Supplementary Sentence Layer (V_s): Encodes supporting information, linked to core sentences via N-S relations.

Edges represent intra-document logical dependencies as well as cross-document entity-level bridges, determined using LLM-based entity linking and semantic reasoning (Figure 2).

Figure 2: Overview of the SentGraph framework: hierarchical sentence logic graph construction offline, and graph-guided retrieval plus answer generation online.
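The three-layer structure can be sketched as a small graph type. This is a minimal illustration under assumptions: the node kinds, relation labels, and builder API are hypothetical stand-ins, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    layer: str   # "topic" (V_t), "core" (V_c), or "supplementary" (V_s)
    text: str

@dataclass
class SentGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src, dst, relation) triples

    def add_node(self, node_id, layer, text):
        self.nodes[node_id] = Node(node_id, layer, text)

    def add_edge(self, src, dst, relation):
        # relation: "N-N" (e.g. conjunction, contrast), "N-S" (e.g. cause,
        # elaboration), or "bridge" (cross-document entity-level link)
        self.edges.append((src, dst, relation))

    def neighbors(self, node_id):
        # Treat edges as undirected for retrieval-time traversal.
        return [d for s, d, _ in self.edges if s == node_id] + \
               [s for s, d, _ in self.edges if d == node_id]

# Toy example: two core facts, one supplement, one topic bridge node.
g = SentGraph()
g.add_node("t1", "topic", "Entity: Marie Curie")
g.add_node("c1", "core", "Marie Curie won the Nobel Prize in Physics in 1903.")
g.add_node("c2", "core", "She also won the Nobel Prize in Chemistry in 1911.")
g.add_node("s1", "supplementary", "The 1903 prize was shared with Pierre Curie.")
g.add_edge("c1", "c2", "N-N")     # conjunction between two core assertions
g.add_edge("c1", "s1", "N-S")     # elaboration supporting a core sentence
g.add_edge("t1", "c1", "bridge")  # topic node bridging across documents

print(g.neighbors("c1"))  # → ['c2', 's1', 't1']
```

The point of the sketch is that a supplementary sentence is reachable only through the core sentence it elaborates, which is what lets retrieval pull in supporting context along explicit logical edges rather than by surface similarity.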

Graph-Guided Online Retrieval and Answer Generation

Given a query, online processing proceeds in three phases:

  1. Anchor Selection and Refinement: Top-K anchor nodes are retrieved by similarity to the query; LLM-based refinement then filters noise and assesses evidence sufficiency.
  2. Adaptive Path Expansion: Breadth-first reasoning path expansion is conducted across the graph, aggregating semantically and logically connected evidence until the required context is established.
  3. Answer Generation: Collected sentences form the input context to the LLM, which conducts reasoning and synthesis.
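The first two phases above can be sketched as follows. This is an illustrative toy under stated assumptions: `overlap` is a word-overlap placeholder for the paper's BM25/dense retriever, the LLM refinement step is omitted, and the graph is a plain adjacency dict.

```python
from collections import deque

def select_anchors(query, sentences, similarity, k=2):
    # Phase 1: rank all sentence nodes against the query, keep the top-K.
    ranked = sorted(sentences,
                    key=lambda s: similarity(query, sentences[s]),
                    reverse=True)
    return ranked[:k]

def expand_paths(graph, anchors, budget=5):
    # Phase 2: breadth-first expansion from the anchors, collecting
    # logically connected evidence until the node budget is reached.
    seen, queue, evidence = set(anchors), deque(anchors), list(anchors)
    while queue and len(evidence) < budget:
        for nbr in graph.get(queue.popleft(), []):
            if nbr not in seen and len(evidence) < budget:
                seen.add(nbr)
                queue.append(nbr)
                evidence.append(nbr)
    return evidence  # Phase 3 would feed these sentences to the LLM.

# Toy corpus and adjacency structure.
sents = {
    "c1": "curie won the physics nobel in 1903",
    "c2": "curie won the chemistry nobel in 1911",
    "s1": "the 1903 prize was shared with pierre curie",
}
adj = {"c1": ["c2", "s1"], "c2": ["c1"], "s1": ["c1"]}
overlap = lambda q, s: len(set(q.split()) & set(s.split()))

anchors = select_anchors("when did curie win the physics nobel",
                         sents, overlap, k=1)
context = expand_paths(adj, anchors, budget=3)
print(anchors, context)  # → ['c1'] ['c1', 'c2', 's1']
```

Note how expansion retrieves the supplementary sentence `s1` even though it scores poorly against the query on its own, which is exactly the failure mode of similarity-only retrieval that path expansion is meant to fix.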

Prompt templates for relation extraction, semantic bridging, and anchor refinement are designed to maximize extraction fidelity and context alignment (Figures 3–6).

Figure 3: Prompt template for cross-document semantic bridging.

Figure 4: Prompt template for N-N (Nucleus-Nucleus) relation recognition.

Figure 5: Prompt template for anchor refinement based on query relevance and sufficiency.

Figure 6: Prompt template for N-S (Nucleus-Satellite) relation recognition.

Empirical Results

Experiments span four multi-hop QA datasets (HotpotQA, 2WikiMultiHopQA, MuSiQue, MultiHopRAG) under both sparse (BM25) and dense (BGE) retrieval settings, employing a range of LLMs (Llama-3.1-8B, Qwen2.5-7B/14B/32B). SentGraph demonstrates consistent state-of-the-art EM and F1 scores across all datasets and retrieval regimes. Ablation studies confirm the critical importance of sentence-level granularity, logical dependency modeling, and guided path expansion:

  • Sentence-level retrieval alone significantly outperforms chunk-level retrieval but is insufficient without explicit logical structuring (EM gains of 4–14 points when adding logical modeling over sentence-level retrieval only).
  • Compared with strong graph-based baselines (e.g., KGP), SentGraph's sentence-granular logical structure achieves notably higher accuracy and efficiency, with lower average input/output token utilization per query (input token reduction up to 45% on some datasets).

Efficiency and Theoretical Implications

SentGraph's graph construction is performed offline, enabling low-latency online inference. Efficiency analysis reveals that this approach not only improves computational tractability during inference but also reduces context redundancy, directly impacting LLM output conciseness and decreasing resource requirements.

Theoretically, the explicit modeling of N-N and N-S RST-based relations provides an effective mechanism for aligning LLM context windows with the discrete, compositional reasoning patterns required for multi-hop QA—challenging the assumption that improved base LLM capacity alone can close this gap.

Limitations and Future Directions

The current approach is contingent upon the accuracy of LLM-based relation annotation and is sensitive to underlying LLM quality. Extension beyond QA to other discourse-level tasks would require additional adaptation of RST relations and potentially alternate annotation schemes. Future research may address robust graph construction under annotation noise, dynamic/incremental graph adaptation for large corpora, and human-in-the-loop quality assurance.

Conclusion

SentGraph establishes a new methodology for RAG in multi-hop QA via hierarchical sentence-level graph construction with explicit RST-based logical dependencies. Its integration of fine-grained semantic modeling and efficient retrieval yields consistent gains in answer accuracy and context utilization while reducing computational overhead. The framework has substantial implications for the design of scalable, logically grounded RAG in knowledge-intensive NLP applications.
