Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval

Published 9 Jun 2025 in cs.IR, cs.AI, and cs.CL | (2506.08074v1)

Abstract: Retrieval-Augmented Generation (RAG) grounds LLMs in external evidence, yet it still falters when answers must be pieced together across semantically distant documents. We close this gap with the Hierarchical Lexical Graph (HLG), a three-tier index that (i) traces every atomic proposition to its source, (ii) clusters propositions into latent topics, and (iii) links entities and relations to expose cross-document paths. On top of HLG we build two complementary, plug-and-play retrievers: StatementGraphRAG, which performs fine-grained entity-aware beam search over propositions for high-precision factoid questions, and TopicGraphRAG, which selects coarse topics before expanding along entity links to supply broad yet relevant context for exploratory queries. Additionally, existing benchmarks lack the complexity required to rigorously evaluate multi-hop summarization systems, often focusing on single-document queries or limited datasets. To address this, we introduce a synthetic dataset generation pipeline that curates realistic, multi-document question-answer pairs, enabling robust evaluation of multi-hop retrieval systems. Extensive experiments across five datasets demonstrate that our methods outperform naive chunk-based RAG, achieving an average relative improvement of 23.1% in retrieval recall and correctness. An open-source Python library is available at https://github.com/awslabs/graphrag-toolkit.

Summary

  • The paper introduces a hierarchical lexical graph framework that organizes document connections into three tiers: lineage, entity-relationship, and summarization.
  • It details two retrieval methods—StatementGraphRAG and TopicGraphRAG—that combine keyword extraction, vector search, graph beam search, and reranking for precise multi-hop QA.
  • Experimental results demonstrate significant improvements over baseline RAG systems, with enhanced recall and correctness across diverse datasets.

HLG: Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval

The paper "Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval" (2506.08074) introduces a novel Hierarchical Lexical Graph (HLG) framework to enhance multi-hop question answering (QA) by grounding LLMs with external evidence. This framework addresses the limitations of traditional Retrieval-Augmented Generation (RAG) systems that struggle with questions requiring synthesis from semantically distant documents. The paper also introduces a synthetic dataset generation pipeline for creating realistic, multi-document question-answer pairs to rigorously evaluate multi-hop retrieval systems.

HLG Framework: Design and Functionality

The HLG framework is designed with three interconnected tiers to enable fine-grained connectivity across documents (a minimal schema sketch follows the list):

  • Lineage Tier: Establishes the foundation of the graph, ensuring traceability and contextual integrity by maintaining metadata such as document origin, date, and author information, as well as sequentially linked text segments.
  • Entity-Relationship Tier: Captures relationships between entities and serves as an entry point for structured, keyword-based searches. It includes key entities classified by category and inter-entity relationships, supporting complex aggregation queries.
  • Summarization Tier: Links granular facts and statements to broader topics, forming hierarchical semantic units. This tier connects statements to overarching topics, facilitating both local and global reasoning for efficient retrieval strategies in multi-hop QA tasks.
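
As a concrete, purely illustrative picture of these tiers, the sketch below models them as plain Python dataclasses. The node names and fields are assumptions made for exposition, not the schema used by the graphrag-toolkit.

```python
from dataclasses import dataclass, field

# Hypothetical node types for the three HLG tiers (illustrative names only).

@dataclass
class Source:                     # Lineage tier: provenance metadata
    doc_id: str
    date: str
    author: str

@dataclass
class Chunk:                      # Lineage tier: sequentially linked text segments
    chunk_id: str
    source: Source
    text: str
    next_chunk: "Chunk | None" = None

@dataclass
class Statement:                  # Atomic proposition traced back to its chunk
    stmt_id: str
    text: str
    chunk: Chunk
    entities: list[str] = field(default_factory=list)   # Entity-Relationship tier links

@dataclass
class Topic:                      # Summarization tier: latent topic over statements
    topic_id: str
    label: str
    statements: list[Statement] = field(default_factory=list)
```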

The authors propose two complementary RAG methods that leverage HLG: StatementGraphRAG and TopicGraphRAG. StatementGraphRAG focuses on individual propositions, linking them across documents using the Entity-Relationship Tier and preserving provenance via the Lineage Tier, making it suitable for detailed queries needing high-precision evidence. TopicGraphRAG retrieves clusters of statements from the Summarization Tier, using entity relationships to connect thematically related clusters and relying on the Lineage Tier for source traceability, which is efficient for broader, open-ended, or higher-level queries.

Implementation Details of StatementGraphRAG

The StatementGraphRAG pipeline comprises four steps: keyword-based retrieval, vector similarity search, graph beam search, and reranking. The process is designed to progressively refine search results.

Keyword Retrieval:

Extracts query terms and retrieves statements containing those terms via explicit entity matching. Statements are ranked based on the number of keywords whose entities appear in the statement. The top-k statements form the set $\mathcal{S}_{\text{kw}}$.
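
A minimal sketch of this scoring step, assuming the `Statement` objects from the schema sketch above; the function name and signature are illustrative, not the toolkit's API.

```python
def keyword_retrieve(query_keywords, statements, k=20):
    """Rank statements by how many query keywords match their linked entities."""
    def match_count(stmt):
        entities = {e.lower() for e in stmt.entities}
        return sum(1 for kw in query_keywords if kw.lower() in entities)

    ranked = sorted(statements, key=match_count, reverse=True)
    return [s for s in ranked if match_count(s) > 0][:k]     # S_kw
```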

Vector Similarity Search:

In parallel with keyword retrieval, retrieves statements semantically similar to the query. The top-k statements form the set $\mathcal{S}_{\text{vss}}$. The initial candidate set is the union of $\mathcal{S}_{\text{kw}}$ and $\mathcal{S}_{\text{vss}}$, denoted $\mathcal{S}_{\text{init}}$.
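
A sketch of the dense-retrieval leg and the union step, assuming precomputed statement embeddings whose rows align with the statement list; this is illustrative, not the library's interface.

```python
import numpy as np

def vector_retrieve(query_emb, stmt_embs, statements, k=20):
    """Top-k statements by cosine similarity; `stmt_embs` is an (n, d) array
    whose rows align with `statements`."""
    q = query_emb / np.linalg.norm(query_emb)
    m = stmt_embs / np.linalg.norm(stmt_embs, axis=1, keepdims=True)
    top = np.argsort(-(m @ q))[:k]
    return [statements[i] for i in top]                      # S_vss

def union_by_id(left, right):
    """Deduplicate the union of two candidate lists by statement id (S_init)."""
    pool = {s.stmt_id: s for s in list(left) + list(right)}
    return list(pool.values())
```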

Graph Beam Search:

Expands the search by exploring multi-hop neighbors in HLG, traversing shared entities, and scoring the resulting paths. For a statement $s$, its neighbors $\operatorname{Nbr}(s)$ are defined as statements $s'$ sharing entities with $s$. Given a path $P = \langle s_1, \dots, s_n \rangle$, an attention-weighted path embedding is computed, and path relevance is scored using cosine similarity. Beam search expands the frontier until no child improves the score or a maximum depth $D_{\text{max}}$ is reached, keeping only the $B$ highest-scoring paths at each depth. The candidate pool after graph exploration is $\mathcal{S}_{\text{final}} = \mathcal{S}_{\text{init}} \cup \mathcal{S}_{\text{beam}}$.
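
The sketch below implements this expansion under stated assumptions: statements are referenced by id, `neighbors(sid)` returns the ids of statements sharing an entity, `embed(sid)` returns a vector, and the attention weights are softmax-normalized query similarities (one plausible reading of the paper's description, not necessarily its exact formulation).

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def path_embedding(stmt_embs, query_emb):
    """Attention-weighted path embedding: each statement embedding is weighted
    by its softmax-normalized similarity to the query."""
    sims = np.array([cosine(e, query_emb) for e in stmt_embs])
    weights = np.exp(sims) / np.exp(sims).sum()
    return np.sum(weights[:, None] * np.stack(stmt_embs), axis=0)

def graph_beam_search(seed_ids, neighbors, embed, query_emb,
                      beam_width=8, max_depth=3):
    """Expand statement paths over shared-entity edges, keeping the B
    highest-scoring paths at each depth (B = beam_width, D_max = max_depth)."""
    beams = [([sid], cosine(embed(sid), query_emb)) for sid in seed_ids]
    visited = set(seed_ids)
    for _ in range(max_depth):
        candidates = []
        for path, score in beams:
            for nbr in neighbors(path[-1]):
                if nbr in path:
                    continue
                new_path = path + [nbr]
                new_score = cosine(
                    path_embedding([embed(s) for s in new_path], query_emb),
                    query_emb)
                if new_score > score:      # expand only if the child improves the path
                    candidates.append((new_path, new_score))
        if not candidates:                 # no child improves any path: stop early
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        for path, _ in beams:
            visited.update(path)
    return visited - set(seed_ids)         # S_beam: newly discovered statements
```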

Reranking:

Each statement in $\mathcal{S}_{\text{final}}$ is rescored using a cross-encoder reranker, and the top-k results are returned.
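
One way to realize this step with an off-the-shelf cross-encoder (the model choice is an assumption; the paper does not prescribe a specific reranker), followed by how the four stages might be chained using the hypothetical helpers sketched above.

```python
from sentence_transformers import CrossEncoder

def rerank(query, statements, k=10,
           model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Rescore candidates with a cross-encoder and keep the top-k."""
    model = CrossEncoder(model_name)
    scores = model.predict([(query, s.text) for s in statements])
    ranked = sorted(zip(statements, scores), key=lambda p: p[1], reverse=True)
    return [s for s, _ in ranked[:k]]

# Chaining the four stages (all helpers are from the sketches above):
# s_kw    = keyword_retrieve(keywords, all_statements)
# s_vss   = vector_retrieve(q_emb, stmt_embs, all_statements)
# s_init  = union_by_id(s_kw, s_vss)
# s_beam  = graph_beam_search([s.stmt_id for s in s_init], neighbors, embed, q_emb)
# s_final = s_init + [lookup(sid) for sid in s_beam]
# context = rerank(query, s_final)
```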

Implementation Details of TopicGraphRAG

TopicGraphRAG integrates top-down (topic-driven) and bottom-up (entity-driven) retrieval to identify and expand thematically relevant information through multi-hop reasoning.

Topic Discovery:

Embed the query to identify high-level topics aligned with user intent and retrieve relevant statements linked to those topics.

Entity Exploration:

Extract query-related keywords and match them with associated entities in the lexical graph, then retrieve their associated statements.

Graph Beam Search:

Merge topic- and entity-related statements, and use beam search to explore additional context in multiple hops, guided by a reranker at each step.

Final Rerank and Truncation:

Rescore all candidate statements using a reranker and select the top-k relevant ones.
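
A compact sketch of how these four stages might be orchestrated, reusing the helpers sketched for StatementGraphRAG; `graph`, `embed_query`, `extract_keywords`, and `reranker` are hypothetical interfaces, not the graphrag-toolkit API.

```python
def topic_graph_rag(query, graph, embed_query, extract_keywords, reranker, k=10):
    """End-to-end sketch of TopicGraphRAG's four stages (illustrative only)."""
    q_emb = embed_query(query)

    # 1. Topic discovery: nearest topics to the query embedding, plus their statements.
    topics = graph.nearest_topics(q_emb, k=5)
    topic_stmts = [s for t in topics for s in t.statements]

    # 2. Entity exploration: match query keywords to graph entities.
    entity_stmts = [s for kw in extract_keywords(query)
                    for s in graph.statements_for_entity(kw)]

    # 3. Graph beam search over the merged pool (reusing the earlier sketches).
    seeds = union_by_id(topic_stmts, entity_stmts)
    s_beam = graph_beam_search([s.stmt_id for s in seeds],
                               graph.neighbors, graph.embedding, q_emb)
    pool = seeds + [graph.statement(sid) for sid in s_beam]

    # 4. Final rerank and truncation to the top-k statements.
    return reranker(query, pool, k=k)
```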

Synthetic Dataset Generation Pipeline

To address the lack of complex datasets, the paper introduces a synthetic multi-hop summarization pipeline that generates high-quality question-answer pairs. The pipeline consists of four stages (a code sketch follows the list):

  1. Topic Collection: Retrieves semantically related topics from different documents based on a seed topic.
  2. Chunk Selection: Collects chunks from each relevant topic (3-5 distinct articles) to provide cohesive context.
  3. Query Generation: Prompts an LLM with diverse chunks to generate a multi-hop question that requires synthesizing information across articles.
  4. Critiquing and Refinement: Refines and validates queries using a second LLM pass to ensure clarity, coherence, and multi-article coverage.
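
A hedged sketch of the four stages, assuming the dataclasses from the HLG schema sketch, a semantic topic index, and a text-in/text-out LLM callable; prompts, thresholds, and helper names are illustrative, not the paper's actual implementation.

```python
def generate_multihop_qa(seed_topic, topic_index, llm, min_docs=3, max_docs=5):
    """Generate one multi-document QA candidate, or None if filtering rejects it."""
    # 1. Topic collection: semantically related topics from different documents.
    topics = topic_index.related_topics(seed_topic, k=max_docs)

    # 2. Chunk selection: gather chunks spanning 3-5 distinct articles.
    chunks = {s.chunk.chunk_id: s.chunk for t in topics for s in t.statements}
    chunks = list(chunks.values())[:max_docs]
    if len({c.source.doc_id for c in chunks}) < min_docs:
        return None                               # not enough cross-document context
    context = "\n\n".join(c.text for c in chunks)

    # 3. Query generation: prompt an LLM for a question needing all excerpts.
    question = llm(
        "Write one question that can only be answered by combining information "
        "from ALL of the following excerpts:\n\n" + context)

    # 4. Critique and refinement: a second LLM pass validates clarity and coverage.
    verdict = llm(
        "Does the question below require synthesizing all excerpts? "
        "Answer KEEP or DISCARD.\n\nQuestion: " + question + "\n\n" + context)
    return question if "KEEP" in verdict.upper() else None
```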

The authors used the MultiHop-RAG Corpus (Tang & Yang, 2024), initially generating 1,173 questions and retaining 674 high-quality queries after filtering. On average, each query spans 4.1 chunks and 3.4 documents, with 4 entities per question and 9 entities per answer.

Experimental Results and Analysis

The paper presents extensive experiments across five datasets: MultiHop-RAG, SEC-10Q, ConcurrentQA, NTSB, and WikiHowQA. The results demonstrate that HLG-based methods outperform naive chunk-based RAG in retrieval recall and correctness. Specifically, SGRAG-0.5% attains the highest average correctness (73.6%), an overall improvement of 7.5 points over the baseline. On multi-answer tasks (NTSB and WikiHowQA), TGRAG yields the best average recall (53.8%), outperforming SGRAG-0.5% by 1.4 absolute points. On the synthetic MultiHop-RAG subset, TGRAG achieves a claim recall of 67.6%, compared to 62.2% for the baseline. Pairwise comparisons show that both SGRAG and TGRAG significantly outperform the chunk-based baselines, winning over 74% of head-to-head comparisons.

Conclusion

The paper "Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval" (2506.08074) presents a comprehensive framework for enhancing multi-hop QA by leveraging a hierarchical lexical graph. The HLG framework and the proposed StatementGraphRAG and TopicGraphRAG methods effectively address the limitations of traditional RAG systems. The experimental results demonstrate significant performance gains over chunk-based RAG baselines, highlighting the effectiveness of fine-grained and structured retrieval capabilities. The introduction of a synthetic multi-hop summarization pipeline also contributes to the advancement of evaluation methodologies for multi-hop retrieval systems. Future research directions include refining proposition extraction with smaller models, incorporating adaptive multi-hop detection, and expanding domain specialization to further improve the accuracy and efficiency of multi-hop RAG systems in real-world applications.
