- The paper introduces a hierarchical lexical graph framework that organizes document connections into three tiers: lineage, entity-relationship, and summarization.
- It details two retrieval methods—StatementGraphRAG and TopicGraphRAG—that combine keyword extraction, vector search, graph beam search, and reranking for precise multi-hop QA.
- Experimental results demonstrate significant improvements over baseline RAG systems, with enhanced recall and correctness across diverse datasets.
HLG: Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval
The paper "Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval" (2506.08074) introduces a novel Hierarchical Lexical Graph (HLG) framework to enhance multi-hop question answering (QA) by grounding LLMs with external evidence. This framework addresses the limitations of traditional Retrieval-Augmented Generation (RAG) systems that struggle with questions requiring synthesis from semantically distant documents. The paper also introduces a synthetic dataset generation pipeline for creating realistic, multi-document question-answer pairs to rigorously evaluate multi-hop retrieval systems.
HLG Framework: Design and Functionality
The HLG framework is designed with three interconnected tiers to enable fine-grained connectivity across documents (a minimal schema sketch follows the list):
- Lineage Tier: Establishes the foundation of the graph, ensuring traceability and contextual integrity by maintaining metadata such as document origin, date, and author information, as well as sequentially linked text segments.
- Entity-Relationship Tier: Captures relationships between entities and serves as an entry point for structured, keyword-based searches. It includes key entities classified by category and inter-entity relationships, supporting complex aggregation queries.
- Summarization Tier: Links granular facts and statements to broader topics, forming hierarchical semantic units. This tier connects statements to overarching topics, facilitating both local and global reasoning for efficient retrieval strategies in multi-hop QA tasks.
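Concretely, the three tiers can be pictured as node and edge types in a single property graph. The sketch below is a minimal, hypothetical schema in Python dataclasses; the class names (`Source`, `Chunk`, `Statement`, `Topic`, `Entity`, `Relation`) are illustrative and need not match the paper's actual node labels.

```python
from dataclasses import dataclass, field


# Lineage Tier: provenance metadata and sequentially linked text segments.
@dataclass
class Source:
    doc_id: str
    origin: str                      # e.g. file path or URL
    date: str
    author: str


@dataclass
class Chunk:
    chunk_id: str
    doc_id: str                      # links back to its Source
    text: str
    prev_chunk: str | None = None    # sequential linkage between segments
    next_chunk: str | None = None


# Summarization Tier: granular statements grouped under broader topics.
@dataclass
class Statement:
    statement_id: str
    chunk_id: str                    # provenance via the Lineage Tier
    text: str
    entity_ids: list[str] = field(default_factory=list)


@dataclass
class Topic:
    topic_id: str
    label: str
    statement_ids: list[str] = field(default_factory=list)


# Entity-Relationship Tier: typed entities and inter-entity relations.
@dataclass
class Entity:
    entity_id: str
    name: str
    category: str                    # e.g. PERSON, ORG, LOCATION


@dataclass
class Relation:
    head_entity: str
    relation: str                    # e.g. "acquired", "located_in"
    tail_entity: str
```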
The authors propose two complementary RAG methods that leverage HLG: StatementGraphRAG and TopicGraphRAG. StatementGraphRAG focuses on individual propositions, linking them across documents using the Entity-Relationship Tier and preserving provenance via the Lineage Tier, making it suitable for detailed queries needing high-precision evidence. TopicGraphRAG retrieves clusters of statements from the Summarization Tier, using entity relationships to connect thematically related clusters and relying on the Lineage Tier for source traceability, which is efficient for broader, open-ended, or higher-level queries.
Implementation Details of StatementGraphRAG
The StatementGraphRAG pipeline comprises four steps: keyword-based retrieval, vector similarity search, graph beam search, and reranking. The process is designed to progressively refine search results.
Keyword Retrieval:
Extracts query terms and retrieves statements containing those terms via explicit entity matching. Statements are ranked based on the number of keywords whose entities appear in the statement. The top-$k$ statements form the set $S_{\text{kw}}$.
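As a rough illustration, keyword retrieval can be approximated by counting how many extracted query keywords match entities attached to each statement. The sketch below assumes a precomputed mapping from statement IDs to entity names; it is not the paper's exact implementation.

```python
def keyword_retrieval(query_keywords, statement_entities, k=20):
    """Rank statements by how many query keywords match their attached entities.

    query_keywords     : list[str] of keywords extracted from the query
    statement_entities : dict[str, set[str]] mapping statement_id -> entity names
    Returns the top-k statement IDs, i.e. the set S_kw.
    """
    keywords = {kw.lower() for kw in query_keywords}
    scores = {
        sid: sum(1 for kw in keywords
                 if any(kw in ent.lower() for ent in entities))
        for sid, entities in statement_entities.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [sid for sid in ranked if scores[sid] > 0][:k]
```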
Vector Similarity Search:
In parallel with keyword retrieval, retrieves statements semantically similar to the query. The top-$k$ statements form the set $S_{\text{vss}}$. The initial candidate set is the union $S_{\text{init}} = S_{\text{kw}} \cup S_{\text{vss}}$.
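A minimal sketch of the dense branch, assuming statement embeddings are stored in a FAISS inner-product index and that a sentence-transformers model is available (the checkpoint name is an assumption, not the paper's choice):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model


def build_index(statement_texts):
    """Embed statements and index them for cosine similarity
    (inner product over L2-normalized vectors)."""
    emb = model.encode(statement_texts, normalize_embeddings=True)
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(np.asarray(emb, dtype="float32"))
    return index


def vector_search(query, index, statement_ids, k=20):
    """Return the top-k semantically similar statement IDs, i.e. the set S_vss."""
    q = model.encode([query], normalize_embeddings=True).astype("float32")
    _, idx = index.search(q, k)
    return [statement_ids[i] for i in idx[0]]


# The initial candidate set is simply the union of the two branches:
# s_init = set(keyword_retrieval(...)) | set(vector_search(...))
```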
Graph Beam Search:
Expands the search by exploring multi-hop neighbors in HLG, traversing shared entities, and scoring the resulting paths. For a statement $s$, its neighbors $\mathrm{Nbr}(s)$ are the statements $s'$ sharing entities with $s$. Given a path $P = \langle s_1, \ldots, s_n \rangle$, an attention-weighted path embedding is computed, and path relevance is scored using cosine similarity. Beam search expands the frontier until no child improves the score or a maximum depth $D_{\max}$ is reached, keeping only the $B$ highest-scoring paths at each depth. The candidate pool after graph exploration is $S_{\text{final}} = S_{\text{init}} \cup S_{\text{beam}}$.
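The beam search can be sketched as follows. The attention weights here are one plausible choice (a softmax over each statement's similarity to the query); the paper's exact weighting scheme may differ, and all embeddings are assumed to be L2-normalized so that dot products equal cosine similarities.

```python
import numpy as np


def path_embedding(path_embs, query_emb):
    """Attention-weighted average of the statement embeddings along a path.

    Weights are a softmax over each statement's similarity to the query
    (an illustrative choice, not necessarily the paper's exact formulation).
    """
    sims = np.array([emb @ query_emb for emb in path_embs])
    weights = np.exp(sims) / np.exp(sims).sum()
    pooled = (weights[:, None] * np.stack(path_embs)).sum(axis=0)
    return pooled / (np.linalg.norm(pooled) + 1e-12)


def graph_beam_search(s_init, neighbors, embeddings, query_emb,
                      beam_width=8, max_depth=3):
    """Expand S_init over shared-entity edges, keeping the B best paths per depth.

    neighbors  : dict[str, set[str]]   statement -> statements sharing an entity
    embeddings : dict[str, np.ndarray] L2-normalized statement embeddings
    Returns every statement on a retained path (S_init together with S_beam).
    """
    beams = [([s], float(embeddings[s] @ query_emb)) for s in s_init]
    visited = set(s_init)
    for _ in range(max_depth):
        candidates = []
        for path, score in beams:
            for nxt in neighbors.get(path[-1], set()) - set(path):
                new_path = path + [nxt]
                new_emb = path_embedding([embeddings[s] for s in new_path], query_emb)
                new_score = float(new_emb @ query_emb)
                if new_score > score:   # keep a child only if it improves the path
                    candidates.append((new_path, new_score))
        if not candidates:              # no child improves any path: stop early
            break
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
        visited.update(s for path, _ in beams for s in path)
    return visited
```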
Reranking:
Each statement in $S_{\text{final}}$ is rescored using a cross-encoder reranker, and the top-$k$ results are returned.
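A minimal reranking sketch using the sentence-transformers `CrossEncoder` API; the checkpoint name is illustrative and not prescribed by the paper.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed model


def rerank(query, candidate_statements, k=10):
    """Rescore every candidate statement against the query and keep the top-k."""
    pairs = [(query, text) for text in candidate_statements]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidate_statements, scores),
                    key=lambda x: x[1], reverse=True)
    return [text for text, _ in ranked[:k]]
```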
Implementation Details of TopicGraphRAG
TopicGraphRAG integrates top-down (topic-driven) and bottom-up (entity-driven) retrieval to identify and expand thematically relevant information through multi-hop reasoning.
Topic Discovery:
Embed the query to identify high-level topics aligned with user intent and retrieve relevant statements linked to those topics.
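A sketch of this top-down branch, assuming precomputed topic embeddings and a topic-to-statement mapping from the Summarization Tier (all names are illustrative):

```python
import numpy as np


def topic_discovery(query_emb, topic_embs, topic_statements, n_topics=5):
    """Match the query embedding against topic embeddings and collect the
    statements linked to the best-matching topics.

    topic_embs       : dict[str, np.ndarray] of L2-normalized topic embeddings
    topic_statements : dict[str, list[str]]  topic_id -> linked statement IDs
    """
    sims = {tid: float(emb @ query_emb) for tid, emb in topic_embs.items()}
    top_topics = sorted(sims, key=sims.get, reverse=True)[:n_topics]
    statements = set()
    for tid in top_topics:
        statements.update(topic_statements.get(tid, []))
    return top_topics, statements
```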
Entity Exploration:
Extract query-related keywords and match them with associated entities in the lexical graph, then retrieve their associated statements.
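The bottom-up branch mirrors StatementGraphRAG's keyword step: query keywords are matched to entity nodes and the statements attached to those entities are collected. A sketch under assumed lookup structures:

```python
def entity_exploration(query_keywords, entity_index, entity_statements):
    """Match query keywords to entity nodes and gather their attached statements.

    entity_index      : dict[str, str]       lowercase entity name -> entity_id
    entity_statements : dict[str, set[str]]  entity_id -> statement IDs
    """
    statements = set()
    for kw in query_keywords:
        ent_id = entity_index.get(kw.lower())
        if ent_id is not None:
            statements.update(entity_statements.get(ent_id, set()))
    return statements
```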
Graph Beam Search:
Merge topic- and entity-related statements, and use beam search to explore additional context in multiple hops, guided by a reranker at each step.
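The main difference from StatementGraphRAG's beam search is the scoring signal: here a reranker can score each expanded candidate against the query at every hop. A compressed sketch, reusing the `neighbors` mapping and cross-encoder introduced above (an interpretation of the description, not the paper's exact procedure):

```python
def reranker_guided_beam_search(seed_statements, neighbors, statement_text,
                                query, reranker, beam_width=8, max_depth=3):
    """Expand seed statements hop by hop, letting the reranker decide which
    frontier candidates survive at each depth."""
    frontier = set(seed_statements)
    collected = set(seed_statements)
    for _ in range(max_depth):
        candidates = {n for s in frontier for n in neighbors.get(s, set())} - collected
        if not candidates:
            break
        cand_list = list(candidates)
        scores = reranker.predict([(query, statement_text[s]) for s in cand_list])
        ranked = sorted(zip(cand_list, scores), key=lambda x: x[1], reverse=True)
        frontier = {s for s, _ in ranked[:beam_width]}
        collected.update(frontier)
    return collected
```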
Final Rerank and Truncation:
Rescore all candidate statements using a reranker and select the top-$k$ most relevant ones.
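Putting the four steps together, the end-to-end flow might look like the sketch below, composed from the helper functions above. The `graph` container, its attributes, and the `extract_keywords` callable are assumptions introduced for illustration.

```python
def topic_graph_rag(query, graph, model, reranker, extract_keywords, k=10):
    """End-to-end TopicGraphRAG sketch: topic and entity candidates, reranker-guided
    beam expansion, then a final rerank and truncation to the top-k statements."""
    query_emb = model.encode([query], normalize_embeddings=True)[0]
    query_keywords = extract_keywords(query)          # assumed keyword extractor

    _, topic_hits = topic_discovery(query_emb, graph.topic_embs,
                                    graph.topic_statements)
    entity_hits = entity_exploration(query_keywords, graph.entity_index,
                                     graph.entity_statements)

    seeds = topic_hits | entity_hits
    expanded = reranker_guided_beam_search(seeds, graph.neighbors,
                                           graph.statement_text, query, reranker)

    texts = [graph.statement_text[s] for s in expanded]
    return rerank(query, texts, k=k)                  # final rerank and truncation
```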
Synthetic Dataset Generation Pipeline
To address the lack of sufficiently complex evaluation datasets, the paper introduces a synthetic multi-hop summarization pipeline that generates high-quality question-answer pairs. The pipeline consists of four stages, sketched in code after the list below:
- Topic Collection: Retrieves semantically related topics from different documents based on a seed topic.
- Chunk Selection: Collects chunks from each relevant topic (3-5 distinct articles) to provide cohesive context.
- Query Generation: Prompts an LLM with diverse chunks to generate a multi-hop question that requires synthesizing information across articles.
- Critiquing and Refinement: Refines and validates queries using a second LLM pass to ensure clarity, coherence, and multi-article coverage.
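A schematic of the four-stage generation loop, with the two LLM calls left abstract. The prompt wording, the `graph` helpers, and the KEEP/DROP critique format are assumptions for illustration, not the paper's exact prompts.

```python
def generate_multihop_qa(seed_topic, graph, llm, n_articles=(3, 5)):
    """Four-stage synthetic multi-hop QA generation: topic collection,
    chunk selection, query generation, and critique/refinement."""
    # 1. Topic collection: semantically related topics from different documents.
    related_topics = graph.similar_topics(seed_topic)

    # 2. Chunk selection: gather chunks spanning 3-5 distinct articles.
    chunks = graph.chunks_for_topics(related_topics, min_docs=n_articles[0],
                                     max_docs=n_articles[1])

    # 3. Query generation: prompt an LLM for a question requiring synthesis.
    question = llm("Write one question that can only be answered by combining "
                   f"information from ALL of the following passages:\n{chunks}")

    # 4. Critique and refinement: a second LLM pass validates clarity,
    #    coherence, and multi-article coverage.
    verdict = llm("Does the question require every passage below, and is it "
                  "clear and answerable? Answer KEEP or DROP.\n"
                  f"Question: {question}\nPassages:\n{chunks}")
    return question if "KEEP" in verdict else None
```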
The authors used the MultiHop-RAG Corpus (Pandey et al., 2024), initially generating 1,173 questions and retaining 674 high-quality queries after filtering. On average, each query spans 4.1 chunks and 3.4 documents, with 4 entities per question and 9 entities per answer.
Experimental Results and Analysis
The paper presents extensive experiments across five datasets: MultiHop-RAG, SEC-10Q, ConcurrentQA, NTSB, and WikiHowQA. The results demonstrate that the HLG-based methods, StatementGraphRAG (SGRAG) and TopicGraphRAG (TGRAG), outperform naive chunk-based RAG in retrieval recall and correctness. Specifically, SGRAG-0.5% attains the highest average correctness (73.6%), an overall improvement of 7.5 points over the baseline. On the multi-answer tasks (NTSB and WikiHowQA), TGRAG yields the best average recall (53.8%), outperforming SGRAG-0.5% by 1.4 absolute points. On the synthetic MultiHop-RAG subset, TGRAG achieves a claim recall of 67.6%, compared with 62.2% for the baseline. Pairwise comparisons show that both SGRAG and TGRAG significantly outperform the chunk-based baselines, winning over 74% of head-to-head comparisons.
Conclusion
The paper "Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval" (2506.08074) presents a comprehensive framework for enhancing multi-hop QA by leveraging a hierarchical lexical graph. The HLG framework and the proposed StatementGraphRAG and TopicGraphRAG methods effectively address the limitations of traditional RAG systems. The experimental results demonstrate significant performance gains over chunk-based RAG baselines, highlighting the effectiveness of fine-grained and structured retrieval capabilities. The introduction of a synthetic multi-hop summarization pipeline also contributes to the advancement of evaluation methodologies for multi-hop retrieval systems. Future research directions include refining proposition extraction with smaller models, incorporating adaptive multi-hop detection, and expanding domain specialization to further improve the accuracy and efficiency of multi-hop RAG systems in real-world applications.