Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval (2506.08074v1)

Published 9 Jun 2025 in cs.IR, cs.AI, and cs.CL

Abstract: Retrieval-Augmented Generation (RAG) grounds LLMs in external evidence, yet it still falters when answers must be pieced together across semantically distant documents. We close this gap with the Hierarchical Lexical Graph (HLG), a three-tier index that (i) traces every atomic proposition to its source, (ii) clusters propositions into latent topics, and (iii) links entities and relations to expose cross-document paths. On top of HLG we build two complementary, plug-and-play retrievers: StatementGraphRAG, which performs fine-grained entity-aware beam search over propositions for high-precision factoid questions, and TopicGraphRAG, which selects coarse topics before expanding along entity links to supply broad yet relevant context for exploratory queries. Additionally, existing benchmarks lack the complexity required to rigorously evaluate multi-hop summarization systems, often focusing on single-document queries or limited datasets. To address this, we introduce a synthetic dataset generation pipeline that curates realistic, multi-document question-answer pairs, enabling robust evaluation of multi-hop retrieval systems. Extensive experiments across five datasets demonstrate that our methods outperform naive chunk-based RAG, achieving an average relative improvement of 23.1% in retrieval recall and correctness. An open-source Python library is available at https://github.com/awslabs/graphrag-toolkit.

This paper introduces the Hierarchical Lexical Graph (HLG), a three-tier indexing framework designed to enhance multi-hop retrieval in Retrieval-Augmented Generation (RAG) systems. Traditional RAG, which relies primarily on vector similarity search, often struggles when the information needed for an answer is spread across multiple documents with low semantic similarity, because similarity search alone fails to bridge these "semantic gaps." HLG addresses this by explicitly modeling relationships between fine-grained pieces of information.

The HLG is structured into three interconnected tiers, built once per dataset:

  1. Lineage Tier: Preserves the original document structure and source information. It consists of Source Nodes (metadata like document origin) and Chunk Nodes (sequentially linked text segments from the original documents). This tier is crucial for maintaining traceability and providing context, especially in compliance-sensitive applications.
  2. Entity-Relationship Tier: Captures structured relationships. It includes Entity Nodes (named concepts such as "Amazon," each assigned a category) and Relationship Edges between entities (e.g., "FILED LAWSUIT AGAINST"). This tier serves as a structured entry point for keyword-based or complex relationship queries.
  3. Summarization Tier: Connects granular facts to broader themes. It contains Facts (subject-predicate-object triplets), Statements (atomic propositions extracted from text), and Topics (thematic summaries clustering related statements). This tier enables both fine-grained and high-level semantic understanding and navigation.

Graph connectivity within HLG is achieved through a hybrid approach combining semantic similarity and graph traversal. Topics link statements within a document (intra-document), while facts enable links across documents (inter-document). Vector embeddings of topics and statements allow for semantic similarity searches, while matching query keywords to entities enables bottom-up, structure-driven lookups.
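
The following is a minimal, illustrative sketch of how such a graph could be represented with networkx; it is not the authors' implementation, and the node identifiers, the `add_statement` helper, and the edge `kind` labels are assumptions introduced purely for illustration.

```python
import networkx as nx

# Hypothetical sketch of the HLG schema described above: nodes carry a "tier"
# label, and edges encode lineage, summarization, and entity links.
G = nx.MultiDiGraph()

def add_statement(graph, stmt_id, text, chunk_id, topic_id, entities):
    """Insert one atomic proposition and wire it into all three tiers."""
    graph.add_node(stmt_id, tier="statement", text=text)
    graph.add_edge(stmt_id, chunk_id, kind="extracted_from")   # Lineage tier
    graph.add_edge(topic_id, stmt_id, kind="has_statement")    # Summarization tier
    for ent in entities:                                       # Entity-relationship tier
        graph.add_node(ent, tier="entity")
        graph.add_edge(stmt_id, ent, kind="mentions")

# Two statements from different documents that mention the same entity become
# reachable from one another through the entity node -- the kind of
# inter-document path the retrievers traverse.
G.add_node("chunk:doc1:0", tier="chunk")
G.add_node("chunk:doc2:3", tier="chunk")
G.add_node("topic:earnings", tier="topic")
G.add_node("topic:litigation", tier="topic")
add_statement(G, "stmt:1", "Amazon reported Q3 revenue of ...",
              "chunk:doc1:0", "topic:earnings", ["Amazon"])
add_statement(G, "stmt:2", "Amazon filed a lawsuit against ...",
              "chunk:doc2:3", "topic:litigation", ["Amazon"])
```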

Building on HLG, the paper proposes two complementary retrieval models:

  1. StatementGraphRAG (SGRAG): Designed for high-precision retrieval, particularly for factoid questions.
    • It starts with an initial candidate set by combining Keyword-Based Retrieval (matching query entities to statement entities) and Vector Similarity Search (finding statements semantically similar to the query).
    • A Graph Beam Search is then performed from these initial statements. Neighbors are defined as statements that share entities with the current statement, and paths are scored by comparing an attention-weighted path embedding against the query embedding. The beam search explores multi-hop paths up to a maximum depth D_max, keeping the top B paths at each step (a simplified sketch of this search appears after the list).
    • Finally, all candidate statements are Reranked using a cross-encoder model to produce the top-k results. The paper provides pseudocode for this process.
  2. TopicGraphRAG (TGRAG): Aimed at broader, exploratory queries needing synthesis across multiple themes.
    • It uses both Top-Down Topic Discovery (retrieving topics via VSS) and Bottom-Up Entity Exploration (matching query keywords to entities and retrieving associated statements).
    • Statements from both topic and entity paths are merged, and a Graph Beam Search (similar to SGRAG but potentially starting from statements linked to initial topics/entities) is performed, guided by a reranker at each step.
    • A Final Rerank and Truncation selects the top-k relevant statements.
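
A simplified sketch of the entity-aware graph beam search is given below. It follows the description above only loosely: the statement objects (with `.entities` and `.embedding` attributes), the `entity_index` mapping, and the softmax attention weighting are assumptions, and the paper's actual scoring may differ.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def neighbors(stmt, entity_index):
    """Statements sharing at least one entity with `stmt` (its graph neighbors)."""
    return {s for ent in stmt.entities for s in entity_index[ent]} - {stmt}

def path_embedding(path, query_emb):
    """Attention-weighted average of statement embeddings along a path, with
    weights given by each statement's (softmaxed) similarity to the query."""
    sims = np.array([cosine(s.embedding, query_emb) for s in path])
    weights = np.exp(sims) / np.exp(sims).sum()
    return np.average([s.embedding for s in path], axis=0, weights=weights)

def beam_search(seed_statements, query_emb, entity_index, beam_width=8, max_depth=3):
    """Expand multi-hop paths from seed statements, keeping the top-B scored
    paths at each step up to depth D_max; the surviving statements are then
    handed to the cross-encoder reranker."""
    beams = [[s] for s in seed_statements]
    for _ in range(max_depth):
        candidates = []
        for path in beams:
            for nxt in neighbors(path[-1], entity_index):
                if nxt not in path:
                    candidates.append(path + [nxt])
        if not candidates:
            break
        candidates.sort(key=lambda p: cosine(path_embedding(p, query_emb), query_emb),
                        reverse=True)
        beams = candidates[:beam_width]
    return set(seed_statements) | {s for path in beams for s in path}
```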

The paper also discusses Post-Retrieval Processing steps:

  • Statement Diversity: A filtering step using TF-IDF vectorization and cosine similarity identifies and prunes near-duplicate statements based on a diversity threshold τ, avoiding redundancy and maximizing context-window utilization (a minimal sketch appears after this list).
  • Statement Enhancement: For data with tabular structure (like SEC-10Q filings), this step appends relevant contextual information (headers, labels) from the original text chunks to numeric propositions, improving clarity and interpretability.
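
A minimal sketch of the diversity filter, assuming a greedy TF-IDF/cosine-similarity pruning strategy with scikit-learn; how the paper's 0.5% setting maps onto a numeric threshold τ is an assumption here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def diversity_filter(statements, tau=0.995):
    """Greedily keep a statement only if its TF-IDF cosine similarity to every
    previously kept statement stays below the diversity threshold tau
    (tau = 0.995 is an assumed mapping of the paper's 0.5% setting)."""
    if not statements:
        return []
    vectors = TfidfVectorizer().fit_transform(statements)
    kept = [0]
    for i in range(1, len(statements)):
        if cosine_similarity(vectors[i], vectors[kept]).max() < tau:
            kept.append(i)
    return [statements[i] for i in kept]
```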

To rigorously evaluate multi-hop capabilities, the authors created a Synthetic Multi-Hop Summarization dataset. Recognizing limitations in existing benchmarks like SEC-10Q (limited queries) and WikiHowQA (often single-hop), their pipeline generates complex, multi-document question-answer pairs. The pipeline involves Topic Collection, Chunk Selection from multiple articles, LLM-based Query Generation requiring cross-document synthesis, and a critical Refinement and Validation stage using a second LLM and automated checks to ensure true multi-hop necessity and sufficient evidence. They generated 674 high-quality queries from the MultiHop-RAG corpus.
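
A heavily simplified sketch of the generate-then-validate idea behind this pipeline is shown below; the `call_llm` callable, the prompts, and the article format are all hypothetical stand-ins for the paper's multi-stage refinement and automated checks.

```python
import random

def generate_multihop_qa(articles, call_llm, num_docs=2):
    """Hypothetical sketch: sample chunks from several articles, ask an LLM for
    a question that needs all of them, then have a second LLM pass reject
    questions that are answerable from a single passage."""
    chunks = [random.choice(a["chunks"]) for a in random.sample(articles, k=num_docs)]
    joined = "\n---\n".join(chunks)
    draft = call_llm(
        "Write a question (with its answer) that can only be answered by "
        "combining ALL of the following passages:\n\n" + joined
    )
    verdict = call_llm(
        "Can the question below be answered from any single passage alone? "
        "Reply YES or NO.\n\nQuestion and answer:\n" + draft + "\n\nPassages:\n" + joined
    )
    return draft if verdict.strip().upper().startswith("NO") else None
```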

The Experimental Setup involved five datasets: MultiHop-RAG, SEC-10Q, ConcurrentQA, NTSB, and WikiHowQA. Data indexing followed a four-step process: Chunking, Domain-Adaptive Refinement (using an LLM to infer domain concepts), Proposition/Graph Construction (extracting statements, facts, topics, entities, and relationships), and Embedding Generation (using Cohere-Embed-English-v3). The graph was stored in AWS Neptune Analytics, and embeddings in AWS OpenSearch (with support for other databases).
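
The four indexing steps could be outlined roughly as follows; every helper and store object here is a placeholder for illustration and does not correspond to the graphrag-toolkit API.

```python
def index_corpus(documents, chunker, refine_domain, extract_propositions,
                 embed, graph_store, vector_store):
    """Illustrative outline of the four indexing steps; all helpers and store
    objects are hypothetical placeholders."""
    chunks = [c for doc in documents for c in chunker(doc)]        # 1. Chunking
    domain_hints = refine_domain(chunks)                           # 2. Domain-adaptive refinement
    texts_to_embed = []
    for chunk in chunks:
        nodes, edges = extract_propositions(chunk, domain_hints)   # 3. Statements, facts, topics,
        graph_store.add(nodes, edges)                              #    entities, relationships
        texts_to_embed += [n.text for n in nodes
                           if n.tier in ("statement", "topic")]
    vector_store.add(texts_to_embed, embed(texts_to_embed))        # 4. Embedding generation
```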

Evaluation compared three baselines (Naive RAG with VSS, Naive RAG with reranking, Entity-Linking Expansion) against four HLG-based approaches (SRAG, SGRAG, TRAG, TGRAG) and their variants with diversity filtering. The retrieval context window was fixed at 10 units (chunks or statements). A lightweight reranker (BAAI/bge-reranker-v2-minicpm-layerwise) was used across all methods before final selection. Answer generation used Claude-3 Sonnet with identical prompting, focusing evaluation strictly on retrieval effectiveness. Metrics included Correctness (for single-answer tasks) and Answer Recall (for multi-answer tasks), and RAGChecker metrics (Claim Recall, Context Precision) on the synthetic dataset.
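
Cross-encoder reranking of the kind applied before final selection can be sketched as follows; the paper uses BAAI/bge-reranker-v2-minicpm-layerwise, whereas this sketch substitutes a lighter generic cross-encoder that loads through the sentence-transformers interface.

```python
from sentence_transformers import CrossEncoder

def rerank(query, statements, top_k=10,
           model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Score (query, statement) pairs with a cross-encoder and keep the top-k.
    The model name is a lightweight stand-in, not the reranker used in the paper."""
    model = CrossEncoder(model_name)
    scores = model.predict([(query, s) for s in statements])
    ranked = sorted(zip(statements, scores), key=lambda x: x[1], reverse=True)
    return [s for s, _ in ranked[:top_k]]
```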

Results showed that HLG-based methods consistently outperformed chunk-based baselines, particularly on true multi-hop datasets.

  • For single-answer tasks, statement-based methods (SGRAG) achieved the highest correctness (e.g., SGRAG-0.5% averaged 73.6% correctness vs. 66.1% for baseline B1).
  • For multi-answer tasks, topic-based methods (TGRAG) performed best in answer recall (e.g., TGRAG averaged 53.8% recall vs. 50.8% for B1).
  • Graph expansions (SGRAG, TGRAG) generally yielded significant improvements over methods without graph traversal (SRAG, TRAG, baselines).
  • The 0.5% diversity filtering slightly improved performance by reducing redundancy (e.g., SGRAG-0.5% vs SGRAG).
  • Even when the final output was restricted to original chunks (Chunk-SGRAG, Chunk-TGRAG), leveraging statement-level graph traversal internally improved correctness over chunk-only baselines, demonstrating the value of the fine-grained graph structure even for coarse-grained output.
  • On the synthetic dataset, TGRAG showed superior claim recall (67.6%) compared to baselines (up to 62.2%), indicating better gathering of all necessary facts for multi-hop questions. Pairwise comparisons confirmed both SGRAG and TGRAG significantly beat baselines (winning >74% of comparisons).

Limitations and Future Work include the high LLM invocation and indexing costs associated with creating the fine-grained graph (though batch processing and caching help), potential performance degradation for single-hop queries where graph expansion is unnecessary, and error modes such as over-expansion from highly connected entities, misalignment with numeric/tabular data, and handling of duplicate statements. Future work will focus on optimizing proposition extraction pipelines, exploring hybrid chunk/statement retrieval approaches, and enhancing domain specialization through fine-tuning. Statement extraction accuracy was validated at over 96% faithfulness to the original text across datasets (Appendix A). Scalability analysis showed that indexing the MultiHop-RAG dataset took under an hour and that query latency was reasonable, especially with caching (Appendix B).

Authors (9)
  1. Abdellah Ghassel (2 papers)
  2. Ian Robinson (6 papers)
  3. Gabriel Tanase (2 papers)
  4. Hal Cooper (5 papers)
  5. Bryan Thompson (3 papers)
  6. Zhen Han (54 papers)
  7. Vassilis N. Ioannidis (34 papers)
  8. Soji Adeshina (13 papers)
  9. Huzefa Rangwala (57 papers)