This paper introduces the Hierarchical Lexical Graph (HLG), a three-tier indexing framework designed to enhance multi-hop retrieval in Retrieval-Augmented Generation (RAG) systems. Traditional RAG relies primarily on vector similarity search and therefore often struggles when the information needed for an answer is spread across multiple documents with low semantic similarity to one another, failing to bridge these "semantic gaps." HLG addresses this by explicitly modeling relationships between fine-grained pieces of information.
The HLG is structured into three interconnected tiers, built once per dataset:
- Lineage Tier: Preserves the original document structure and source information. It consists of Source Nodes (metadata like document origin) and Chunk Nodes (sequentially linked text segments from the original documents). This tier is crucial for maintaining traceability and providing context, especially in compliance-sensitive applications.
- Entity-Relationship Tier: Captures structured relationships. It includes Entity Nodes (named concepts like "Amazon," categorized) and Relationship Edges between entities (e.g., "FILED LAWSUIT AGAINST"). This tier serves as a structured entry point for keyword-based or complex relationship queries.
- Summarization Tier: Connects granular facts to broader themes. It contains Facts (subject-predicate-object triplets), Statements (atomic propositions extracted from text), and Topics (thematic summaries clustering related statements). This tier enables both fine-grained and high-level semantic understanding and navigation.
Graph connectivity within HLG is achieved through a hybrid approach combining semantic similarity and graph traversal. Topics link statements within a document (intra-document), while facts enable links across documents (inter-document). Vector embeddings of topics and statements allow for semantic similarity searches, while matching query keywords to entities enables bottom-up, structure-driven lookups.
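To make the structure concrete, the tiers and their connections can be sketched as plain Python dataclasses. This is a minimal sketch: the class and field names below are illustrative assumptions, not the paper's exact schema.

```python
from dataclasses import dataclass, field

@dataclass
class SourceNode:                 # Lineage tier: document-level metadata
    doc_id: str
    metadata: dict = field(default_factory=dict)

@dataclass
class ChunkNode:                  # Lineage tier: ordered text segments
    chunk_id: str
    doc_id: str
    text: str
    next_chunk_id: str | None = None      # sequential link to the following chunk

@dataclass
class EntityNode:                 # Entity-relationship tier: named, categorized concept
    name: str
    category: str

@dataclass
class RelationshipEdge:           # Entity-relationship tier: typed edge between entities
    subject: str
    predicate: str                # e.g. "FILED LAWSUIT AGAINST"
    obj: str

@dataclass
class FactNode:                   # Summarization tier: subject-predicate-object triplet
    subject: str
    predicate: str
    obj: str
    statement_ids: list[str] = field(default_factory=list)  # statements supporting the fact

@dataclass
class StatementNode:              # Summarization tier: atomic proposition from a chunk
    statement_id: str
    chunk_id: str
    text: str
    entities: set[str] = field(default_factory=set)
    embedding: list[float] | None = None  # enables vector similarity search

@dataclass
class TopicNode:                  # Summarization tier: thematic cluster of statements
    topic_id: str
    summary: str
    statement_ids: list[str] = field(default_factory=list)
    embedding: list[float] | None = None
```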
Building on HLG, the paper proposes two complementary retrieval models:
- StatementGraphRAG (SGRAG): Designed for high-precision retrieval, particularly for factoid questions.
- It builds an initial candidate set by combining Keyword-Based Retrieval (matching query entities to statement entities) with Vector Similarity Search (finding statements semantically similar to the query).
- A Graph Beam Search is then performed from these initial statements. Neighbors are defined as statements sharing entities (Eq. \ref{eq:neighbours}), and paths are scored by comparing an attention-weighted path embedding to the query embedding (Eq. \ref{eq:path_embedding}, \ref{eq:beam_score}). The beam search explores multi-hop paths up to a maximum depth, keeping only the top-scoring paths (the beam) at each step.
- Finally, all candidate statements are Reranked with a cross-encoder model to produce the top-ranked results. Algorithm \ref{alg:statementgraphrag} provides pseudocode for this process; a simplified beam-search sketch also appears after this list.
- TopicGraphRAG (TGRAG): Aimed at broader, exploratory queries needing synthesis across multiple themes.
- It combines Top-Down Topic Discovery (retrieving topics via VSS) with Bottom-Up Entity Exploration (matching query keywords to entities and retrieving their associated statements).
- Statements from both topic and entity paths are merged, and a Graph Beam Search (similar to SGRAG but potentially starting from statements linked to initial topics/entities) is performed, guided by a reranker at each step.
- A Final Rerank and Truncation step selects the top-ranked relevant statements (a sketch of TGRAG's dual entry points also follows this list).
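Below is a simplified, non-authoritative sketch of SGRAG's graph beam search, reusing the StatementNode type from the schema sketch above. The entity-sharing neighbour rule and the attention-weighted path scoring loosely follow Eqs. \ref{eq:neighbours}-\ref{eq:beam_score}; the softmax attention, the default beam width, and the depth are assumptions, and the final cross-encoder rerank is omitted.

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def neighbours(stmt, statements):
    """Statements that share at least one entity with `stmt`."""
    return [s for s in statements
            if s.statement_id != stmt.statement_id and stmt.entities & s.entities]

def path_score(path, query_emb):
    """Score a path by comparing an attention-weighted path embedding to the query."""
    embs = np.array([s.embedding for s in path], dtype=float)
    attn = np.array([cosine(e, query_emb) for e in embs])
    attn = np.exp(attn) / np.exp(attn).sum()              # softmax over the path's statements
    return cosine((attn[:, None] * embs).sum(axis=0), query_emb)

def graph_beam_search(seed_statements, statements, query_emb, beam_width=8, max_depth=3):
    """Expand multi-hop paths from the initial candidates, keeping the best paths per step."""
    beams = [([s], path_score([s], query_emb)) for s in seed_statements]
    for _ in range(max_depth):
        candidates = []
        for path, _ in beams:
            for nb in neighbours(path[-1], statements):
                if nb not in path:                        # avoid revisiting statements
                    new_path = path + [nb]
                    candidates.append((new_path, path_score(new_path, query_emb)))
        if not candidates:
            break
        beams = sorted(beams + candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    # Flatten surviving paths into a deduplicated candidate set for the reranker.
    return list({s.statement_id: s for path, _ in beams for s in path}.values())
```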
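TGRAG's two entry points can be sketched in the same spirit; keyword extraction and the per-step reranker are stubbed out, so this only illustrates how topic-derived and entity-derived statements are merged before the shared beam search (function and parameter names are assumptions).

```python
def topic_discovery(query_emb, topics, statements_by_id, top_t=5):
    """Top-down: retrieve the most query-similar topics, then their member statements."""
    ranked = sorted(topics, key=lambda t: cosine(t.embedding, query_emb), reverse=True)
    return [statements_by_id[sid] for t in ranked[:top_t] for sid in t.statement_ids]

def entity_exploration(query_keywords, statements):
    """Bottom-up: match query keywords to statement entities."""
    keys = {k.lower() for k in query_keywords}
    return [s for s in statements if keys & {e.lower() for e in s.entities}]

def tgrag_seed_statements(query_emb, query_keywords, topics, statements, statements_by_id):
    merged = {s.statement_id: s
              for s in topic_discovery(query_emb, topics, statements_by_id)}
    for s in entity_exploration(query_keywords, statements):
        merged.setdefault(s.statement_id, s)
    # The merged set seeds the same graph beam search as SGRAG, with a reranker
    # guiding expansion at each step and a final rerank-and-truncate pass (not shown).
    return list(merged.values())
```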
The paper also discusses Post-Retrieval Processing steps:
- Statement Diversity: A filtering step using TF-IDF vectorization and cosine similarity (Section \ref{subsec:statement_diversity}) identifies and prunes near-duplicate statements based on a diversity threshold, avoiding redundancy and maximizing context-window utilization (a minimal sketch follows this list).
- Statement Enhancement: For data with tabular structures (like SEC-10Q), this step appends relevant contextual information (headers, labels) from the original text chunks to numeric propositions, improving clarity and interpretability (Section \ref{subsec:statement_enhancement_tabular}).
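A minimal sketch of the diversity filter, assuming a greedy keep-or-prune pass over statements that are already ranked by relevance; the paper specifies TF-IDF vectors, cosine similarity, and a diversity threshold, but the 0.8 default used here is a placeholder.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def diversity_filter(statements, threshold=0.8):
    """Drop any statement whose TF-IDF cosine similarity to an already-kept
    statement meets or exceeds `threshold` (statements assumed pre-ranked)."""
    if not statements:
        return []
    tfidf = TfidfVectorizer().fit_transform(s.text for s in statements)
    sims = cosine_similarity(tfidf)
    kept = []
    for i, _ in enumerate(statements):
        if all(sims[i, j] < threshold for j in kept):
            kept.append(i)
    return [statements[i] for i in kept]
```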
To rigorously evaluate multi-hop capabilities, the authors created a Synthetic Multi-Hop Summarization dataset (Section \ref{sec:synthetic_dataset}). Recognizing limitations in existing benchmarks such as SEC-10Q (limited queries) and WikiHowQA (often effectively single-hop), they built a pipeline that generates complex, multi-document question-answer pairs. The pipeline involves Topic Collection, Chunk Selection from multiple articles, LLM-based Query Generation requiring cross-document synthesis, and a critical Refinement and Validation stage that uses a second LLM plus automated checks to ensure true multi-hop necessity and sufficient evidence. They generated 674 high-quality queries from the MultiHop-RAG corpus.
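One of the automated validation checks can be illustrated as follows: rejecting candidate question-answer pairs whose supporting evidence does not actually span multiple documents. The dictionary fields and the two-document minimum are assumptions made purely for illustration.

```python
def is_true_multi_hop(qa_pair, min_docs=2):
    """Keep a generated QA pair only if its evidence chunks come from
    at least `min_docs` distinct source documents."""
    evidence_docs = {chunk["doc_id"] for chunk in qa_pair["evidence_chunks"]}
    return len(evidence_docs) >= min_docs

example = {
    "question": "an example question requiring evidence from two articles",
    "evidence_chunks": [{"doc_id": "article_12"}, {"doc_id": "article_47"}],
}
assert is_true_multi_hop(example)   # evidence spans two documents, so the pair is kept
```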
The Experimental Setup (Section \ref{sec:experiment_setup}) involved five datasets: MultiHop-RAG, SEC-10Q, ConcurrentQA, NTSB, and WikiHowQA (Table \ref{tab:datasets}). Data indexing followed a four-step process: Chunking, Domain-Adaptive Refinement (using an LLM to infer domain concepts), Proposition/Graph Construction (extracting statements, facts, topics, entities, and relationships), and Embedding Generation (using Cohere-Embed-English-v3). The graph was stored in AWS Neptune Analytics, and embeddings in AWS OpenSearch (with support for other databases).
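At a high level, the four-step indexing flow can be sketched as a thin orchestration function; the chunker, LLM-based refinement and extraction steps, and the embedder are passed in as opaque callables because the paper's prompts and service integrations (Cohere, Neptune Analytics, OpenSearch) are not reproduced here, and the returned dictionary layout is an assumption.

```python
def build_hlg_index(documents, chunker, refine_domain, extract_graph, embed):
    # 1. Chunking: split each source document into chunk records.
    chunks = [chunk for doc in documents for chunk in chunker(doc)]
    # 2. Domain-adaptive refinement: infer domain concepts from the corpus.
    domain_concepts = refine_domain(chunks)
    # 3. Proposition/graph construction: statements, facts, topics, entities, relationships.
    graph = extract_graph(chunks, domain_concepts)
    # 4. Embedding generation for statements and topics (Cohere-Embed-English-v3 in the paper).
    for statement in graph["statements"]:
        statement.embedding = embed(statement.text)
    for topic in graph["topics"]:
        topic.embedding = embed(topic.summary)
    return graph
```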
Evaluation compared three baselines (Naive RAG with VSS, Naive RAG with reranking, Entity-Linking Expansion) against four HLG-based approaches (SRAG, SGRAG, TRAG, TGRAG) and their variants with diversity filtering. The retrieval context window was fixed at 10 units (chunks or statements). A lightweight reranker (BAAI/bge-reranker-v2-minicpm-layerwise) was used across all methods before final selection. Answer generation used Claude-3 Sonnet with identical prompting, so the evaluation isolates retrieval effectiveness. Metrics included Correctness for single-answer tasks, Answer Recall for multi-answer tasks, and RAGChecker metrics (Claim Recall, Context Precision) on the synthetic dataset.
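As a rough illustration of the multi-answer metric, Answer Recall can be read as the fraction of gold answers recovered by the generated response; the substring matching below is an assumption, since the paper's exact matching procedure is not reproduced in this summary.

```python
def answer_recall(generated: str, gold_answers: list[str]) -> float:
    """Fraction of gold answers that appear in the generated response."""
    text = generated.lower()
    hits = sum(1 for answer in gold_answers if answer.lower() in text)
    return hits / len(gold_answers) if gold_answers else 0.0

print(answer_recall("Amazon and Apple both raised prices.", ["Amazon", "Apple", "Netflix"]))  # ~0.67
```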
Results (Section \ref{sec:results}) showed that HLG-based methods consistently outperformed chunk-based baselines, particularly on true multi-hop datasets.
- For single-answer tasks, statement-based methods (SGRAG) achieved the highest correctness (e.g., SGRAG-0.5% averaged 73.6% correctness vs. 66.1% for baseline B1).
- For multi-answer tasks, topic-based methods (TGRAG) performed best in answer recall (e.g., TGRAG averaged 53.8% recall vs. 50.8% for B1).
- Graph expansions (SGRAG, TGRAG) generally yielded significant improvements over methods without graph traversal (SRAG, TRAG, baselines).
- The 0.5% diversity filtering slightly improved performance by reducing redundancy (e.g., SGRAG-0.5% vs SGRAG).
- Even when the final output was restricted to original chunks (Chunk-SGRAG, Chunk-TGRAG), leveraging statement-level graph traversal internally improved correctness over chunk-only baselines, demonstrating the value of the fine-grained graph structure even for coarse-grained output.
- On the synthetic dataset, TGRAG showed superior claim recall (67.6%) compared to baselines (up to 62.2%), indicating better gathering of all necessary facts for multi-hop questions. Pairwise comparisons confirmed both SGRAG and TGRAG significantly beat baselines (winning >74% of comparisons).
Limitations and Future Work (Section \ref{sec:discussion_future}) include the high LLM invocation and indexing costs of creating the fine-grained graph (though batch processing and caching help), potential performance degradation on single-hop queries where graph expansion is unnecessary, and error modes such as over-expansion from highly connected entities, misalignment with numeric/tabular data, and duplicate statements. Future work will focus on optimizing proposition-extraction pipelines, exploring hybrid chunk/statement retrieval, and enhancing domain specialization through fine-tuning. Statement-extraction accuracy was validated at over 96% faithfulness to the original text across datasets (Appendix A). A scalability analysis showed that indexing the MultiHop-RAG dataset took under an hour and that query latency was reasonable, especially with caching (Appendix B).