Multi-Hop Query Retrieval

Updated 27 February 2026

Multi-hop query retrieval aggregates pieces of evidence distributed across multiple documents to answer queries that single-step methods cannot resolve.
It employs iterative and compositional approaches, such as chain-of-thought, condensed retrieval, and tree-based methods, to improve evidence recall and reduce error propagation.
Hybrid strategies, including graph-based indexing, metadata filtering, and entailment-aware methods, further refine candidate selection and reduce retrieval latency.

Multi-hop query retrieval is the process of identifying and composing chains of evidence, often scattered across multiple documents or data sources, in order to answer complex queries that cannot be resolved by a single retrieval operation. This task is central in retrieval-augmented generation (RAG) systems supporting multi-hop question answering (QA) and fact verification, and has motivated the development of a diverse range of algorithms, indexing strategies, and evaluation pipelines.

1. Fundamentals and Problem Definition

In multi-hop query retrieval, a query $q$ requires the aggregation of multiple, typically non-contiguous pieces of evidence, frequently from different documents, to produce a valid answer. Formally, given a large collection of candidates (e.g., document chunks, sentences, tables), the goal is to retrieve a sequence or set of items $\{d_1, d_2, \ldots, d_k\}$ such that, together, they constitute a minimal support chain for $q$. Standard single-hop retrieval methods, such as dense bi-encoders or BM25 ranking, are inadequate: they are optimized to find the chunk most similar to the query and are unlikely to fetch all required, possibly semantically distant, pieces of evidence.
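
One way to make this definition concrete is sketched below. The notation is an assumption of this sketch and is not taken from the cited works: $\mathcal{C}$ denotes the candidate collection, $\mathrm{supp}(S, q) \in \{0, 1\}$ indicates whether a subset $S \subseteq \mathcal{C}$ contains enough evidence to answer $q$, and $\mathrm{sim}(q, d)$ is a per-item relevance score. A minimal support set can then be written as

$$S^{*} \in \arg\min_{S \subseteq \mathcal{C}} |S| \quad \text{subject to} \quad \mathrm{supp}(S, q) = 1,$$

whereas single-hop retrieval returns the $k$ items that score highest individually:

$$R_k(q) = \operatorname*{top\text{-}k}_{d \in \mathcal{C}} \ \mathrm{sim}(q, d).$$

Under these assumptions the mismatch is visible in the form of the two objectives: $\mathrm{supp}$ is evaluated on the set jointly, while $\mathrm{sim}$ scores each item against $q$ in isolation, so a required hop $d_j$ that only becomes relevant given $d_{j-1}$ may fall outside $R_k(q)$.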

The failure modes of single-step approaches are highlighted in benchmarking analyses, such as MultiHop-RAG, where even strong pipelines miss 25–50% of essential document hops in complex queries (Tang et al., 2024).

2. Iterative and Compositional Approaches

Early solutions leveraged iterative frameworks that expand or rewrite the retrieval query after each hop, generally using LLMs to generate intermediate questions. These include:

Chain-of-thought with iterative routing: The query is reformulated sequentially as the retrieval context grows (e.g., One-shot RAG, ReAct, IRCoT). However, error propagation, cascading failures from a single bad retrieval, and high latency from repeated LLM calls remain prominent issues (Jiapeng et al., 2024).
Condensed retrieval: In "Baleen," the passages retrieved at each hop are summarized into a compact context for the next retrieval, and a focused late-interaction retriever (FLIPR) mitigates the embedding bottleneck caused by long queries (Khattab et al., 2021).
Generative multi-hop retrieval: Instead of dense nearest-neighbor search, an encoder–decoder LLM generates the text of the next retrieval target using constrained decoding; this bypasses the fixed-size vector bottleneck of bi-encoders, improves robustness to error propagation, and reduces memory footprint (Lee et al., 2022).
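
The basic retrieve-then-expand loop shared by these iterative methods can be illustrated with a toy sketch. This is not the algorithm of IRCoT, Baleen, or any other cited system: token overlap stands in for a real retriever, plain concatenation stands in for LLM query rewriting or summarization, and the corpus, function names, and query are all hypothetical.

```python
import re

def tokens(text):
    """Lowercased word set; a crude stand-in for a real retriever's representation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, corpus, exclude):
    """Return the id of the unseen passage with the largest token overlap, or None."""
    q = tokens(query)
    best_id, best_score = None, 0
    for doc_id, text in corpus.items():
        if doc_id in exclude:
            continue
        score = len(q & tokens(text))
        if score > best_score:
            best_id, best_score = doc_id, score
    return best_id

def iterative_retrieve(query, corpus, hops=2):
    """Retrieve one passage per hop, appending each hit to the query for the next hop."""
    chain, current = [], query
    for _ in range(hops):
        doc_id = retrieve(current, corpus, exclude=set(chain))
        if doc_id is None:
            break
        chain.append(doc_id)
        # Naive expansion; a real system would rewrite or condense here.
        current = current + " " + corpus[doc_id]
    return chain

# Hypothetical three-passage corpus.
corpus = {
    "d1": "Marie Curie was born in Warsaw.",
    "d2": "Warsaw is the capital of Poland.",
    "d3": "Paris is the capital of France.",
}
query = "Which country has as its capital the city where Marie Curie was born?"
chain = iterative_retrieve(query, corpus)
print(chain)  # ['d1', 'd2']
```

Against the original query alone, d2 and d3 receive the same overlap score, so a single retrieval step cannot tell the correct second hop from the distractor. Only after d1 contributes the bridge term "Warsaw" does d2 rank first. The same mechanism explains error propagation: had the first hop returned a wrong passage, the expanded query would have been steered away from d2.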

Recent methodological advances optimize the pipeline:

Tree-based retrieval: The tree-based dynamic iterative retrieval framework (ToR) and Reasoning Tree Guided RAG (RT-RAG) build explicit trees of reasoning paths, expanding each candidate branch independently and using node-accept/reject/expand logic to mitigate the propagation of error (e.g., irrelevant paragraphs) (Jiapeng et al., 2024, Shi et al., 16 Jan 2026).
Embedding-level multi-hop fusion: TreeHop avoids intermediate LLM query rewriting by performing all query updates and retrieval steps in embedding space, using an update-gating mechanism and layer-wise pruning for efficiency (Li et al., 28 Apr 2025).
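
The accept/reject/expand idea behind tree-based retrieval can be sketched as a breadth-first expansion over candidate paths. This is a minimal illustration under stated assumptions and not the ToR or RT-RAG algorithm: the `judge` function is a hypothetical rule-based stand-in for an LLM judge (it rejects a branch whose newest passage shares no capitalized entity with the earlier ones), and the corpus and scoring are toy placeholders.

```python
import re
from collections import deque

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def entities(text):
    """Capitalized words as a rough proxy for named entities (assumption)."""
    return set(re.findall(r"[A-Z][a-z]+", text))

def top_candidates(context, corpus, exclude, width):
    """Ids of the `width` best unseen passages by token overlap with the context."""
    q = tokens(context)
    scored = [(len(q & tokens(t)), d) for d, t in corpus.items() if d not in exclude]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [d for _, d in scored[:width]]

def judge(path, corpus, hops):
    """Hypothetical node judge returning 'reject', 'accept', or 'expand'."""
    if len(path) > 1:
        earlier = set().union(*(entities(corpus[d]) for d in path[:-1]))
        if not (entities(corpus[path[-1]]) & earlier):
            return "reject"  # newest passage is not linked to the path so far
    return "accept" if len(path) >= hops else "expand"

def tree_retrieve(query, corpus, hops=2, width=2):
    """Expand each branch independently and keep only the accepted paths."""
    accepted, frontier = [], deque([[]])
    while frontier:
        path = frontier.popleft()
        context = " ".join([query] + [corpus[d] for d in path])
        for doc_id in top_candidates(context, corpus, set(path), width):
            child = path + [doc_id]
            verdict = judge(child, corpus, hops)
            if verdict == "accept":
                accepted.append(child)
            elif verdict == "expand":
                frontier.append(child)
    return accepted

corpus = {
    "d1": "Marie Curie was born in Warsaw.",
    "d2": "Warsaw is the capital of Poland.",
    "d3": "Paris is the capital of France.",
}
query = "Which country has as its capital the city where Marie Curie was born?"
paths = tree_retrieve(query, corpus)
print(paths)  # [['d1', 'd2'], ['d2', 'd1']]
```

Because every branch is judged on its own, the distractor d3 is pruned wherever it appears without discarding the sibling branches that contain the correct evidence. In a single sequential chain, the same wrong retrieval would have contaminated every later hop.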

3. Graph and Structured Indexing Solutions

The challenge of capturing semantic linkage across documents motivated the emergence of multi-tiered graph-based mechanisms:

Hierarchical Lexical Graphs (HLG): Graphs where nodes are atomic propositions, topics, and entities/facts. StatementGraphRAG and TopicGraphRAG use statement-level or topic-level beam search, respectively, to traverse evidence paths via entity overlap and expand coverage in a structured manner. These methods deliver >23% average relative improvements in recall/correctness over flat chunk-based RAG (Ghassel et al., 9 Jun 2025).
Query-centric graph RAG: Constructs two-layer graphs connecting pseudo-queries (generated via Doc2Query or Doc2Query⁻⁻) and text chunks, permitting flexible multi-hop expansion by controlling granularity and connectivity; optimal performance is achieved with carefully tuned graph density and seed sets (Wu et al., 25 Sep 2025).
Ontology-based cube structures: MultiCube-RAG models subjects, attributes, and relations as orthogonal axes of multidimensional cubes, supporting fast sparse lookups and precise sub-query routing, thus offering efficient and inherently explainable retrieval (Shi et al., 11 Feb 2026).
Multi-level graph neural networks: The Query-Specific GNN (QSGNN) framework leverages message passing on entity-, chunk-, and document-level nodes, guided at each step by the query, which yields substantial robustness gains, especially for 4-hop retrieval (Yan et al., 13 Oct 2025).
Multimodal knowledge graphs: In M³KG-RAG, a multi-agent pipeline builds an audio-visual KG with entity-centric triplets. Retrieval is modality-aligned, and grounded pruning ensures that only contextually and semantically supported knowledge is passed to multimodal LLMs (Park et al., 23 Dec 2025).
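
Traversal over an entity-linked statement graph, as described for the statement-level beam search above, can be sketched as follows. This is an illustrative toy and not the StatementGraphRAG implementation: edges are created whenever two statements share a capitalized word, path scores are plain token overlap with the query, and the statements, seed choice, and function names are hypothetical.

```python
import re
from itertools import combinations

def entities(text):
    """Capitalized words as a rough proxy for named entities (assumption)."""
    return set(re.findall(r"[A-Z][a-z]+", text))

def build_graph(statements):
    """Connect two statements whenever they mention a common entity."""
    graph = {sid: set() for sid in statements}
    for a, b in combinations(statements, 2):
        if entities(statements[a]) & entities(statements[b]):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def beam_search(graph, statements, query, seeds, beam_width=1, max_hops=2):
    """Grow paths along graph edges, keeping the best `beam_width` paths per hop."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))

    def score(path):
        text = " ".join(statements[s] for s in path).lower()
        return len(q & set(re.findall(r"[a-z0-9]+", text)))

    beams = [[s] for s in seeds]
    for _ in range(max_hops - 1):
        expanded = [p + [n] for p in beams for n in sorted(graph[p[-1]]) if n not in p]
        if not expanded:
            break
        expanded.sort(key=lambda p: (-score(p), p))
        beams = expanded[:beam_width]
    return beams

# Hypothetical statement store.
statements = {
    "s1": "Marie Curie was born in Warsaw.",
    "s2": "Warsaw is the capital of Poland.",
    "s3": "Paris is the capital of France.",
    "s4": "Marie Curie worked in Paris.",
}
graph = build_graph(statements)
query = "Which country has as its capital the city where Marie Curie was born?"
beams = beam_search(graph, statements, query, seeds=["s1"])
print(beams)  # [['s1', 's2']]
```

The graph restricts each hop to statements that are explicitly linked to the current one, so s3 is never considered from s1 even though it resembles the query as closely as s2 does. How seeds are chosen and how edges are weighted are design choices that this sketch leaves open.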

4. Hybrid and Filtering-based Enhancements

Approaches also focus on refining candidate sets before or during retrieval steps:

Metadata-based filtering: Multi-Meta-RAG uses a small LLM to extract structured metadata filters (e.g., source, date) from the query, restricting initial candidate pools, boosting Hit@K by >15 points, and providing transparent, domain-specific control (Poliakov et al., 2024).
Entailment-aware retrieval: EAR/EARnest ensemble models balance semantic similarity with explicit inference (entailment) similarity, drawing from both STS (BM25, cross-encoder) and NLI models, and further boost retrieval chains with named-entity overlap considerations (Luo et al., 2023).
Noise and memory-aware methods: HANRAG uses a four-component pipeline (router, decomposer/refiner, noise filter, generator) to dynamically route queries, decompose as needed, aggressively filter for relevance, and halt when solution sufficiency is detected, sharply reducing steps and error rates over prior adaptive-RAG systems (Sun et al., 8 Sep 2025). MIND, similarly, combines entity extraction, uncertainty-guided dynamic retrieval, and persistent memory to reduce spurious hops and enhance sample efficiency (Ji et al., 29 Mar 2025).
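
The metadata-filtering step can be illustrated with a small sketch. It is not the Multi-Meta-RAG implementation: the source describes an LLM that extracts the filters, whereas the hypothetical `extract_filters` below uses simple substring matching against known source names, only the source field is handled, and the chunks and publication names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str

def extract_filters(query, known_sources):
    """Hypothetical stand-in for the LLM extractor: find known source names in the query."""
    q = query.lower()
    matched = [s for s in known_sources if s.lower() in q]
    return {"source": matched} if matched else {}

def apply_filters(chunks, filters):
    """Restrict the candidate pool before similarity search; no filter means no restriction."""
    allowed = filters.get("source")
    if not allowed:
        return list(chunks)
    return [c for c in chunks if c.source in allowed]

# Invented chunks and publication names.
chunks = [
    Chunk("Quarterly revenue rose on cloud demand.", "Daily Ledger"),
    Chunk("The company announced a new chip.", "Tech Wire"),
    Chunk("Analysts expect slower hiring.", "Daily Ledger"),
]
known_sources = sorted({c.source for c in chunks})
filters = extract_filters("What did Tech Wire report about the new chip?", known_sources)
pool = apply_filters(chunks, filters)
print(filters)                  # {'source': ['Tech Wire']}
print([c.text for c in pool])   # ['The company announced a new chip.']
```

Filtering before ranking means that passages from unrelated sources never compete for the top-K slots, which is one plausible reason such filters raise Hit@K. The fallback to the full pool when no filter is extracted keeps queries without metadata cues from returning nothing.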

5. Experimental Results and Benchmarks

The field recognizes that retrieval effectiveness for multi-hop queries cannot be reliably assessed on single-hop or structurally simple benchmarks. Recent works curate complex, multi-document QA sets:

Dataset	Source	Hops	Key Coverage
MultiHop-RAG	News	2–4	Inference/comparison/temporal/null
HotpotQA	Wikipedia	2	Factoid bridge and comparison
2WikiMultiHopQA	Wikipedia	2–4	Multi-hop, cross-entity
MuSiQue	Wikipedia	2–4	Complex bridging
Synthetic Datasets	Multi-domain	3–4	QA only solvable by evidence fusion

These datasets underpin ablation studies and system comparisons, with typical metrics including Hits@k, Recall@k, Mean Average Precision (MAP), Exact Match, and F1. Across recent benchmarks:

HLG-based (StatementGraphRAG, TopicGraphRAG) and QCG-RAG systems improve recall/correctness over naive RAG by 7–23% in relative terms (Ghassel et al., 9 Jun 2025, Wu et al., 25 Sep 2025).
Decomposition-free dense retrievers such as GRITHopper-7B match or surpass decomposition-based systems, especially on deeper hops and in zero-shot settings (Erker et al., 10 Mar 2025).
Multi-hop table retrieval (MURRE) rewrites sub-queries to exclude previously found tables, raising exact recall in text-to-SQL tasks by 6 points over the best dense and hallucination-tolerant baselines (Zhang et al., 2024).
Tree-based and tree-guided approaches (ToR, RT-RAG) deliver F1/recall gains of up to 13 points and demonstrably reduce the error cascades observed in sequential or chain-based methods (Jiapeng et al., 2024, Shi et al., 16 Jan 2026).
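
For reference, the retrieval metrics named above can be computed as in the following sketch. The definitions follow common usage, and individual papers may differ in details (for example, whether Hits@k requires any or all gold items); the ranked list and gold set are made-up examples, and the ranking is assumed to contain no duplicates.

```python
def hits_at_k(ranked, relevant, k):
    """1.0 if at least one relevant item appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Mean of precision@rank taken at each rank where a relevant item occurs."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """MAP over (ranked, relevant) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Made-up two-hop query whose gold evidence is d1 and d2.
ranked = ["d3", "d1", "d4", "d2"]
relevant = {"d1", "d2"}
print(hits_at_k(ranked, relevant, 2))     # 1.0
print(recall_at_k(ranked, relevant, 2))   # 0.5
print(average_precision(ranked, relevant))  # 0.5
```

The example shows why Hits@k alone can flatter a multi-hop retriever: the top 2 already registers a hit, yet only half of the required evidence has been found, which Recall@k exposes.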

6. Current Challenges and Future Directions

Despite progress, practical multi-hop retrieval remains limited by several factors:

Embedding bottleneck: As queries are concatenated with growing context (evidence from earlier hops), vector encoders reach representational limits, as shown in GMR ablation studies (Lee et al., 2022).
Propagation of retrieval errors: Error at an early hop often makes downstream hops unrecoverable. Tree-based, memory-aware, and filtering-augmented frameworks are active areas of research to reduce this sensitivity (Jiapeng et al., 2024, Ji et al., 29 Mar 2025).
Representation and traversal cost: Graph-based systems introduce additional engineering complexity and computational demands, though sparse cube or level-wise pruning can mitigate this (Shi et al., 11 Feb 2026, Li et al., 28 Apr 2025).
Dataset and evaluation limitations: Most existing benchmarks are insufficiently complex, and there is a trend toward generating large synthetic multi-hop corpora for robust evaluation (Ghassel et al., 9 Jun 2025).
Extensibility across modalities: Audio-visual and multimodal multi-hop retrieval is still nascent. M³KG-RAG presents a generalized method for constructing and reliably traversing multimodal multi-hop KGs, yielding large, consistent gains over prior RAG systems (Park et al., 23 Dec 2025).

Anticipated future work includes dynamic orchestration agents that plan query graphs, improved query decomposition and error detection, integration with structured metadata and ontologies for more precise candidate filtering, and practical adaptation of high-hop retrieval to large-scale, real-world tasks and scientific domains (Tang et al., 2024, Shi et al., 11 Feb 2026).