
Multi-hop Question Answering Overview

Updated 17 December 2025
  • Multi-hop Question Answering is a task requiring integration and reasoning across multiple evidence sources to synthesize precise answers.
  • Systems leverage various architectures such as pipeline retrieval, graph-based models, and LLM-driven iterative approaches to address challenges like query drift.
  • Recent advances include label-free retriever training, hypergraph-based semantic fusion, and dynamic attention strategies, enhancing both accuracy and efficiency.

Multi-hop Question Answering (MHQA) is the task of answering natural language questions that require integrating and reasoning over multiple, distinct pieces of information—typically drawn from multiple documents, passages, or knowledge graph nodes—via multi-step reasoning chains. Unlike single-hop QA, where a single evidence span suffices to answer a query, MHQA explicitly targets scenarios where each “hop” in the reasoning process uncovers partial information used to construct or filter subsequent queries, ultimately synthesizing the requisite knowledge path to produce the answer. This paradigm is central to benchmarks such as HotpotQA, 2WikiMultiHopQA, and MuSiQue, and motivates a wide spectrum of neural and symbolic methods for retrieval, aggregation, and reasoning over heterogeneous textual and structured corpora (Mavi et al., 2022).

1. Formal Task Definition and Taxonomy

Let $\mathcal{C}$ denote a corpus of contexts (documents, passages), $\mathcal{S}$ the set of possible natural language questions, and $\mathcal{A}$ the answer space. An MHQA instance is a tuple $(q, \{c_j\}_{j=1}^n)$, where $q \in \mathcal{S}$ must be answered by synthesizing information from a set of $k$ ($k \geq 2$) supporting contexts $C_p = \{c_1, \ldots, c_k\} \subseteq \mathcal{C}$. The general objective is to learn a function

$$f: \mathcal{S} \times \mathcal{C}^n \to \mathcal{A} \cup \{\emptyset\}$$

such that $f(q, \{c_j\}) \models a$ (entails the correct answer $a$ or detects unanswerable cases). The reasoning path is an ordered sequence $(c_1 \to c_2 \to \ldots \to c_k \to a)$ in which each transition represents a deductive or compositional inference step.
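
To make the tuple structure concrete, here is a minimal sketch of how an MHQA instance per the definition above might be represented; the field names are our own and do not follow any particular benchmark's schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MHQAInstance:
    """One multi-hop QA instance: a question plus candidate contexts."""
    question: str              # q ∈ S
    contexts: list[str]        # the n candidate contexts {c_j} ⊆ C
    supporting_ids: list[int]  # indices of the k >= 2 gold contexts C_p
    answer: Optional[str]      # a ∈ A, or None when unanswerable (∅)
```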

MHQA question types are commonly categorized as:

  • Bridge: Chain composition via an explicit intermediate entity or attribute (e.g., “A's spouse starred in what movie?”).
  • Comparison: Mapping comparative relations across entities (“Who has more X, Y or Z?”).
  • Intersection: Satisfying multiple constraints that jointly yield the answer.
  • Commonsense: Requiring external knowledge or implicit inference (Mavi et al., 2022).

2. Methodological Landscape

MHQA systems encompass a range of architectures with distinct mechanisms for retrieval, reasoning, and aggregation.

2.1 Pipeline and Iterative Architectures

  • Pipeline (Retrieve-then-Read): Initial candidate selection (BM25, dense retrieval), followed by span extraction or classification via transformers (e.g., RoBERTa, ELECTRA) over concatenated contexts (Yin et al., 2022).
  • Iterative Retrieval-Augmented Generation (RAG): Sequentially reformulates queries based on intermediate results (thoughts), employing dense retrievers and generative LMs to iteratively retrieve, reason, and requery; a schematic loop is sketched after this list. Iterative frameworks include IQATR loops and TreeHop, the latter using purely embedding-based update rules for efficient multi-hop traversal (Li et al., 28 Apr 2025).
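
A minimal sketch of the retrieve-reason-requery loop follows. The `retriever.search(query, k)` and `llm.generate(prompt)` interfaces are hypothetical stand-ins for any dense retriever and instruction-tuned LM, and the `ANSWER:` stopping convention is likewise illustrative.

```python
def iterative_rag(question, retriever, llm, max_hops=4, k=5):
    """Retrieve-reason-requery loop over hypothetical interfaces."""
    evidence, query = [], question
    for _ in range(max_hops):
        for passage in retriever.search(query, k=k):
            if passage not in evidence:      # de-duplicate across hops
                evidence.append(passage)
        thought = llm.generate(
            f"Question: {question}\nEvidence:\n" + "\n".join(evidence) +
            "\nIf the evidence suffices, reply 'ANSWER: <answer>'; "
            "otherwise state the single missing fact as a follow-up query."
        )
        if thought.startswith("ANSWER:"):
            return thought[len("ANSWER:"):].strip()
        query = thought                      # reformulated next-hop query
    # hop budget exhausted: answer with whatever evidence was gathered
    return llm.generate(
        f"Question: {question}\nEvidence:\n" + "\n".join(evidence) + "\nAnswer:"
    )
```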

2.2 Graph-based and Structured Approaches

  • Graph-based Reasoners: Construct context graphs over extracted entities, sentences, or paragraphs. Reasoning is performed with GNN variants (GCN, GAT), e.g., HGN and GATH, leveraging hierarchical and sequential message passing (He et al., 2023); a single message-passing layer is sketched after this list.
  • Hypergraph & Knowledge Graph Models: HGRAG fuses fine-grained entity and coarse-grained passage similarity via hypergraph diffusion, combining semantic and structural cues to overcome limitations of entity-only or passage-only retrieval (Wang et al., 15 Aug 2025).
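
To convey the flavor of the message passing these reasoners perform, the layer below computes attention-weighted updates over a binary adjacency mask. This is a GAT-flavored PyTorch sketch, not the HGN or GATH architecture; edge types, hierarchy, and multi-head attention are omitted.

```python
import torch
import torch.nn as nn

class GraphReasonLayer(nn.Module):
    """One round of attention-weighted message passing over a context graph."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)   # edge scorer over node pairs
        self.update = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (n_nodes, dim) node features; adj: (n_nodes, n_nodes) 0/1 mask
        n = x.size(0)
        pair = torch.cat([x.unsqueeze(1).expand(n, n, -1),
                          x.unsqueeze(0).expand(n, n, -1)], dim=-1)
        logits = self.score(pair).squeeze(-1)             # (n, n) edge scores
        logits = logits.masked_fill(adj == 0, float("-inf"))
        attn = torch.nan_to_num(torch.softmax(logits, dim=-1))  # isolated rows -> 0
        return torch.relu(self.update(attn @ x)) + x      # residual node update
```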

2.3 LLM–Centric Approaches

  • Chain-of-Thought (CoT) and Tree-of-Thought: Prompting LLMs to produce stepwise or tree-structured intermediate reasoning, often combined with retrieval at each step. Stochastic ToT with constrained decoding (STOC-TOT) employs stochastic path confidence estimation and answer-space restriction to improve both correctness and faithfulness (Bi et al., 4 Jul 2024).
  • Self-Guided Finite-State Machines: SG-FSM decomposes prompting into a micro-task FSM, where each sub-component (decomposition, search, revision) has explicit error-checking and auto-correction routines that robustly reduce hallucinations and error propagation (Wang et al., 22 Oct 2024); a schematic FSM loop is sketched after this list.
  • Monte-Carlo Tree Search (MCTS)-Augmented LLMs: MZQA casts multi-hop reasoning as a finite-horizon MDP and uses MCTS to explore and select highly rewarding reasoning trajectories without reliance on few-shot exemplars (Lee et al., 28 Sep 2024).
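
The sketch below illustrates the micro-task FSM pattern in the spirit of SG-FSM. The state names, prompts, and retry-on-failure rule are our own simplifications rather than the paper's specification; `llm.generate` and `retriever.search` are hypothetical interfaces.

```python
def sg_fsm_qa(question, llm, retriever, max_steps=12):
    """Decompose -> search -> verify loop with auto-correction on failure."""
    state, notes = "DECOMPOSE", []
    subq, docs, ans = None, [], None
    for _ in range(max_steps):
        if state == "DECOMPOSE":
            subq = llm.generate(
                f"Question: {question}\nResolved so far: {notes}\n"
                "Reply with the next unanswered sub-question, or DONE."
            ).strip()
            state = "FINALIZE" if subq.upper() == "DONE" else "SEARCH"
        elif state == "SEARCH":
            docs = retriever.search(subq, k=3)
            ans = llm.generate(f"Answer '{subq}' using only:\n" + "\n".join(docs))
            state = "VERIFY"
        elif state == "VERIFY":
            verdict = llm.generate(
                "Passages:\n" + "\n".join(docs) +
                f"\nIs '{ans}' a supported answer to '{subq}'? yes/no"
            )
            if verdict.strip().lower().startswith("y"):
                notes.append((subq, ans))  # accept the sub-answer, continue
                state = "DECOMPOSE"
            else:
                state = "SEARCH"           # auto-correction: re-search the hop
        elif state == "FINALIZE":
            return llm.generate(
                f"Question: {question}\nSub-answers: {notes}\nFinal answer:"
            )
    return ans  # step budget exhausted; fall back to the last sub-answer
```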

2.4 Retriever Training with Weak/Label-Free Supervision

Dense retrievers, such as those built on BGE or Contriever, are typically trained on manually annotated (query, document) pairs—a bottleneck exacerbated in MHQA by the semantic drift in reformulated queries across hops. ReSCORE introduces label-free dense retriever training by computing LLM-derived pseudo-labels:

  • Relevance: $P_{LM}(q \mid d)$, the probability that the LLM reconstructs query $q$ from document $d$.
  • Consistency: $P_{LM}(a \mid q, d)$, the probability that the LLM produces answer $a$ when prompted with $(q, d)$.

The retriever is trained to minimize the KL divergence between the soft LLM label $Q_{LM}(d \mid q, a) \propto P_{LM}(q \mid d) \cdot P_{LM}(a \mid q, d)$ and the retriever's output distribution. This yields label-free, multi-hop–aware retrievers that outperform both BM25 and off-the-shelf dense encoders on MuSiQue, HotpotQA, and 2WikiMultiHopQA (Lee et al., 27 May 2025).
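
A minimal PyTorch sketch of this objective, assuming the two LLM log-probabilities have been precomputed per candidate document (ReSCORE's exact temperature and normalization details may differ):

```python
import torch.nn.functional as F

def rescore_loss(retriever_scores, logp_q_given_d, logp_a_given_qd):
    """KL(Q_LM || P_retriever) over one query's candidate documents.

    retriever_scores : (n_docs,) similarity scores s(q, d) from the retriever
    logp_q_given_d   : (n_docs,) LLM log-prob of reconstructing q from d
    logp_a_given_qd  : (n_docs,) LLM log-prob of answer a given (q, d)
    """
    # Q_LM(d|q,a) ∝ P_LM(q|d) · P_LM(a|q,d): add log-probs, then normalize
    target = F.softmax(logp_q_given_d + logp_a_given_qd, dim=-1)
    log_pred = F.log_softmax(retriever_scores, dim=-1)
    return F.kl_div(log_pred, target, reduction="sum")
```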

3. Architectural and Methodological Advances

3.1 Label-Free Retriever Training and ReSCORE

ReSCORE’s supervision signal, leveraging both relevance and answer consistency, is essential for multi-hop settings. Empirical ablation demonstrates:

  • $P_{LM}(q \mid d)$ alone marginally boosts recall over standard Contriever;
  • $P_{LM}(a \mid q, d)$ alone induces many false positives;
  • the product $P_{LM}(q \mid d) \cdot P_{LM}(a \mid q, d)$ yields substantial gains, since it jointly encodes informativeness and answer fidelity (Lee et al., 27 May 2025).

The iterative RAG loop underlying ReSCORE enables the retriever to discover evidence that progressively complements already retrieved context, achieving enhanced multi-step coverage compared to InfoNCE-trained (supervised) retrievers, which can suffer from embedding drift due to tying a single query to multiple diverse gold documents.

3.2 Context Permutation, Masking, and Attention Dynamics

MHQA performance is highly sensitive to:

  • Order of Supporting Documents: Arranging retrieved passages to mirror the reasoning chain (i.e., forward order) consistently improves exact-match accuracy in both encoder-decoder and causal decoder-only models (e.g., Flan-T5, Qwen 2.5, Llama 3.x).
  • Masking Strategy: Adding bidirectional attention over input context in decoder-only LMs bridges the gap to encoder-decoder models and reduces susceptibility to passage reordering (Huang et al., 16 May 2025).
  • Attention Weights: Correct predictions concentrate attention on the last-hop document, while incorrect answers exhibit a more uniform, “diffuse” attention pattern. This property can be exploited heuristically for answer reranking by maximizing peak attention over supporting documents, as sketched below.
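
The sketch below reranks passage orderings by attention peakedness. Since gold support is unavailable at inference time, it uses the sharpness of answer-to-passage attention as a proxy for last-hop concentration; `score_fn` is a hypothetical wrapper around a model run with attention outputs, and exhaustive permutation is only practical for a handful of passages.

```python
from itertools import permutations

import numpy as np

def rerank_by_peak_attention(passages, question, score_fn):
    """Pick the ordering whose answer tokens attend most sharply to one passage.

    score_fn(question, ordered_passages) -> (answer, attn), where attn is a
    1-D array of total answer-to-passage attention mass per passage.
    """
    best_peak, best_answer = -1.0, None
    for order in permutations(range(len(passages))):
        ordered = [passages[i] for i in order]
        answer, attn = score_fn(question, ordered)
        peak = float(np.max(attn) / (np.sum(attn) + 1e-9))  # peakedness in [0, 1]
        if peak > best_peak:
            best_peak, best_answer = peak, answer
    return best_answer
```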

3.3 Embedding-Level Iteration and Efficiency

TreeHop bypasses costly LLM-assisted query rewriting by operating purely in embedding space:

  • Query and supporting chunk embeddings are fused through subtraction and a trainable UpdateGate that performs cross-attention, generating next-hop queries (a minimal gate is sketched after this list).
  • Simple rule-based pruning (duplicate-chunk removal and layer-wise top-$K$ selection) keeps search complexity linear in the number of hops, with a 99% reduction in query latency compared to LLM-based iterative retrieval, while preserving or exceeding multi-hop recall (Li et al., 28 Apr 2025).
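
A minimal PyTorch rendering of such a gate follows the high-level description above; the actual TreeHop gate may be parameterized differently.

```python
import torch
import torch.nn as nn

class UpdateGate(nn.Module):
    """Next-hop query generation in embedding space (TreeHop-style sketch)."""
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        # dim must be divisible by n_heads
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, q_emb: torch.Tensor, chunk_emb: torch.Tensor) -> torch.Tensor:
        # q_emb, chunk_emb: (batch, dim)
        residual = q_emb - chunk_emb                   # drop what the chunk covers
        attended, _ = self.attn(q_emb.unsqueeze(1),    # query attends to the chunk
                                chunk_emb.unsqueeze(1),
                                chunk_emb.unsqueeze(1))
        return self.proj(residual + attended.squeeze(1))  # next-hop query embedding
```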

3.4 Structural–Semantic Fusion via Hypergraph

HGRAG achieves cross-granularity retrieval by constructing an entity–passage hypergraph and diffusing both entity- and passage-level similarities. The core innovation is to blend dense semantic retrieval with fine-grained structural association via hypergraph Laplacian diffusion and post-hoc semantic/structural enhancement, yielding superior recall and 6×–40× speedups over prior graph-based RAG systems (Wang et al., 15 Aug 2025).
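
To convey the flavor of this diffusion, the sketch below runs a simplified two-step random walk (passage → entity → passage) over a binary incidence matrix, seeded with dense-retrieval scores. HGRAG's normalized Laplacian, learned weights, and enhancement terms are omitted.

```python
import numpy as np

def hypergraph_diffusion(H, seed, steps=2, alpha=0.5):
    """Diffuse retrieval scores over an entity-passage incidence structure.

    H    : (n_passages, n_entities) binary incidence matrix
    seed : (n_passages,) initial dense-retrieval scores for the query
    """
    # Degree-normalized two-step walk: passage -> entity -> passage
    Dp = H.sum(axis=1, keepdims=True) + 1e-9   # passage degrees
    De = H.sum(axis=0, keepdims=True) + 1e-9   # entity degrees
    W = (H / Dp) @ (H / De).T                  # passage-to-passage transitions
    scores = seed.copy()
    for _ in range(steps):
        scores = alpha * seed + (1 - alpha) * W @ scores  # restart at the seed
    return scores
```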

4. Evaluation Metrics and Empirical Results

Assessment of MHQA systems demands metrics aligned to multi-step reasoning; a reference computation of the core answer metrics follows the table:

| Metric | Definition/Mode |
| --- | --- |
| Answer EM/F1 | Exact match / token-level F1 on the final answer |
| Supporting Fact EM/F1 | EM/F1 on retrieval of gold supporting sentences/facts |
| Joint EM/F1 | Both answer and supporting facts must be correct in a single prediction |
| Multi-hop Recall@k | Fraction of gold supporting docs retrieved at each hop, averaged |
| Reasoning Chain Faithfulness | Rate at which the model correctly answers decomposed sub-questions |
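
For reference, a standard SQuAD-style computation of answer EM/F1; the supporting-fact and joint variants reuse the same token-overlap F1 over predicted versus gold supporting facts.

```python
import re
from collections import Counter

def normalize(s: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    s = re.sub(r"\b(a|an|the)\b", " ", s.lower())
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", s).split())

def em_f1(pred: str, gold: str) -> tuple[float, float]:
    p, g = normalize(pred), normalize(gold)
    if p == g:
        return 1.0, 1.0                        # exact match implies F1 = 1
    pt, gt = p.split(), g.split()
    common = sum((Counter(pt) & Counter(gt)).values())
    if common == 0:
        return 0.0, 0.0
    prec, rec = common / len(pt), common / len(gt)
    return 0.0, 2 * prec * rec / (prec + rec)

# em_f1("the Eiffel Tower", "Eiffel Tower") -> (1.0, 1.0)
```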

ReSCORE achieves 23.4/32.7 EM/F1 on MuSiQue, outperforming BM25+Llama-3B and Contriever+Llama by substantial margins in the iterative RAG schema (Lee et al., 27 May 2025). Encoder-decoder models (Flan-T5) consistently outperform parameter-matched decoder-only LMs in MHQA, especially when passage order mirrors reasoning steps (Huang et al., 16 May 2025).

Peak Information Contribution (IC) in LM attention, when used to rerank passage permutations, improves Qwen-7B Answer-Only accuracy from 28.6% to 33.7% on MuSiQue (Huang et al., 16 May 2025).

5. Open Problems and Prospects

Despite substantial advances, several key challenges in MHQA remain:

  • Query Drift and Reformulation: Large semantic and lexical shifts in queries during multi-hop reasoning complicate retriever and generator alignment across hops.
  • Label-Efficiency and Scalability: Fully label-free, multi-hop–aware training objectives (e.g., ReSCORE) reduce engineering bottlenecks but require LLM supervision at scale.
  • Out-of-Domain Generalization: Models fine-tuned on a particular corpus or question style may not generalize to unseen MHQA formats or domains (Lee et al., 27 May 2025).
  • Attention Attribution and Explainability: Bridging token-level attention with explicit reasoning chains remains nontrivial, though grouped attention metrics show promise for debugging and answer validation.
  • Efficiency/Latency: Embedding-space approaches (TreeHop) and non-iterative hypergraph diffusion (HGRAG) drastically reduce latency but may require corpus-specific tuning for optimal trade-offs (Li et al., 28 Apr 2025, Wang et al., 15 Aug 2025).

6. Best Practices and Empirical Recommendations

  • Employ encoder-decoder architectures (e.g., Flan-T5) or adapt decoder-only LMs with input-side bidirectional masking for robust multi-hop context aggregation.
  • Order retrieved supporting documents to match the reasoning chain (forward order) during both finetuning and inference, as this consistently improves accuracy (Huang et al., 16 May 2025).
  • Exploit LLM-based pseudo-labels that combine both query relevance and answer consistency; this targets both evidence sufficiency and answer faithfulness in retriever optimization (Lee et al., 27 May 2025).
  • Use grouped attention analysis to monitor model reliance on correct multi-hop chains—peak IC is a reliable saliency signal for gold hops.
  • Incorporate dynamic, rule-based pruning in embedding-based iterative retrieval systems to balance search coverage and computational cost (Li et al., 28 Apr 2025).

7. Future Research Directions

  • End-to-end learning of structural–semantic weights in hypergraph retrievers (a learned $W_p$), dynamic adaptation of diffusion steps, and hierarchical entity-type incorporation are open areas (Wang et al., 15 Aug 2025).
  • Further reductions in LLM supervision cost by distilling pseudo-labels or approximating relevance-consistency signals using lighter-weight teachers (Lee et al., 27 May 2025).
  • Integration of uncertainty-guided memory and joint retriever–reader training for improved global context utilization (Sagirova et al., 2023).

MHQA remains a driving benchmark for compositional reasoning over distributed evidence. Advances in dense retriever training, hypergraph-based semantic–structural fusion, hierarchical graph attention, and LLM-driven iterative reasoning collectively define the state of the art, yet full generalization, explainability, and efficiency are ongoing research foci.
