
Multi-hop Question Answering

Updated 18 January 2026
  • Multi-hop Question Answering is a process that links multiple evidence pieces to answer complex queries through compositional reasoning.
  • Frameworks such as graph-based reasoning, iterative retrieval, and neural decomposition provide diverse architectures for multi-step inference.
  • Recent advances emphasize interpretability, robustness, and dynamic adaptation by integrating symbolic, neural, and LLM-based methodologies.

Multi-hop Question Answering (MHQA) is the task of producing an answer to a natural language question that cannot be resolved by a single fact lookup, but instead requires a system to retrieve, connect, and reason over multiple pieces of evidence drawn from diverse and possibly distributed sources. MHQA has emerged as a fundamental challenge at the intersection of natural language understanding, retrieval-augmented generation, and complex logical reasoning. This article surveys formal definitions, methodological frameworks, key architectures, advances in interpretability and evaluation, and empirical insights from recent research, focusing on the technical dimensions and current state-of-the-art.

1. Formal Definition and Problem Scope

Let $\mathcal{S}$ be the set of all questions and $\mathcal{C}$ the universe of contexts (sentences, passages, or knowledge graph triples), with $\mathcal{A}$ the set of possible answers. For a given question $q \in \mathcal{S}$ and a context set $C \subseteq \mathcal{C}$, the multi-hop QA objective is to compute

$$f : \mathcal{S} \times \mathcal{C}^n \longrightarrow \mathcal{A} \cup \{\Phi\}$$

such that

$$f(q, C) = \begin{cases} a \in \mathcal{A} & \exists\, P_q = \{p_1, \dots, p_k\} \subseteq C,\ |P_q| > 1,\ P_q \models (a \text{ answers } q) \\ \Phi & \text{otherwise} \end{cases}$$

where $P_q$ is a reasoning chain—a set (or sequence) of supporting facts—whose joint entailment yields the correct answer (Mavi et al., 2022).

MHQA thus generalizes single-hop QA ($k = 1$), demanding models capable of multi-step, compositional reasoning over non-local context.
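
To make the definition concrete, here is a minimal sketch in which contexts are (subject, relation, object) triples and entailment is approximated by following a relation chain; the facts, function names, and chain-following logic are illustrative toys, not part of the formalism in (Mavi et al., 2022):

```python
from typing import Optional

# Toy context C: each fact is a (subject, relation, object) triple.
FACTS = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "London"),
]

def multi_hop_answer(start_entity: str, relations: list[str],
                     facts: list[tuple[str, str, str]]) -> Optional[str]:
    """Follow a chain of relations from start_entity and return the final
    entity only if more than one supporting fact was used, mirroring the
    |P_q| > 1 condition; otherwise return None (the Phi case)."""
    entity, used = start_entity, 0
    for rel in relations:
        nxt = next((o for s, r, o in facts if s == entity and r == rel), None)
        if nxt is None:
            return None  # chain breaks: no entailing fact set exists
        entity, used = nxt, used + 1
    return entity if used > 1 else None  # single-hop lookups map to Phi

# "Where was the director of Inception born?" requires two hops:
print(multi_hop_answer("Inception", ["directed_by", "born_in"], FACTS))  # → London
```

A single-hop query such as `["directed_by"]` returns `None` here, since by the definition above the task only covers questions whose reasoning chain spans more than one fact.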

2. Methodological Taxonomy and Model Architectures

MHQA frameworks can be divided into several broad classes, each with specific design principles and empirical trade-offs:

Graph-Based Reasoning

A core family of MHQA models constructs a heterogeneous or hierarchical graph $G = (V, E)$, where nodes are entities, sentences, paragraphs, or question tokens, and edges encode surface overlap, coreference, or document structure. Information is propagated across this graph using GNNs (Graph Neural Networks), such as GCN, GAT, or specialized architectures like Hierarchical Graph Networks (HGN) (He et al., 2023).

For example, Graph Attention with Hierarchies (GATH) introduces a four-level node hierarchy (question $\to$ paragraph $\to$ sentence $\to$ entity), sequentially updating node representations to mirror human inference steps, and directly connecting the question node to sentence nodes to accelerate multi-hop propagation. A multi-task loss jointly supervises answer span extraction and supporting-sentence prediction, $\mathcal{L}_{joint} = \mathcal{L}_{start} + \mathcal{L}_{end} + \sum_{i=1}^{4} \lambda_i \mathcal{L}_i$, with level-specific weightings $\lambda_i$ (He et al., 2023).
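
The joint objective can be sketched numerically as below; the cross-entropy helper and the example level weights are illustrative placeholders rather than the paper's implementation:

```python
import math

def cross_entropy(probs: list[float], gold: int) -> float:
    """Negative log-likelihood of the gold index under a probability vector."""
    return -math.log(probs[gold])

def gath_joint_loss(start_probs, end_probs, start_gold, end_gold,
                    level_losses, level_weights):
    """L_joint = L_start + L_end + sum_i lambda_i * L_i, where level_losses
    are the four per-level supervision terms (question, paragraph,
    sentence, entity) and level_weights are the lambda_i."""
    assert len(level_losses) == len(level_weights) == 4
    span_loss = (cross_entropy(start_probs, start_gold)
                 + cross_entropy(end_probs, end_gold))
    return span_loss + sum(w * l for w, l in zip(level_weights, level_losses))

# Toy span distributions over 3 tokens, plus made-up per-level losses:
loss = gath_joint_loss([0.7, 0.2, 0.1], [0.1, 0.8, 0.1], 0, 1,
                       level_losses=[0.5, 0.4, 0.3, 0.2],
                       level_weights=[1.0, 1.0, 0.5, 0.5])
print(round(loss, 3))  # → 1.73
```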

Iterative Retrieval and Reading

Alternating retrieval and reading models, such as From Easy to Hard (FE2H), mimic the human strategy of first anchoring on the easiest supporting fact (easy) and then conditioning retrieval of subsequent facts (hard) on prior discoveries (Li et al., 2022). FE2H applies document selection and reader modules in two explicit stages, with the first reader initialized by single-hop data and then fine-tuned for multi-hop. This approach outperforms prior baselines in HotpotQA distractor settings.
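
A toy sketch of the easy-then-hard selection strategy, using word overlap as a stand-in for FE2H's learned document scorer (all data and function names are illustrative):

```python
def overlap_score(query: str, doc: str) -> int:
    """Toy relevance score: number of shared lowercase word types."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def select_easy_then_hard(question: str, docs: list[str]) -> list[str]:
    """Stage 1: pick the document most similar to the question (the 'easy'
    hop). Stage 2: re-score the remaining documents against the question
    plus the first selection, so the 'hard' hop is conditioned on what was
    already discovered."""
    easy = max(docs, key=lambda d: overlap_score(question, d))
    rest = [d for d in docs if d is not easy]
    hard = max(rest, key=lambda d: overlap_score(question + " " + easy, d))
    return [easy, hard]

question = "Where was the director of Inception born"
docs = [
    "Inception is a film directed by Christopher Nolan",
    "Christopher Nolan was born in London",
    "Paris hosted the 1900 Olympics",
]
selected = select_easy_then_hard(question, docs)
print(selected)
```

The second-stage query expansion is the key idea: the distractor document scores poorly once the first retrieved document's entities are added to the query.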

Models like Semantic Sentence Composition and Reasoning (MSSM+FSC) augment retrieval by sequentially growing the set of supporting sentences and composing them to form inference bridges, leveraging both entity-overlap retrieval and semantic re-ranking to maximize contextual coverage (Chen, 2022).

Programmatic and Neural Decomposition

Question decomposition approaches infer a sequence of single-hop sub-questions from the original complex question. Neural decomposition models automatically learn split points and rewrite rules (e.g., pointer networks), yielding sub-questions that can be answered by single-hop extractors. Sub-answers are then aggregated to form the final answer (Tang et al., 2020, Deng et al., 2022).

Frameworks such as QDAMR use Abstract Meaning Representation (AMR) parsing to segment the question into subgraphs corresponding to sub-questions, generating natural-language sub-questions via a BART-based AMR-to-text generator. Sub-answers are combined based on the reasoning type (bridge, intersection, comparison) (Deng et al., 2022).
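
The type-conditioned aggregation step can be sketched as follows; the per-type logic is an assumption for illustration, not QDAMR's implementation:

```python
def aggregate(reasoning_type: str, sub_answers: list[str]) -> str:
    """Combine single-hop sub-answers according to the reasoning type
    (bridge / intersection / comparison split described above)."""
    if reasoning_type == "bridge":
        # Earlier sub-answers fill slots in later sub-questions; the last
        # sub-answer is the final answer.
        return sub_answers[-1]
    if reasoning_type == "intersection":
        # Each sub-answer is a comma-separated candidate set; intersect them.
        sets = [set(a.split(", ")) for a in sub_answers]
        return ", ".join(sorted(set.intersection(*sets)))
    if reasoning_type == "comparison":
        # Sub-answers are comparable values; here, pick the larger number.
        return max(sub_answers, key=float)
    raise ValueError(f"unknown reasoning type: {reasoning_type}")

print(aggregate("bridge", ["Christopher Nolan", "London"]))  # → London
print(aggregate("comparison", ["1970", "1946"]))             # → 1970
```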

Question Generation for Interpretability

An alternative to static decomposition is to generate follow-up questions dynamically as partial information is collected. Malon & Bai propose a model in which intermediate questions are generated via a pointer-generator network based on the initial context and retrieved evidence, each then answered by a pretrained single-hop extractor, resulting in interpretable chains (Malon et al., 2020). "Ask to Understand" integrates a QA branch and a question generation (QG) branch end-to-end, improving both interpretability and accuracy by enforcing alignment via sub-question generation (Li et al., 2022).
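
The dynamic question-generation loop can be sketched as below, with toy lookup-table stand-ins for the learned QG and single-hop QA components:

```python
def answer_multi_hop(question, context, generate_followup, single_hop_qa,
                     max_hops=4):
    """Interleave follow-up question generation and single-hop answering,
    recording the (sub-question, sub-answer) chain for interpretability."""
    chain, current = [], question
    for _ in range(max_hops):
        ans = single_hop_qa(current, context)
        chain.append((current, ans))
        nxt = generate_followup(question, chain)
        if nxt is None:  # generator signals the chain is complete
            return ans, chain
        current = nxt
    return chain[-1][1], chain

# Toy stand-ins for the learned components:
facts = {"Who directed Inception?": "Christopher Nolan",
         "Where was Christopher Nolan born?": "London"}

def single_hop_qa(q, ctx):
    return facts.get(q, "?")

def generate_followup(original, chain):
    if len(chain) == 1:
        return f"Where was {chain[0][1]} born?"
    return None  # stop after the second hop

ans, chain = answer_multi_hop("Who directed Inception?", None,
                              generate_followup, single_hop_qa)
print(ans)  # → London
```

The returned `chain` is the interpretable artifact: each hop exposes the intermediate question asked and the evidence-grounded sub-answer obtained.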

LLM-based Modular and Multi-agent Paradigms

Recent advances with LLMs have catalyzed hybrid frameworks that combine multiple reasoning operators (e.g., chain-of-thought, single-step, iterative-step, sub-step, adaptive-step). The BELLE framework (Zhang et al., 17 May 2025) formulates multi-hop QA as a multi-agent, two-level debate: LLM “agents” select and compose operators optimally based on learned question-type-taxonomies, with explicit slots for debate and "judge" agents. This modularization leads to increased performance and cost-effectiveness, with dynamic adaptation to question type.

3. Reasoning Path Faithfulness and Explainability

A fundamental challenge in MHQA is ensuring that answer predictions reflect faithful multi-step reasoning rather than shortcuts via spurious correlations or dataset artifacts. Tang et al. (Tang et al., 2020) show that state-of-the-art models often answer the overall multi-hop question correctly while failing on its sub-questions, revealing a deficit in true compositional reasoning: only about 50–60% of questions with a correct final answer had all of their constituent sub-questions answered correctly.
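
The consistency statistic reported above can be computed as follows (the record field names are illustrative):

```python
def chain_consistency(records: list[dict]) -> float:
    """Fraction of questions answered correctly whose sub-questions were
    ALSO all answered correctly -- the gap diagnosed by Tang et al."""
    correct = [r for r in records if r["final_correct"]]
    if not correct:
        return 0.0
    faithful = [r for r in correct if all(r["sub_correct"])]
    return len(faithful) / len(correct)

records = [
    {"final_correct": True,  "sub_correct": [True, True]},
    {"final_correct": True,  "sub_correct": [True, False]},  # likely shortcut
    {"final_correct": False, "sub_correct": [True, True]},
]
print(chain_consistency(records))  # → 0.5
```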

Prompt-based Conservation Learning (PCL) (Deng et al., 2022) addresses this by freezing a single-hop pretrained model and augmenting it with small, type-conditioned subnetworks—a soft-prompt mechanism—to preserve single-hop capabilities while subsequently learning multi-hop skills. PCL demonstrates significantly reduced drop between final and sub-question F1 scores, indicating mitigation of catastrophic forgetting that typically plagues sequential task transfer.

PathFinder (Maram et al., 5 Dec 2025) further enforces faithfulness via Monte Carlo Tree Search: it explicitly explores diverse reasoning paths, filters for those with perfect sub-answer recall and LLM "judge" validation, and trains LLMs on the resulting high-quality explicit traces.

Explainability is enhanced by frameworks incorporating step-wise question generation and explicit grounding; e.g., “Locate Then Ask” (StepReasoner) grounds each sub-question in identified supporting sentences before advancing to the next hop (Wang et al., 2022). AMR-based QDAMR yields well-formed, context-confirmed sub-questions and captures a range of multi-hop dependencies (Deng et al., 2022).

4. Datasets, Evaluation Protocols, and Human Competence

Table: Exemplary Datasets and Reasoning Patterns

| Dataset | Reasoning Types | Supporting Chain Annotation |
|---|---|---|
| HotpotQA | Bridge, comparison | Yes, sentence-level |
| 2WikiMultiHopQA | n-hop, entity chaining | Yes, relation/entity chain |
| WikiHop | k-hop KB traversal | Yes, passage list |
| QASC | Science, 2-hop | Yes, fact-pair alignment |
| MuSiQue | Multi-hop (2–4) | Yes, explicit steps |

Evaluation combines exact match (EM), token-level F1 for span or answer selection, supporting-fact precision/recall (joint and marginal), and, increasingly, sub-question or sub-answer fidelity. Recent work incorporates per-hop accuracy, chain-of-thought faithfulness, and human-in-the-loop auditing.
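
The span-level metrics can be sketched as follows; this is a simplified version of the usual SQuAD-style scoring (official evaluation scripts also strip articles and normalize whitespace more carefully):

```python
import re
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and tokenize on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def exact_match(pred: str, gold: str) -> int:
    """EM: 1 iff the normalized token sequences are identical."""
    return int(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 over the multiset overlap of prediction and gold."""
    p, g = Counter(normalize(pred)), Counter(normalize(gold))
    overlap = sum((p & g).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(p.values()), overlap / sum(g.values())
    return 2 * precision * recall / (precision + recall)

print(exact_match("Eiffel Tower!", "eiffel tower"))          # → 1
print(round(token_f1("Eiffel Tower", "the Eiffel Tower"), 3))  # → 0.8
```

Supporting-fact precision/recall is computed the same way but over sets of predicted versus gold sentence IDs rather than answer tokens.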

Su et al. provide a rigorous human benchmark, decomposing MHQA into subtasks—query type recognition, decomposition, single-hop QA, answer integration—and quantifying human abilities per component (Su et al., 6 Oct 2025). Notably, humans excelled at answer integration (97.3% accuracy) but struggled with meta-recognition of multi-hop need (67.9% accuracy), suggesting human–AI system designs should allocate meta-reasoning and decomposition to automated modules while leveraging humans for fact extraction and synthesis.

5. Recent Advances in Robustness, Adaptivity, and Knowledge Editing

Models are increasingly tested against adversarial distractors and noisy contexts, with techniques such as label smoothing (F1 smoothing and the Linear Decay Label Smoothing Algorithm, LDLA) showing measurable improvements in HotpotQA EM and F1 (Yin et al., 2022). PathFinder's path-level filtering confers robustness to hallucinations and noisy retrieval (Maram et al., 5 Dec 2025).

Knowledge editing in the multi-hop regime requires that models update all inferentially downstream responses after fact changes, avoiding cascading errors. PokeMQA (Gu et al., 2023) achieves this by decoupling sub-question decomposition from correctness conflict detection, employing a two-stage scope detector external to the LLM. This design allows hundreds or thousands of factual edits to be efficiently incorporated into multi-hop chains without large-scale retraining, preserving both accuracy and chain faithfulness.
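
A heavily simplified sketch of the scope-detection idea follows; PokeMQA's detector is a learned two-stage classifier, so the substring match and all names here are illustrative stand-ins, not the paper's method:

```python
def serve_with_edits(question: str, base_model, edits: dict[str, str]) -> str:
    """If a (sub-)question falls in the scope of an edited fact, answer
    from the external edit memory; otherwise fall through to the frozen
    base model. Keeping the detector outside the LLM means edits apply
    without any retraining."""
    q = question.lower()
    for scope_phrase, edited_answer in edits.items():
        if scope_phrase in q:
            return edited_answer
    return base_model(question)

# Edit memory for a fictional fact, applied without touching the model:
edits = {"capital of atlantis": "Poseidonis"}
base_model = lambda q: "unknown"

print(serve_with_edits("What is the capital of Atlantis?", base_model, edits))  # → Poseidonis
print(serve_with_edits("Who wrote Hamlet?", base_model, edits))                 # → unknown
```

In the multi-hop regime, each sub-question in the decomposed chain is routed through this check, so a single edit correctly propagates to all inferentially downstream hops.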

6. Directions for Future Research

Key open directions, as articulated by multiple works (Mavi et al., 2022, Zhang et al., 17 May 2025, Maram et al., 5 Dec 2025), include:

  • Dynamic any-hop architectures capable of adjusting hop count and operator sequence adaptively rather than a priori.
  • Explainable full-chain prediction, with supervised explicit supporting fact, sub-question, and sub-answer annotation.
  • Unified performance evaluations spanning answer correctness, chain fidelity, per-hop accuracy, and logical faithfulness.
  • Improved knowledge integration, especially for implicit connections (commonsense, counts, arithmetic, or comparative reasoning).
  • Hybrid systems, dynamically aligning agent capabilities (human, retriever, generator, verifier) to the weaknesses and strengths revealed by empirical subtasks (Su et al., 6 Oct 2025).
  • Extensible editability and fact update, supporting robust and secure integration of dynamic world knowledge (Gu et al., 2023).

Recent evidence demonstrates that integrating explicit reasoning chains, robust retriever–reader pipelines, modular operator planning, and external knowledge management results in both state-of-the-art accuracy and highly interpretable, faithful multi-hop question answering systems. Cumulative advances point to a convergence of symbolic, neural, and interactively organized hybrid paradigms as the leading edge of this foundational task.
