Single-hop & Multi-hop QA

Updated 16 May 2026

Single-hop QA is defined as extracting an answer from a single evidence passage using direct span prediction, emphasizing reading comprehension with models like BERT.
Multi-hop QA requires synthesizing multiple evidence sources through iterative retrieval, decomposition, and graph-based reasoning to answer complex questions.
Both approaches face challenges such as error propagation, semantic misalignment, and reliance on adversarially constructed datasets, guiding innovations in model architectures and evaluation.

Single-hop and multi-hop question answering (QA) represent distinct paradigms in automated and human information seeking and reasoning, distinguished by the number of inference steps—or “hops”—required to integrate evidence sources and produce an answer. In single-hop QA, an answer is derived from a single atomic fact retrieved from one passage or source; in multi-hop QA, the answer emerges from synthesizing two or more facts—often spread across multiple documents—via a chain of intermediate reasoning comprising reading comprehension, logical inference, and knowledge integration. The latter imposes higher cognitive and algorithmic demands, introduces unique sources of error, and has catalyzed a wealth of research on pipelines, model architectures, dataset construction, and evaluation methodologies.

1. Core Definitions and Distinctions

Single-hop QA entails mapping a question $q$ and a local evidence context $C$ (typically a paragraph or document) to an answer $a$ such that $a$ can be extracted directly from $C$ or inferred in a single reasoning step. Canonical datasets include SQuAD, NewsQA, and TriviaQA. The task can be stated formally as $a^* = \arg\max_{a \subseteq C} P(a \mid q, C)$, where $a$ ranges over candidate spans of $C$. Extraction and span prediction dominate; the challenge is primarily reading comprehension.
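The argmax over spans can be illustrated with a minimal sketch. It assumes, as is common in BERT-style readers but not stated in the definition above, that the span score decomposes into a start score plus an end score. The function name `best_span`, the `max_len` cap, and the score values are all hypothetical.

```python
from typing import List, Tuple

def best_span(start_scores: List[float], end_scores: List[float],
              max_len: int = 10) -> Tuple[int, int]:
    """Return the (start, end) token indices with the highest combined score.

    Assumes score(i, j) = start_scores[i] + end_scores[j] with i <= j and a
    span of at most max_len tokens; this factorization is an assumption.
    """
    best = (0, 0)
    best_score = float("-inf")
    for i, start in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = start + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best

# Context C, tokenized on whitespace (9 tokens).
context = "The Great Gatsby was written by F. Scott Fitzgerald".split()
# Made-up scores that a reader might assign for "Who wrote The Great Gatsby?"
start_scores = [0.1, 0.0, 0.0, 0.0, 0.0, 0.2, 2.5, 0.3, 0.1]
end_scores = [0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.2, 0.4, 3.0]

i, j = best_span(start_scores, end_scores)
answer = " ".join(context[i:j + 1])
print(answer)  # F. Scott Fitzgerald
```

The exhaustive double loop is kept for clarity; it makes the constraint $a \subseteq C$ explicit, since only contiguous token ranges of the context are ever considered.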

Multi-hop QA requires traversing a reasoning chain involving $T \geq 2$ hops:

$Q_1 \rightarrow Q_2 \rightarrow \dots \rightarrow Q_T$
Each $Q_t$ is a sub-question whose resolution requires information from the previous hop's answer $a_{t-1}$ or supporting fact.
The final answer $a_T$ synthesizes facts from at least two distinct contexts.

Examples:

Single-hop: “Who wrote The Great Gatsby?”
Multi-hop: “Where was the spouse of the third Prime Minister of Canada born?”

Multi-hop QA tasks necessitate decomposition (splitting the complex query into simpler sub-questions), iterative retrieval, reasoning over distributed evidence, and final answer synthesis (Su et al., 6 Oct 2025, Min et al., 2019, Trivedi et al., 2021).
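The chaining of hops, where each sub-question consumes the previous answer, can be sketched as follows. This is a toy illustration, not any cited system: the dictionary `FACTS` stands in for a retriever plus reader, and the relation names and both function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Toy fact store keyed by (relation, entity); a stand-in for retrieval + reading.
FACTS: Dict[Tuple[str, str], str] = {
    ("author_of", "The Great Gatsby"): "F. Scott Fitzgerald",
    ("birthplace_of", "F. Scott Fitzgerald"): "Saint Paul, Minnesota",
}

def answer_single_hop(relation: str, entity: str) -> str:
    """Resolve one hop by direct lookup of a single atomic fact."""
    return FACTS[(relation, entity)]

def answer_multi_hop(chain: List[str], seed_entity: str) -> List[str]:
    """Resolve relations in order, feeding the answer of hop t-1 into hop t.

    Returns every intermediate answer so the reasoning chain can be inspected.
    """
    answers: List[str] = []
    current = seed_entity
    for relation in chain:
        current = answer_single_hop(relation, current)
        answers.append(current)
    return answers

# "Where was the author of The Great Gatsby born?" decomposed into two hops.
trace = answer_multi_hop(["author_of", "birthplace_of"], "The Great Gatsby")
print(trace)  # ['F. Scott Fitzgerald', 'Saint Paul, Minnesota']
```

Returning the full trace rather than only the last answer keeps the intermediate entities visible, which is what makes errors at an early hop detectable.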

Key differences:

Cognitive/algorithmic load: Multi-hop QA requires planning and chaining, is exposed to errors that compound over hops, and demands more complex integration of evidence.
System design: Single-hop architectures rely on span extractors; multi-hop pipelines require iterative retrieval, decomposition, knowledge integration, and mechanisms to track intermediate entities or facts.

2. Dataset Construction and Benchmarking

Early multi-hop QA benchmarks (e.g., HotpotQA, 2WikiMultihopQA) attempted to enforce compositional reasoning by crowdsourcing questions designed to require multiple paragraphs. Analyses revealed, however, that compositionality does not guarantee genuine multi-hop reasoning; single-hop models can answer a large fraction of nominally multi-hop examples due to weak distractors or redundant evidence (Min et al., 2019). For instance, a BERT-based single-hop model achieves 67.1 F1 on HotpotQA’s distractor validation set—rivaling multi-hop-specific models.
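For readers unfamiliar with the F1 figures quoted for these benchmarks, the following sketch shows a token-overlap F1 between a predicted and a gold answer. It is a simplified version: normalization here is only lowercasing and whitespace splitting, whereas official evaluation scripts typically also strip punctuation and articles. The function name `token_f1` is an assumption.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall (simplified)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Precision 2/2, recall 2/3, so F1 is 0.8 up to floating-point rounding.
score = token_f1("Scott Fitzgerald", "F. Scott Fitzgerald")
print(round(score, 3))  # 0.8
```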

MuSiQue-Ans introduced a bottom-up construction paradigm: single-hop RC instances are composed into directed acyclic graphs, guaranteeing that every hop’s answer depends on a predecessor by masking sub-answers and controlling context overlap. This filtering—combined with adversarial distractors and unanswerable contrast pairs (MuSiQue-Full)—enforces connected multi-hop reasoning and virtually eliminates shortcut solutions, exposing a larger human–machine gap (Trivedi et al., 2021).
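The bottom-up idea can be sketched in a few lines. This is a deliberately simplified illustration and not the MuSiQue pipeline itself: it only checks that the second question mentions the first answer and then masks that mention. The `SingleHop` class, the `compose` function, and the bracket notation are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SingleHop:
    question: str
    answer: str

def compose(first: SingleHop, second: SingleHop) -> Optional[SingleHop]:
    """Compose two single-hop items into one two-hop item, or return None.

    Composition is allowed only if the second question depends on the first
    answer. The mention is then masked so that the second hop cannot be
    answered without first resolving the first.
    """
    if first.answer not in second.question:
        return None  # no dependency between the hops
    if first.answer == second.answer:
        return None  # degenerate: the chain would not add a reasoning step
    reference = "[" + first.question.rstrip("?") + "]"
    masked_question = second.question.replace(first.answer, reference)
    return SingleHop(question=masked_question, answer=second.answer)

hop1 = SingleHop("Who wrote The Great Gatsby?", "F. Scott Fitzgerald")
hop2 = SingleHop("Where was F. Scott Fitzgerald born?", "Saint Paul, Minnesota")

two_hop = compose(hop1, hop2)
print(two_hop.question)  # Where was [Who wrote The Great Gatsby] born?
```

A real construction pipeline would additionally rewrite the bracketed reference into fluent text and filter on context overlap and distractors, as described above.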

In the scientific domain, AIM-SciQA extracts hundreds of thousands of single-hop QAs from PMC articles and composably links them into over 13,000 multi-hop pairs—using both semantic similarity and explicit citation graphs—to differentiate retrieval and synthesis abilities under both oracle and realistic settings (Lee et al., 15 Mar 2026).

3. Model Architectures: From Single-hop to Multi-hop

Single-hop QA: Span-extraction models (BiDAF, BERT- or ELECTRA-based) dominate, leveraging token-level attention and direct answer prediction over the concatenated question and context.

Multi-hop QA:

Pipeline approaches: Iterative retrieval and reasoning pipelines, such as paragraph retrievers coupled with multi-task readers (answer and supporting fact prediction), can be trained with separate losses at each hop (Feldman et al., 2019).
Graph-based models: Graph neural networks (GNNs) and entity-centric architectures explicitly model relationships between entities, sentences, and documents to enable evidence aggregation (Li et al., 2022, Jiang et al., 2019, Ramesh et al., 2023).
Modular and decomposable networks: Self-assembling modular networks assemble “Find,” “Relocate,” “Compare,” and “NoOp” primitives dynamically per-question, with controllers softly decomposing queries into interpretable sub-questions (Jiang et al., 2019).
Generative and sequence-prediction models: Fusion-in-Decoder (FiD), PathFID, and SEQGRAPH model not only the final answer but also the stepwise reasoning path, outputting linearized sequences over passage titles, sentence indices, and answers, with integrated graph constraints enhancing faithfulness and interpretability (Yavuz et al., 2022, Ramesh et al., 2023).
Prompt-based and parameter-conserving models: Prompt-based Conservation Learning (PCL) freezes single-hop model backbones, appending learnable, type-specific prompts and lateral expansion to encode distinct multi-hop reasoning patterns without catastrophic forgetting of single-hop skills (Deng et al., 2022).
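To make the graph-based idea concrete, the sketch below links passages that share an entity and searches for a chain of passages connecting a question entity to an answer-bearing passage. It uses plain breadth-first search rather than a GNN, so it illustrates only the evidence-graph structure and not any cited model. All names and the toy passages are assumptions.

```python
from collections import deque
from typing import Dict, List, Optional, Set

def build_graph(passages: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Connect two passages whenever they mention at least one common entity."""
    graph: Dict[str, Set[str]] = {title: set() for title in passages}
    titles = list(passages)
    for idx, a in enumerate(titles):
        for b in titles[idx + 1:]:
            if passages[a] & passages[b]:
                graph[a].add(b)
                graph[b].add(a)
    return graph

def evidence_path(graph: Dict[str, Set[str]], start: str,
                  goal: str) -> Optional[List[str]]:
    """Breadth-first search for the shortest passage chain from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph[path[-1]]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Passage title -> entities mentioned in it (toy data).
passages = {
    "The Great Gatsby": {"The Great Gatsby", "F. Scott Fitzgerald"},
    "F. Scott Fitzgerald": {"F. Scott Fitzgerald", "Saint Paul"},
    "Saint Paul": {"Saint Paul", "Minnesota"},
    "Moby-Dick": {"Moby-Dick", "Herman Melville"},
}
graph = build_graph(passages)
path = evidence_path(graph, "The Great Gatsby", "Saint Paul")
print(path)  # ['The Great Gatsby', 'F. Scott Fitzgerald', 'Saint Paul']
```

The unrelated "Moby-Dick" passage plays the role of a distractor: it has no shared entity with the others and is therefore unreachable from the question entity.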

End-to-end question-generation (QG) models jointly learn to ask and answer explicit sub-questions as latent variables, improving both interpretability and answer robustness while mitigating error cascades inherent in straightforward question decomposition pipelines (Li et al., 2022, Malon et al., 2020).

4. Human and Machine Performance Characteristics

A comprehensive human study (Su et al., 6 Oct 2025) revealed the following:

Task	Human Accuracy (mean)
Single-hop QA	84.1% [76.9–89.7%]
Direct Multi-hop QA	80.2% [70.1–91.9%]
Query-Type Recognition	67.9% [51.3–77.8%]
Answer Integration	97.3% [94.1– — ]

Key findings:

Humans excel at knowledge integration (97.3%) but often fail to recognize whether a question requires multi-hop reasoning (67.9%).
Semantic-type mismatch, entity confusion, omission of required integration steps in decomposition, and rare synthesis errors are prevalent error types.
AI and hybrid AI–human systems benefit by assigning sub-tasks based on these strengths and weaknesses: machines automate complexity detection and decomposition; humans focus on nuanced reading and integration.

Multi-hop models, even when correctly answering the global question, often fail to answer the explicit sub-questions when evaluated with diagnostic metrics, which indicates that shortcut solutions persist without explicit stepwise reasoning supervision (Tang et al., 2020, Jiang et al., 2022). On MuSiQue, single-hop models suffer a 30-point F1 drop relative to multi-hop-aware models, highlighting task difficulty when genuine composition is enforced (Trivedi et al., 2021).
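One way such a diagnostic could be computed is sketched below: among questions whose final answer is correct, measure the share for which at least one sub-question was answered incorrectly. The metric name `shortcut_rate`, the record fields, and the sample records are hypothetical and are not taken from the cited studies.

```python
from typing import Dict, List

def shortcut_rate(records: List[Dict[str, bool]]) -> float:
    """Share of correctly answered questions with at least one failed sub-question.

    Each record needs two boolean fields (hypothetical schema):
      final_correct    - the multi-hop question was answered correctly
      all_subs_correct - every sub-question was answered correctly
    """
    correct_final = [r for r in records if r["final_correct"]]
    if not correct_final:
        return 0.0
    inconsistent = [r for r in correct_final if not r["all_subs_correct"]]
    return len(inconsistent) / len(correct_final)

# Made-up evaluation records for four questions.
records = [
    {"final_correct": True, "all_subs_correct": True},
    {"final_correct": True, "all_subs_correct": False},  # likely a shortcut
    {"final_correct": True, "all_subs_correct": True},
    {"final_correct": False, "all_subs_correct": False},
]
rate = shortcut_rate(records)
print(round(rate, 3))  # 0.333
```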

5. Systemic Challenges, Error Modes, and Design Recommendations

Shortcut problem: Many multi-hop QA benchmarks can be artificially solved by locating the unique entity of the desired type among distractors, thus circumventing true multi-hop reasoning (Min et al., 2019, Trivedi et al., 2021, Jiang et al., 2022).

Error propagation: Decomposition pipelines—where a complex question is partitioned into sub-questions—are susceptible to cascading errors and accumulation of noise, especially when sub-question boundaries are ambiguous or machine-generated sub-questions lack interpretive clarity (Tang et al., 2020, Li et al., 2022, Su et al., 6 Oct 2025).
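A back-of-the-envelope calculation shows why cascading matters. Assume, purely for illustration, that hop $t$ is resolved correctly with probability $p_t$ and that hops fail independently (real pipelines need not satisfy this). Then the whole chain is correct with probability

$P(\text{chain correct}) = \prod_{t=1}^{T} p_t$

With a uniform per-hop accuracy $p_t = 0.9$, this gives $0.9^2 = 0.81$ for $T = 2$, $0.9^3 = 0.729$ for $T = 3$, and $0.9^4 \approx 0.656$ for $T = 4$, so a reader that is right nine times out of ten per hop answers only about two thirds of four-hop chains correctly under these assumptions.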

Semantic misalignment: Humans and models alike make errors by returning an answer of the wrong semantic type (“where”/“when” swaps), or by confusion over similarly named entities (Su et al., 6 Oct 2025).

Adversarial robustness: Most architectures suffer significant accuracy drops when distractors are constructed adversarially to mask shortcut patterns or when unanswerable “contrast” examples are added (Trivedi et al., 2021, Li et al., 2022).

System design recommendations:

Automate complexity assessment to triage queries as single- or multi-hop (Su et al., 6 Oct 2025).
Decompose queries into explicit, semantically well-defined sub-questions, minimizing ambiguity (Su et al., 6 Oct 2025, Li et al., 2022, Malon et al., 2020).
Supervise or regularize intermediate sub-question answering (Jiang et al., 2022, Tang et al., 2020).
Freeze and conserve single-hop skills while expanding multi-hop capabilities to prevent forgetting (Deng et al., 2022).
For open-domain settings, prioritize joint retrieval+reasoning evaluation as retrieval often becomes the primary bottleneck (Feldman et al., 2019, Min et al., 2019, Lee et al., 15 Mar 2026).
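The first recommendation, triaging queries by complexity, can be sketched as a small router. The estimator below is a crude placeholder that counts nested "of the" relations; it is an assumption made for illustration only, and a practical system would presumably use a trained classifier. The function names and handler labels are likewise hypothetical.

```python
from typing import Callable, Dict

def estimate_hops(question: str) -> int:
    """Placeholder complexity estimate: one hop plus one per ' of the ' phrase.

    This heuristic is NOT from the cited work; it only stands in for a
    learned query-type recognizer.
    """
    return 1 + question.lower().count(" of the ")

def route(question: str, handlers: Dict[str, Callable[[str], str]]) -> str:
    """Send the question to the single-hop or multi-hop handler."""
    key = "single" if estimate_hops(question) == 1 else "multi"
    return handlers[key](question)

# Stub handlers that only report which pipeline was chosen.
handlers = {
    "single": lambda q: "single-hop reader",
    "multi": lambda q: "decompose + iterative retrieval",
}

simple = route("Who wrote The Great Gatsby?", handlers)
complex_ = route(
    "Where was the spouse of the third Prime Minister of Canada born?", handlers
)
print(simple)    # single-hop reader
print(complex_)  # decompose + iterative retrieval
```

Keeping the estimator behind a single function makes it easy to swap the heuristic for a learned model without touching the routing logic.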

6. Open Research Directions and Broader Implications

Developing scalable, trustworthy, and faithful multi-hop QA systems remains a major challenge:

Dataset design must enforce connected reasoning via rigorous construction and filtering (e.g., DAGs, masking, adversarial distractors, unanswerable contrast sets) (Trivedi et al., 2021, Lee et al., 15 Mar 2026).
Modeling advances include modular networks, prompt-tuning, graph-augmented generation, joint reasoning with sub-question supervision, and dynamic adaptation to variable-hop complexity (Ramesh et al., 2023, Deng et al., 2022, Li et al., 2022, Tang et al., 2020).
Hybrid and collaborative systems can exploit the complementary strengths of humans (integration, nuanced reading) and AI (retrieval, decomposition, consistency checking) for robust, high-accuracy pipelines (Su et al., 6 Oct 2025).
Explainability and evaluation: Judging stepwise reasoning fidelity via explicit chain outputs (fact pointers, reasoning paths, or generated sub-questions) is critical for both error analysis and AI trustworthiness (Yavuz et al., 2022, Malon et al., 2020).
Scientific and enterprise applications: Retrieval-augmented multi-document QA (as in IM-SciQA, CIM-SciQA) enables fine-grained evaluation of document retrieval versus evidence synthesis, with implications for fact-checking, legal, and biomedical domains (Lee et al., 15 Mar 2026).

A plausible implication is that future QA systems will dynamically route queries along pipelines tailored to the complexity of compositional reasoning required, with integrated support for decomposition, retrieval, cross-passage inference, and final synthesis—leveraging interpretability and robustness as core design imperatives.