Multi-hop QA Datasets Overview

Updated 12 December 2025
  • Multi-hop QA datasets are collections of question–answer pairs that require integrating evidence from multiple documents or modalities to infer accurate answers.
  • They employ methodologies such as crowdsourcing, synthetic composition, and structure-driven annotation to ensure connected reasoning paths.
  • Evaluation protocols use span-level metrics, supporting fact accuracy, and graph-structure analysis to diagnose multi-step inference capabilities.

A multi-hop QA dataset is a collection of question–answer pairs that specifically require the integration of information from multiple evidence fragments—spanning sentences, documents, or modalities—to arrive at the correct answer. These datasets are constructed to challenge and diagnose a system’s ability to perform compositional, multi-step reasoning, as opposed to retrieving answers via shallow heuristics or single-sentence matching. The field spans text-only datasets (HotpotQA, MuSiQue), table–text hybrids (HybridQA), KB–text hybrids (2WikiMultiHopQA), generative or symbolic extensions (MoreHopQA), explanation-supervised benchmarks (eQASC), structure-annotated benchmarks (GRS-QA), and unsupervised or synthetic data generation (MQA-QG, MHTS). Recent work has introduced further sub-regimes, including pluri-hop (exhaustive, recall-sensitive) and multi-hop temporal reasoning.

1. Foundational Concepts and Canonical Datasets

The defining feature of multi-hop QA datasets is that the evidence required to answer a question is distributed such that no single fragment suffices; correct inference necessitates combining two or more supporting facts. The seminal HotpotQA dataset (Yang et al., 2018) introduced the following core features:

  • Explicit multi-hop requirement: Each question is paired with multiple relevant Wikipedia passages, and answer prediction depends on reasoning across at least two of them.
  • Diverse question types: HotpotQA includes both bridge-type (entity chaining) and comparison-type (extract–compare–choose) questions.
  • Supporting fact annotation: Sentence-level supporting facts are labeled, enabling strong supervision for both answer and justification prediction.
  • Span supervision: It uses a span-based answer format to reduce multiple-choice biases and shortcut exploitation (a schematic instance follows this list).
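
To make the supervision format concrete, the following is a minimal sketch of a bridge-type instance in a HotpotQA-like JSON layout. The field names mirror the commonly distributed format (span answer, title–sentence context, sentence-level supporting facts), but the question and passages are invented for illustration.

```python
# Illustrative bridge-type instance in a HotpotQA-like JSON layout.
# The content is invented; field names mirror the commonly distributed
# format: span answer, context as [title, sentences], and sentence-level
# supporting_facts as [title, sentence_index] pairs.
example = {
    "question": "Which country is the birthplace of the director of Film X?",
    "answer": "France",            # span answer, not a multiple-choice label
    "type": "bridge",              # bridge (entity chaining) vs. comparison
    "context": [
        ["Film X", ["Film X is a 1999 drama.",
                    "It was directed by Jane Doe."]],
        ["Jane Doe", ["Jane Doe is a film director.",
                      "She was born in Paris, France."]],
    ],
    # Answering requires one supporting sentence from each passage (two hops).
    "supporting_facts": [["Film X", 1], ["Jane Doe", 1]],
}
```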

Subsequent datasets have diversified the landscape:

  • WikiHop (Welbl et al., 2018) introduced a multi-document, multiple-choice regime but was later shown to be vulnerable to non-multi-hop baselines due to spurious correlations (Chen et al., 2019).
  • HybridQA (Chen et al., 2020) required aggregation across semi-structured tables and linked text passages.
  • 2WikiMultiHopQA (Ho et al., 2020) enforced reasoning over structured (Wikidata triples) and unstructured (Wikipedia) contexts with explicit reasoning paths and evidence annotations.
  • MuSiQue (Trivedi et al., 2021) advanced construction algorithms to guarantee “connected” multi-hop reasoning and minimized shortcut susceptibility via a bottom-up single-hop composition and filtering strategy.
  • MoreHopQA (Schnitzler et al., 19 Jun 2024) transformed extractive benchmarks by layering in arithmetic, symbolic, and commonsense generative steps.

2. Construction Methodologies and Evidence Guarantees

Dataset construction paradigms directly shape the diagnostic value of a benchmark. Mainstream methods include:

  • Crowdsourced composition: HotpotQA and HybridQA used qualified annotators to author questions explicitly requiring multi-step document or table-to-text hops, with post-hoc verification to filter shallow instances (Yang et al., 2018, Chen et al., 2020).
  • Synthetic bottom-up DAG composition: MuSiQue systematically composes and filters single-hop QAs such that each hop is indispensable, employing formal conditions to prevent disconnected reasoning chains and artifact exploitation (Trivedi et al., 2021); a schematic composition example follows this list.
  • Template and rule-based generation: 2WikiMultiHopQA uses NER-driven question templates and logical rules mined from Wikidata triples, coupled with text–KB alignment heuristics, to synthesize guaranteed two-hop questions (Ho et al., 2020).
  • Unsupervised and semi-automated pipelines: MQA-QG (Pan et al., 2020) and MHTS (Lee et al., 29 Mar 2025) harness neural operators, LLM claim fusion, and automatic filtering to generate large-scale synthetic multi-hop pairs, while controlling for reasoning complexity, evidence composition, and answerability.
  • Structure-driven annotation: GRS-QA (Pahilajani et al., 1 Nov 2024) adds reasoning graph structures (nodes: supporting sentences; edges: inference flow) to standard multi-hop datasets, creating contrastive positive/negative examples for structural ablations.
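
As a rough illustration of the bottom-up composition strategy, the sketch below chains two single-hop QA pairs into one two-hop question through a shared bridge entity. It is loosely modeled on the idea behind MuSiQue's procedure rather than a reproduction of it; a real pipeline would additionally paraphrase the composed question into natural language and filter out instances answerable from a single context.

```python
# Minimal sketch of bottom-up two-hop composition, loosely in the spirit of
# MuSiQue: two single-hop QA pairs are chained through a shared bridge entity
# so that neither context alone suffices to answer the composed question.
def compose_two_hop(hop1, hop2):
    """hop1/hop2: dicts with 'question', 'answer', 'context' (strings)."""
    bridge = hop1["answer"]
    if bridge not in hop2["question"]:
        return None  # not connectable: hop2 does not mention hop1's answer
    # Hide the bridge entity behind a reference to hop1's question, making
    # the first hop indispensable for the second.
    composed_question = hop2["question"].replace(
        bridge, f"the answer to '{hop1['question']}'"
    )
    return {
        "question": composed_question,
        "answer": hop2["answer"],
        "decomposition": [hop1, hop2],               # explicit reasoning chain
        "contexts": [hop1["context"], hop2["context"]],
    }

hop1 = {"question": "Who directed Film X?", "answer": "Jane Doe",
        "context": "Film X was directed by Jane Doe."}
hop2 = {"question": "Where was Jane Doe born?", "answer": "Paris",
        "context": "Jane Doe was born in Paris."}
print(compose_two_hop(hop1, hop2)["question"])
# -> Where was the answer to 'Who directed Film X?' born?
```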

A critical dimension is evidence path supervision, which is manifest as sentence-level (HotpotQA, MuSiQue), KB triple-level (2WikiMultiHopQA), graph-path-level (GRS-QA), or explicit chain-of-reasoning annotations (eQASC (Jhamtani et al., 2020)).

3. Taxonomies of Reasoning Structures and Diversity

Multi-hop QA datasets capture a diverse range of reasoning phenomena. Reasoning structures are systematically categorized as:

  • Chain (bridge): Linear sequence of inferences (s₁→s₂→⋯→sₖ), as in HotpotQA bridges, MuSiQue, and GRS-QA Bridge (see the graph sketch after this list).
  • Comparison: Parallel extraction and comparison over two or more entities or their properties (HotpotQA, 2WikiMultiHopQA, GRS-QA C-*).
  • Compositional (tree): Hierarchical aggregation, e.g. table→passage→cell, as in HybridQA and GRS-QA compositional graphs.
  • Temporal or symbolic: Interval arithmetic, multi-event overlap (Complex-TR (Tan et al., 2023)), symbolic ops (MoreHopQA).
  • Pluri-hop: Recall-sensitive, exhaustive queries over repetitive, distractor-rich corpora (PluriHopWIND (Sveistrys et al., 16 Oct 2025)).
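
These structural distinctions can be made concrete as small directed graphs over supporting sentences, in the spirit of GRS-QA's reasoning-graph annotation. The adjacency-list encodings below are hypothetical, chosen only to contrast the main topologies, and are not the dataset's actual format.

```python
# Illustrative reasoning graphs (nodes = supporting sentences, edges =
# inference flow), in the spirit of GRS-QA-style annotation. These adjacency
# lists are hypothetical encodings chosen only to contrast the topologies.
chain_graph = {            # bridge: s1 -> s2 -> s3
    "s1": ["s2"],
    "s2": ["s3"],
    "s3": [],              # final node contains the answer span
}
comparison_graph = {       # comparison: parallel facts feed a compare step
    "s1": ["compare"],     # property of entity A
    "s2": ["compare"],     # property of entity B
    "compare": [],         # extract-compare-choose yields the answer
}
tree_graph = {             # compositional: hierarchical aggregation
    "s1": ["agg"],         # fact from a table cell
    "s2": ["agg"],         # fact from a linked passage
    "agg": [],             # aggregated result yields the answer
}
```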

The following table summarizes principal datasets by structures and coverage:

| Dataset | Core Structure(s) | Evidence Modalities |
| --- | --- | --- |
| HotpotQA | Bridge, comparison | Wikipedia text |
| MuSiQue | 2–4 hop chains, DAGs | Wikipedia text |
| HybridQA | Table↔text hybrids | Tables + Wikipedia passages |
| 2WikiMultiHopQA | Chain, comparison, bridge-comparison | Wikidata + Wikipedia |
| GRS-QA | Explicit graph (chain/tree/comparison) | Wikipedia/synthetic |
| MoreHopQA | Multi-hop (+ arithmetic, symbolic) | Wikipedia + generative steps |
| Complex-TR | Temporal multi-hop | Wikidata quintuples |
| PluriHopWIND | Pluri-hop, exhaustive | Real-world reports |

More recent benchmarks (e.g., MHTS (Lee et al., 29 Mar 2025), BMGQ (Qiu et al., 28 Oct 2025)) provide programmatic control over reasoning depth, semantic dispersion, and question difficulty, facilitating dataset-level ablation.

4. Evaluation Protocols, Metrics, and Diagnostic Probes

Evaluation in multi-hop QA is multidisciplinary, targeting both answer prediction and reasoning process:

  • Span-level EM/F1: Measures extraction accuracy for answer spans; standard in HotpotQA, MuSiQue, and 2WikiMultiHopQA (a scoring sketch follows this list).
  • Supporting fact F1/EM: Binary or set-based scoring of predicted vs. gold supporting sentences or KB triples (Yang et al., 2018, Ho et al., 2020).
  • Graph-structure scores: In GRS-QA, reasoning-graph precision/recall and effect of structural perturbations (adding/removing/swapping edges/nodes) explicitly test model sensitivity to reasoning path structure (Pahilajani et al., 1 Nov 2024).
  • Statement-wise F1: In pluri-hop and list-based scenarios (PluriHopWIND), answers are decomposed into atomic statements, with F1 penalizing both omissions and spurious items (Sveistrys et al., 16 Oct 2025).
  • Generative/out-of-span accuracy: In non-extractive settings (MoreHopQA), EM and F1 are computed on algorithmically constructed generative outputs; “perfect reasoning” metrics assess correct execution of all sub-steps of the decomposition (Schnitzler et al., 19 Jun 2024).
  • Diagnostic artifact probes and no-context baselines: To assess susceptibility to superficial cues, models are evaluated on input ablations (question-only, context-only, 1-para only, shallow sentence-factored) (Chen et al., 2019, Trivedi et al., 2021).
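
The answer-level and supporting-fact metrics above follow the normalized exact-match and token-overlap conventions popularized by SQuAD and HotpotQA. The sketch below is a minimal re-implementation of that style of scoring, not the official evaluation script.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def supporting_fact_f1(pred_facts, gold_facts):
    """Set-based F1 over (passage title, sentence index) pairs."""
    pred, gold = set(map(tuple, pred_facts)), set(map(tuple, gold_facts))
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

HotpotQA additionally reports joint metrics that combine the answer and supporting-fact scores, so that a system is credited only when both the prediction and its justification are correct.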

Evaluation often includes ablations (e.g., structure-perturbed graphs in GRS-QA or MHTS’s difficulty sweep), and transfer/robustness testing (e.g., perturbation-invariant reasoning in eQASC (Jhamtani et al., 2020)).

5. Challenges, Limitations, and Design Pitfalls

Systematic studies have highlighted several challenges in constructing and evaluating multi-hop QA datasets:

  • Artifact vulnerabilities: Many early datasets allowed “shortcut” models—sentence-factored baselines, question–answer correlation exploitation, or shallow retrieval heuristics—to perform nearly as well as purported multi-hop architectures (Chen et al., 2019, Trivedi et al., 2021); a minimal probe sketch follows this list.
  • Blurred hop boundaries: In practice, a substantial fraction of questions in widely used datasets (WikiHop, HotpotQA) are solvable by localizing the answer via a single sentence, as shown by the performance of factored baselines.
  • Multiple-choice vs. span-based supervision: Candidate-based formulations (multiple choice) are especially prone to gaming. Span-based and decomposition–aware annotations provide stronger constraints (Chen et al., 2019, Yang et al., 2018).
  • Semantic and structural bias: Context assembly, distractor selection, and evidence–answer alignment can all introduce non-trivial biases, potentially inflating system performance or underdiagnosing true reasoning competence (Trivedi et al., 2021, Ho et al., 2020).
  • Evidence annotation incompleteness: Sentence-level or KB triple–level evidence chains may be under-specified, permitting alternative chains or answer justifications not captured in annotations (Jhamtani et al., 2020, Ho et al., 2020).
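
A standard way to quantify the first two pitfalls is to re-run a system (or a deliberately shallow baseline) on ablated inputs and compare against its full-input score. The harness below sketches such a probe; the model(question, context) interface and the example fields are assumptions for illustration, not tied to any particular system.

```python
# Sketch of input-ablation probes for artifact detection. `model` is assumed
# to be any callable mapping (question, context) -> predicted answer string,
# `score` an answer metric (e.g., token-level F1), and each example a dict
# with 'question', 'answer', and 'context' as a list of paragraph strings.
def ablation_probe(model, score, examples):
    """Compare full-input accuracy against shortcut-friendly ablations.
    A small gap suggests the dataset admits non-multi-hop solutions."""
    settings = {
        "full": lambda ex: (ex["question"], " ".join(ex["context"])),
        "question_only": lambda ex: (ex["question"], ""),
        "context_only": lambda ex: ("", " ".join(ex["context"])),
        "single_paragraph": lambda ex: (ex["question"], ex["context"][0]),
    }
    results = {}
    for name, build in settings.items():
        scores = [score(model(*build(ex)), ex["answer"]) for ex in examples]
        results[name] = sum(scores) / len(scores)
    return results
```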

Recent datasets deploy more rigorous construction protocols: MuSiQue’s “connected-reasoning” filtering and split minimization, PluriHop’s distractor-rich composition, BMGQ’s reasoning-graph validation, and MHTS’s explicit hop/difficulty control.

6. Advanced Regimes and Future Directions

New multi-hop QA settings have been introduced to probe system capabilities beyond classical Wikipedia settings:

  • Pluri-hop/recall-sensitive QA: PluriHopWIND defines a regime where exhaustive, recall-sensitive aggregation across distractor-rich, repetitive corpora is essential. Baseline RAG and graph-based methods yield sub-40% F1, while specialized architectures (PluriHopRAG) using query decomposition and cross-encoder filtering achieve significant gains (up to 52% relative F1 improvement) (Sveistrys et al., 16 Oct 2025).
  • Temporal, arithmetic, and generative reasoning: MoreHopQA adds multi-hop questions requiring arithmetic, symbolic, and commonsense composition; GPT-4 achieves only 38.7% perfect reasoning, substantially lower than on the original 2-hop extractive questions (Schnitzler et al., 19 Jun 2024). Complex-TR explicitly targets multi-hop and multi-answer temporal queries, with answer-level F1 drops (71.9→46.8 for FLAN-T5-XL) on multi-hop vs. one-hop (Tan et al., 2023).
  • Difficulty-controlled and diagnostic datasets: MHTS (Lee et al., 29 Mar 2025) and BMGQ (Qiu et al., 28 Oct 2025) introduce pipelines where hop count, semantic dispersion, and logical complexity are programmatically controlled and empirically correlated with RAG system performance (e.g., MHTS’s difficulty score D = h − λs correlates at r = 0.99 with retrieval loss; a toy computation follows this list).
  • Structure-aware supervision: GRS-QA’s explicit reasoning-graph annotation allows systematic evaluation of performance as a function of graph topology (chains, multi-branch trees, comparisons) and provides negative graph perturbations as hard controls (Pahilajani et al., 1 Nov 2024).
  • Unsupervised and scalable synthesis: MQA-QG demonstrates that synthetic, unsupervised multi-hop QA pairs (via operators + small reasoning graphs) can train models to reach 61–83% of fully supervised performance with dramatically less human labeling (Pan et al., 2020).
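
To illustrate the kind of difficulty control described for MHTS, the toy computation below assigns scores of the form D = h − λ·s and correlates them with a retrieval-loss measure. The readings of h as a hop count and s as a semantic term, the value of λ, and all numbers are invented for exposition and should not be read as the paper's definitions or reported results.

```python
import numpy as np

# Toy illustration of a difficulty score of the form D = h - lambda * s and
# its correlation with a retrieval-loss measure. The readings of h (hop count)
# and s (semantic term), the value of lam, and all numbers below are invented
# for exposition and are not MHTS's definitions or reported results.
lam = 0.5
questions = [
    {"h": 2, "s": 0.8, "retrieval_loss": 0.15},
    {"h": 3, "s": 0.6, "retrieval_loss": 0.35},
    {"h": 4, "s": 0.4, "retrieval_loss": 0.55},
    {"h": 5, "s": 0.2, "retrieval_loss": 0.80},
]
D = np.array([q["h"] - lam * q["s"] for q in questions])
loss = np.array([q["retrieval_loss"] for q in questions])
r = np.corrcoef(D, loss)[0, 1]    # Pearson correlation between D and loss
print(f"difficulty scores = {D}, corr(D, retrieval loss) = {r:.2f}")
```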

7. Significance and Impact

Multi-hop QA datasets constitute a critical substrate for the development and assessment of compositional inference and retrieval–generation architectures. Their design evolution reflects persistent efforts to minimize artifact exploitation, to provide strong evidence and chain-of-reasoning supervision, and to expose deeper limitations in current QA systems. Explicit reasoning graph annotations, fine-grained hop/difficulty control, and moves toward generative, arithmetic, temporal, and pluri-hop scenarios highlight a trajectory toward increasingly diagnostic, robust, and realistic evaluation.

Open directions revealed by recent work include: (i) richer multi-modal and cross-lingual multi-hop scenarios, (ii) programmatic integration of symbolic, arithmetic, and commonsense reasoning, (iii) systematic testing of structure-aware and contrastive evaluation regimes, and (iv) scalable, unsupervised or weakly-supervised dataset generation for novel domains and tasks. Despite significant progress, the performance gap between humans and state-of-the-art systems—especially under robust, multi-hop benchmarks with explanations or generative reasoning—remains consistently large, underscoring the continued centrality of well-designed multi-hop QA datasets for advancing machine reasoning (Trivedi et al., 2021, Pahilajani et al., 1 Nov 2024, Lee et al., 29 Mar 2025, Schnitzler et al., 19 Jun 2024, Sveistrys et al., 16 Oct 2025).
