Multi-hop QA Benchmarks
- Multi-hop QA benchmarks are datasets that assess a system's capability to perform sequential reasoning across multiple evidence sources.
- Many recent benchmarks decompose questions into sub-questions with annotated intermediate answers, allowing chain integrity and error propagation to be monitored.
- Recent benchmark designs incorporate counterfactual editing, multi-modal pipelines, and LLM-guided generation to address reasoning shortcuts and evaluate complex inference.
Multi-hop question answering (QA) benchmarks are datasets constructed to rigorously evaluate the ability of systems, principally LLMs, to perform reasoning over multiple intermediate inference steps, integrating information from disparate sources or modalities to derive a correct answer. Unlike single-hop QA, where the answer can often be retrieved directly from a single passage or fact, multi-hop QA explicitly requires sequential evidence retrieval, sub-question answering, and synthesis: each step depends on the output of one or more prior intermediate reasoning steps. For example, answering "In which country was the director of Inception born?" requires first identifying the director (Christopher Nolan) and then retrieving his country of birth (the United Kingdom). The field encompasses a rapidly evolving array of benchmark designs that probe different aspects of reasoning complexity, knowledge novelty, multi-modality, and the robustness of both retrieval and generation components.
1. Benchmark Taxonomy and Construction Methodologies
Benchmark design in multi-hop QA has evolved from simple linked-passage settings (HotpotQA, 2WikiMultiHopQA) to sophisticated constructions that enforce connected reasoning, control for memorization, and target previously underexplored axes such as knowledge novelty, semantic diversity, and multi-modal integration.
A foundational distinction concerns the construction methodology:
- Top-down, annotation-centric approaches: Classic benchmarks such as HotpotQA rely on crowdworkers to author questions over pairs of Wikipedia paragraphs, with answer and supporting fact annotations but without strong controls against reasoning shortcuts or data contamination (Trivedi et al., 2021).
- Bottom-up, decomposition-driven pipelines: For example, MuSiQue composes multi-hop questions by systematically chaining single-hop questions while imposing strong connectedness filters: each subsequent hop (sub-question) is answerable only if the previous hop's answer is known, enforced via masking and answerability tests with strong reading-comprehension models (a minimal sketch of such a filter follows this list). This bottom-up methodology yields DAG-structured reasoning chains and enables the creation of contrastive unanswerable pairs (Trivedi et al., 2021).
- Knowledge-edited and counterfactual datasets: CofCA (IRE) knowledge-edits Wikipedia passages, substituting named entities, dates, and facts with novel surrogates, which prevents memorization from pre-training and enables joint evaluation over factual and counterfactual contexts. Each question is strictly decomposed into sub-questions with annotated intermediate answers (Wu et al., 2024).
- Hybrid and cross-modal pipelines: HybridQA requires reasoning over both tables and text, generating questions that necessitate interleaved operations on tabular data and unstructured passages. MuMuQA builds two-hop chains that traverse visual entity grounding (image-caption alignment) and textual span extraction from real-world news (Chen et al., 2020, Reddy et al., 2021).
- Large-scale, automatic, and difficulty-controlled frameworks: Advanced pipelines leverage LLM-guided evidence tree generation (MHTS), multi-hop reasoning over long contexts (NovelHopQA), or explicit control over knowledge popularity/novelty (MINTQA) to construct highly stratified corpora with fine-grained control of semantic diversity and reasoning depth (He et al., 2024, Lee et al., 29 Mar 2025, Gupta et al., 20 May 2025).
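To make the masking-based connectedness test concrete, the following minimal sketch checks that each hop becomes unanswerable once its predecessor's answer is masked. Here `answerable` is a hypothetical predicate standing in for a strong reading-comprehension model, and the real MuSiQue pipeline applies additional filters; this is an illustration, not the official implementation.

```python
# Minimal sketch of a bottom-up connectedness filter in the spirit of MuSiQue.
# `answerable` is a hypothetical callable wrapping a strong reading-comprehension
# model; the actual MuSiQue filtering heuristics differ in detail.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Hop:
    question: str  # sub-question; references the previous hop's answer
    answer: str    # gold intermediate answer for this hop
    context: str   # supporting paragraph for this hop

def is_connected(chain: List[Hop], answerable: Callable[[str, str], bool]) -> bool:
    """Keep a composed chain only if every hop needs its predecessor:
    with the previous answer masked out of the sub-question, the hop
    must become unanswerable from its own context."""
    for prev, hop in zip(chain, chain[1:]):
        masked = hop.question.replace(prev.answer, "[MASK]")
        if answerable(masked, hop.context):
            return False  # reasoning shortcut: hop solvable without the prior answer
    return True
```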
A summary table of characteristic construction axes:
| Benchmark Type | Construction Pipeline | Reasoning Control |
|---|---|---|
| HotpotQA, 2WikiMultiHopQA | Manual question authoring | Weak (some overlap) |
| MuSiQue | Single-hop recombination | Strong via masking |
| CofCA (IRE) | Counterfactual editing | Joint chain + non-leak |
| HybridQA | Table+text, pipeline | Heterogeneity |
| MuMuQA, DocHop-QA | Cross-modal, LLM-guided | Visual/textual |
| MHTS, MINTQA, NovelHopQA | LLM auto-gen, stratified | Difficulty, hops, knowledge axis |
2. Reasoning Chain Decomposition and Sub-question Annotation
A core organizing principle for modern benchmarks is the explicit decomposition of the target question into a sequence of sub-questions and annotated intermediate answers. This enables:
- Measurement of chain integrity: Evaluation extends beyond the final (“hop N”) answer to performance at each intermediate step, penalizing models that “bypass” the true reasoning chain (e.g., produce correct answers via spurious context matching rather than correct stepwise synthesis) (Wu et al., 2024).
- Isolation of error propagation: The drop-off in performance from the first sub-question to the N-th, and the compounding of errors across hops, are directly observable.
- Training signal for decomposition-aware models: Multi-hop QA models (e.g., MUPPET, HGN, PRISM) can leverage sub-question annotation for more precise retrieval and answer generation at each step (Feldman et al., 2019, Fang et al., 2019, Nahid et al., 16 Oct 2025).
Both MuSiQue and CofCA (IRE) annotate each N-hop question as a chain of N sub-questions plus the final QA, with all intermediate answers, formally represented as

$$Q \;\longmapsto\; \big[(q_1, a_1), (q_2, a_2), \ldots, (q_N, a_N)\big], \qquad a_N = \text{final answer},$$

where each sub-question $q_i$ (for $i > 1$) references the preceding answer $a_{i-1}$.
Intensive validation (e.g., MuSiQue’s condition: every sub-question is unanswerable without its direct predecessor’s answer) ensures there are no reasoning shortcuts.
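As a concrete (schematic) illustration, a decomposed N-hop item can be serialized as below; the field names are illustrative rather than any dataset's official schema.

```python
# Schematic record for an N-hop item with annotated sub-questions and
# intermediate answers (field names are illustrative, not an official schema).
example_item = {
    "question": "final N-hop question",
    "answer": "a_N (final answer)",
    "decomposition": [
        {"sub_question": "q_1", "answer": "a_1", "support": "paragraph id"},
        {"sub_question": "q_2 (references a_1)", "answer": "a_2", "support": "paragraph id"},
        # ... one entry per hop, ending with the hop that yields a_N
    ],
}
```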
3. Evaluation Metrics and Experimental Protocols
Multi-hop QA evaluation encompasses a spectrum of metrics aimed at capturing answer correctness, evidence recall, and the integrity of the reasoning chain:
- Answer metrics: F1 (token overlap) and EM (Exact Match) scores on the final answer (Trivedi et al., 2021, Wu et al., 2024).
- Support/evidence metrics: F1/EM on supporting sentences or paragraphs, especially in distractor-rich settings (HotpotQA, MuSiQue, PRISM) (Fang et al., 2019, Nahid et al., 16 Oct 2025).
- Chain/Joint metrics: Chain-level evaluation aggregates per-step precision/recall via products, and joint metrics integrate all sub-QA and final QA results into a single score. E.g., in CofCA, per-hop precision and recall are aggregated as

$$P_{\text{chain}} = \prod_{i=1}^{N} P_i, \qquad R_{\text{chain}} = \prod_{i=1}^{N} R_i,$$

with joint F1

$$F1_{\text{joint}} = \frac{2\, P_{\text{chain}}\, R_{\text{chain}}}{P_{\text{chain}} + R_{\text{chain}}}.$$

A minimal sketch of these answer and chain metrics appears after this list.
- Difficulty and stratification metrics: MHTS introduces a fine-grained query difficulty score computed from the number of hops $h$ and the average semantic similarity $\bar{s}$ (cosine) between the question and its supporting chunks, increasing in $h$ and decreasing in $\bar{s}$. This score correlates strongly with system failure rates (Pearson correlation with win-rate) (Lee et al., 29 Mar 2025).
- Semantic/structural diagnostics: Metrics track not just accuracy but semantic diversity, supporting evidence diversity, and retrieval-vs-generation isolation.
- RAG (retrieval-augmented generation) protocols: NovelHopQA and PRISM explicitly test models in both full-context and RAG settings, stratifying results by accuracy across context window sizes, number of hops, and evidence recall (Gupta et al., 20 May 2025, Nahid et al., 16 Oct 2025).
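The sketch below implements token-level EM/F1 for a single answer and a multiplicative chain aggregation over per-hop F1 scores, in the spirit of the chain/joint metrics above; the exact CofCA definitions may differ in detail.

```python
# Token-level EM/F1 for a single answer, plus a product-style chain score
# over per-hop F1 values (a simplified variant of the joint metrics above).

from collections import Counter
from math import prod
from typing import List

def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def chain_f1(pred_chain: List[str], gold_chain: List[str]) -> float:
    """Aggregate per-hop F1 multiplicatively: one broken hop sinks the chain."""
    assert len(pred_chain) == len(gold_chain)
    return prod(token_f1(p, g) for p, g in zip(pred_chain, gold_chain))
```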
4. Key Datasets and Their Design Axes
Selected exemplars highlight the evolution of the field:
| Benchmark | Modality | Hops | Reasoning Structure | Novelty Control | Type of Evidence | Notable Features |
|---|---|---|---|---|---|---|
| HotpotQA | Text | 2 | Paragraph chains | None | Wikipedia, supporting sentences | Large, crowd-authored, distractors |
| 2WikiMultiHopQA | Text | 2–4 | Entity bridge/compositional | None | Wikipedia | Linked entity pairs, multi-hop |
| MuSiQue | Text | 2–4 | DAG sub-questions, filtered | Hard anti-shortcut | Wikipedia, per-hop contexts | Rigorous connectedness, contrast pairs |
| CofCA (IRE) | Text | 2–4 | Stepwise, counterfactual | Counterfactual chains | Wikipedia (rewritten, unseen) | Explicit bypass/chain metrics, sub-QAs |
| HybridQA | Table+Text | 2–3 | Table/text hybrid hops | None | Wikipedia tables/passages | First large hybrid, detailed OP pipeline |
| MuMuQA | Text+Image | 2 | Visual-text hop chains | None | News images/captions + body text | Grounding + extraction, synthetic QA pipeline |
| DocHop-QA | Multi-mod | 2–4 | Fan/chain-hop; multimodal | None | PubMed: text, tables, layout | Scientific PDFs, cross-doc, multimodal tasks |
| MINTQA | Text/KG | 1–4 | Fact chains (novel/tail) | Popularity/recency | Wikidata, KG-linearized | Fine-grained knowledge novelty axis |
| NovelHopQA | Narrative | 1–4 | Paragraph chain (keyword) | Long context | Full-length novels (64k–128k tokens) | Hop/length stratified, open-source pipeline |
| MHTS | Text | 1–6 | Tree-structured claim QA | Difficulty (hops, sem) | Gutenberg, diverse clustering | Explicit D score, fine-grained strat. |
5. Empirical Findings, Modeling Insights, and Limitations
Key experimental trends and failure analyses from recent benchmarks:
- Shortcuts and data leakage: Factual Wikipedia-based benchmarks tend to over-estimate "reasoning" due to model memorization or shallow shortcut exploitation. Counterfactual or knowledge-edited settings (IRE/CofCA, MINTQA) reveal a 15–25 point drop in F1/EM, indicating a substantial memorization bias in prior results (Wu et al., 2024, He et al., 2024).
- Sub-question prompt inclusion: Providing explicit sub-questions in the prompt (decomposition-aware prompting) leads to consistent gains in both final QA and joint chain performance (Wu et al., 2024).
- Difficulty stratification: As hop count and semantic dispersion increase, accuracy consistently decreases (e.g., average accuracy falls by 12–14 points from 1-hop→4-hop at fixed context; in long narratives, RAG accuracy falls 25–35 points below full-golden-context) (Gupta et al., 20 May 2025, Lee et al., 29 Mar 2025).
- Chain bypass and error analysis: A significant portion of answers is achieved via bypass chains or incomplete reasoning, with joint chain correctness typically below 40% for 2-hop and dropping steeply at higher depths (Wu et al., 2024).
- Limitations and errors: Dominant error types include missing final-hop evidence integration, entity confusion, and cumulative evidence drift, as well as retriever failures in multi-hop, long-context, or novel-fact settings (Gupta et al., 20 May 2025, He et al., 2024).
- Robustness to question type and decomposition: Performance and optimal modeling approaches are highly sensitive to question type (inference, comparison, temporal, null/open) (Zhang et al., 17 May 2025).
6. Specialized Benchmark Axes: Retrieval, Modality, and Agentic Protocols
Modern benchmarks systematically probe:
- Retrieval–Generation Disentanglement: PRISM, MHTS, and MINTQA stratify evaluation to diagnose failure sources—retriever (missing/support-insufficient evidence) vs. reader/generator (span extraction/generation errors) (Nahid et al., 16 Oct 2025, Lee et al., 29 Mar 2025, He et al., 2024).
- Difficulty and semantic diversity: MHTS introduces a continuous difficulty score tied to both the number of hops and average semantic proximity, empirically correlating with win-rate and system breakdowns (Lee et al., 29 Mar 2025); a minimal sketch of such a score follows this list.
- Agentic, planner-based reasoning: BELLE dynamically allocates reasoning “operators” (sub-step, single/iterative-step retrieval, CoT) via bi-level agentic debate, exploiting question-type sensitivity for cost-effective gains (up to +7 F1 over best fixed baselines) (Zhang et al., 17 May 2025).
- Multi-modality: HybridQA and DocHop-QA operationalize QA that can only be solved by integrating both structured/tabular and unstructured text (HybridQA) or by jointly grounding in textual, tabular, and visual/PDF layout information (DocHop-QA). MuMuQA probes image–text grounding in news, revealing a strong bottleneck in current model cross-modal co-reference (Chen et al., 2020, Park et al., 20 Aug 2025, Reddy et al., 2021).
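As an illustration of a hop-and-similarity difficulty proxy in the spirit of MHTS (the paper's exact formula and weighting may differ), the sketch below combines hop count with question-evidence cosine similarity computed via a sentence-transformers encoder; the model name is an assumption.

```python
# Hop-and-similarity difficulty proxy in the spirit of MHTS: questions are
# harder when they need more hops and sit semantically farther from their
# evidence. The exact MHTS formula and weights may differ.

from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

def difficulty(question: str, supporting_chunks: List[str], hops: int) -> float:
    q_vec = model.encode([question])[0]
    c_vecs = model.encode(supporting_chunks)
    # average cosine similarity between the question and its supporting chunks
    sims = c_vecs @ q_vec / (np.linalg.norm(c_vecs, axis=1) * np.linalg.norm(q_vec))
    avg_sim = float(np.mean(sims))
    # more hops and lower question-evidence similarity -> higher difficulty
    return hops * (1.0 - avg_sim)
```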
7. Implications, Recommendations, and Directions
Multi-hop QA benchmarks have catalyzed increasingly nuanced evaluations of system reasoning, moving the field from coarse answer accuracy on single-passage queries to stepwise, evidence-grounded, and knowledge-novel reasoning diagnostics. Clear recommendations—grounded in recent benchmarks—are:
- Publish per-hop sub-question chains and enforce stepwise answer annotation (Wu et al., 2024, Trivedi et al., 2021).
- Adopt counterfactual editing or knowledge-graph composition to eliminate pretraining data shortcuts (Wu et al., 2024, He et al., 2024).
- Use joint chain evaluation metrics to penalize incomplete or bypassed reasoning (Wu et al., 2024).
- Tune retrieval–reasoning pipeline and agentic orchestration to question type and difficulty (Nahid et al., 16 Oct 2025, Zhang et al., 17 May 2025).
- Explicitly stratify by context length, knowledge novelty, modality, and semantic diversity for diagnostic evaluation (Lee et al., 29 Mar 2025, Gupta et al., 20 May 2025).
Recent advances suggest that, despite dramatic LLM progress, robust and general multi-hop reasoning—especially in the presence of novel, long-tail, deeply multi-modal, or long-context evidence—remains an open challenge, and that continual benchmark innovation is critical for meaningful progress in this domain.