Multi-hop QA Benchmarks

Updated 10 December 2025
  • Multi-hop QA benchmarks are structured datasets designed to evaluate models' ability to integrate multiple pieces of evidence for complex reasoning.
  • They leverage methodologies like chain-of-thought prompting, dynamic retrieval, and human annotation to validate and refine reasoning chains.
  • Evaluation protocols use metrics such as Exact Match and chain-aware F1 to pinpoint retrieval bottlenecks and systematic reasoning errors.

Multi-hop Question Answering (QA) benchmarks are structured datasets designed to rigorously evaluate a model’s ability to perform reasoning across multiple steps, often necessitating the integration of disparate pieces of evidence. These benchmarks underpin both methodological advances and diagnostic evaluation for models in open-domain, long-context, multimodal, temporal, and structured knowledge settings. The following sections synthesize prominent benchmark designs, evaluation protocols, characteristic failure modes, and directions for research and deployment.

1. Foundations and Historical Evolution

Early multi-hop QA benchmarks such as HotpotQA (Yang et al., 2018) established the core paradigm: questions require synthesis across multiple supporting documents (typically two Wikipedia paragraphs), annotated with both sentence-level supporting facts and answer spans, thereby providing explanatory supervision. The full-wiki setting introduced a retrieval challenge, requiring iterative IR to locate the necessary evidence. These foundational datasets spurred progress in passage retrieval architectures, differentiable reasoning graphs, and supporting-fact prediction.

Subsequent works addressed domain-specific and structural gaps. FinReflectKG-MultiHop (Arun et al., 3 Oct 2025) introduced temporally indexed financial knowledge graphs, enabling analyst-style 2–3 hop reasoning across sectors and years. MuMuQA (Reddy et al., 2021) extended this paradigm to news, demanding cross-modal reasoning by grounding visual objects in images and linking them to text.

Recent years have seen benchmarks diversify markedly: NovelHopQA (Gupta et al., 20 May 2025) probes 1–4 hop reasoning over full-narrative contexts of up to 128k tokens, DocHop-QA (Park et al., 20 Aug 2025) integrates multimodal scientific document collections, and Complex-TR (Tan et al., 2023) tests multi-hop, multi-answer temporal inference over time-stamped facts.

2. Benchmark Construction Methodologies

Benchmarks are constructed via controlled pipelines that enforce reasoning depth, evidence distribution, and diversity. NovelHopQA (Gupta et al., 20 May 2025) employs anchor-keyword guided paragraph chains: for each novel, multi-hop chains ($H \in \{1, 2, 3, 4\}$) are grown by sampling paragraphs containing specific high-frequency keywords, with each new hop integrating an additional paragraph via shared keywords, ensuring evidence linkage. Human annotation enforces both alignment and hop stratification.
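
The chain-growth procedure can be sketched as below. The function and parameter names (keyword_set, grow_chain, the top-k vocabulary size) are illustrative assumptions for exposition, not the NovelHopQA release code.

```python
# Hedged sketch of anchor-keyword guided chain growth: consecutive paragraphs
# in the chain must share at least one high-frequency keyword.
import random
from collections import Counter

def build_vocab(paragraphs, top_k=200, min_len=4):
    """Hypothetical keyword vocabulary: the top-k frequent content words."""
    counts = Counter(w for p in paragraphs for w in p.lower().split() if len(w) >= min_len)
    return {w for w, _ in counts.most_common(top_k)}

def keyword_set(paragraph, vocab):
    """Keywords from the fixed vocabulary that occur in this paragraph."""
    return set(paragraph.lower().split()) & vocab

def grow_chain(paragraphs, vocab, hops, rng=random.Random(0)):
    """Grow a chain of `hops` paragraphs, each linked to the previous one by a shared keyword."""
    chain = [rng.randrange(len(paragraphs))]          # anchor paragraph
    for _ in range(hops - 1):
        shared = keyword_set(paragraphs[chain[-1]], vocab)
        candidates = [i for i, p in enumerate(paragraphs)
                      if i not in chain and keyword_set(p, vocab) & shared]
        if not candidates:                            # dead end: no linkable paragraph
            return None
        chain.append(rng.choice(candidates))
    return chain
```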

MHTS (Lee et al., 29 Mar 2025) formalizes multi-hop benchmarks as tree structures. Leaf nodes are elementary evidence chunks; intermediate nodes are atomic or synthesized claims. Traversal from root (multi-hop claim) to leaves yields the full reasoning chain. MHTS quantifies difficulty as $D = h - \lambda s$, with $h$ the number of hops and $s$ the average semantic dispersion (mean cosine similarity of the evidence embeddings), enabling fine-grained difficulty control.
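
A minimal sketch of this difficulty score, assuming the evidence chunks are already embedded; the $\lambda$ weight and the embedding model are placeholders, not values from the paper.

```python
# Illustrative computation of D = h - lambda * s, where s is the mean pairwise
# cosine similarity of the evidence-chunk embeddings for one question.
import numpy as np

def mhts_difficulty(chunk_embeddings: np.ndarray, hops: int, lam: float = 1.0) -> float:
    """chunk_embeddings: (n_chunks, dim) array of evidence-chunk embeddings."""
    X = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
    sims = X @ X.T                                    # pairwise cosine similarities
    iu = np.triu_indices(len(X), k=1)                 # upper triangle, excluding the diagonal
    s = sims[iu].mean() if len(X) > 1 else 1.0        # average semantic similarity term
    return hops - lam * s
```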

GRADE (Lee et al., 23 Aug 2025) introduces a two-dimensional difficulty matrix: one axis partitions questions by hop count ($k \in \{2, 3, 4, 5\}$), the other by retriever-side semantic distance ($D_r(q) = 1 - \min_{c_i} s(q, c_i)$). Synthetic datasets are generated by extracting knowledge graphs from factual sources, soft-clustering, and sampling diverse multi-hop paths.
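
The retriever-side distance and cell assignment might be computed as in the sketch below; the distance bin edges and the hop-to-row mapping are illustrative assumptions, not GRADE's published thresholds.

```python
# Sketch of GRADE-style difficulty bucketing: D_r(q) = 1 - min_i s(q, c_i)
# over the gold evidence chunks c_i, then a (hop, distance-bin) cell index.
import numpy as np

def retriever_distance(q_emb: np.ndarray, chunk_embs: np.ndarray) -> float:
    """Distance driven by the hardest-to-retrieve gold chunk."""
    q = q_emb / np.linalg.norm(q_emb)
    C = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    return 1.0 - float((C @ q).min())

def difficulty_cell(hops: int, d_r: float, dist_bins=(0.2, 0.4, 0.6, 0.8)):
    """Map a question to a cell of the 2D difficulty matrix (bin edges are assumed)."""
    row = hops - 2                          # hop counts k in {2,3,4,5} -> rows 0..3
    col = int(np.digitize(d_r, dist_bins))  # retrieval-distance bucket
    return row, col
```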

Counterfactual reasoning is pioneered by CofCA (Wu et al., 19 Feb 2024): Gold passages are knowledge-edited to remove overlaps with model pretraining data, and questions are annotated with explicit sub-questions and intermediate answers, enabling end-to-end chain evaluation.

3. Evaluation Protocols and Error Attribution

Benchmarks apply rigorous, multi-dimensional evaluation protocols. Accuracy is typically measured by Exact Match (EM) and token-level F1. Joint metrics in HotpotQA (Yang et al., 2018) multiply answer and supporting-fact precision/recall. Reasoning-chain accuracy (Acc_chain) and chain-aware metrics (Joint F1_RC) in CofCA (Wu et al., 19 Feb 2024) require all intermediate steps and the final answer to be correct.
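
A compact sketch of these metrics follows; it omits the answer-normalization details (article and punctuation stripping) used by the official HotpotQA evaluator, and the chain-accuracy helper assumes intermediate answers are given as ordered lists of strings.

```python
# Minimal sketches of exact match, token-level F1, a HotpotQA-style joint F1
# (answer and supporting-fact precision/recall multiplied before combining),
# and a CofCA-style chain accuracy over intermediate answers.
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip().lower() == gold.strip().lower())

def prf1(pred_tokens, gold_tokens):
    """Token-level precision, recall, F1 based on bag-of-words overlap."""
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    p, r = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return p, r, 2 * p * r / (p + r)

def joint_f1(ans_pred, ans_gold, sp_pred, sp_gold):
    """sp_* are sets of (title, sentence_id) supporting facts."""
    p_a, r_a, _ = prf1(ans_pred.lower().split(), ans_gold.lower().split())
    tp = len(sp_pred & sp_gold)
    p_s = tp / len(sp_pred) if sp_pred else 0.0
    r_s = tp / len(sp_gold) if sp_gold else 0.0
    jp, jr = p_a * p_s, r_a * r_s                    # joint precision / recall
    return 2 * jp * jr / (jp + jr) if jp + jr > 0 else 0.0

def chain_accuracy(pred_steps, gold_steps):
    """Chain-level accuracy: every intermediate answer and the final answer must match."""
    return float(len(pred_steps) == len(gold_steps) and
                 all(exact_match(p, g) for p, g in zip(pred_steps, gold_steps)))
```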

Oracle-context filtering, as in NovelHopQA (Gupta et al., 20 May 2025), ensures only context-answerable questions are retained: each example is discarded if any state-of-the-art model fails on it. Simultaneously, human raters validate hop-depth and alignment. Retrieval-augmented variants (RAG), e.g., in NovelHopQA and MINTQA (He et al., 22 Dec 2024), chunk context windows and provide only selected passages, typically leading to substantial declines (~30 points) in accuracy versus full-context evaluation, revealing retrieval bottlenecks.
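
The RAG-style evaluation setting can be sketched roughly as below, where `embed` and `answer` stand in for whichever encoder and QA model are under test, and the chunk size, stride, and top-k are arbitrary choices rather than the benchmarks' settings.

```python
# Hedged sketch of retrieval-augmented evaluation: split the full context into
# fixed-size chunks, retrieve a top-k subset per question, and score the model
# on that subset only (here with unnormalized exact-match for simplicity).
import numpy as np

def chunk(text: str, size: int = 512, stride: int = 256):
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), stride)]

def rag_context(question: str, context: str, embed, k: int = 5) -> str:
    chunks = chunk(context)
    q = embed(question)
    C = np.stack([embed(c) for c in chunks])
    scores = C @ q / (np.linalg.norm(C, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-scores)[:k]
    return "\n\n".join(chunks[i] for i in sorted(top))   # keep document order

def rag_accuracy(examples, embed, answer, k: int = 5) -> float:
    """examples: iterable of (question, full_context, gold_answer) triples."""
    hits = sum(answer(q, rag_context(q, ctx, embed, k)).strip() == gold.strip()
               for q, ctx, gold in examples)
    return hits / len(examples)
```

Comparing this score against the same model run on the full context makes the retrieval bottleneck (the reported ~30-point drop) directly measurable.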

GRADE’s cell-wise error heatmaps quantitatively separate reasoning and retrieval failures; diagonal errors (high hop count, high semantic distance) reflect compounded difficulty. MHTS empirically anchors the difficulty estimate $D$ by correlating it ($r = 0.99$) with RAG win-rate (Lee et al., 29 Mar 2025).

4. Characteristic Benchmarks and Domains

A representative subset of modern multi-hop QA benchmarks is presented below.

| Benchmark | Reasoning Hops | Modality/Domain | Core Evaluation Metric |
|---|---|---|---|
| HotpotQA (Yang et al., 2018) | 2 (fixed) | Wikipedia text | EM, F1, Supporting-Fact EM/F1, Joint EM/F1 |
| FinReflectKG-MultiHop (Arun et al., 3 Oct 2025) | 2–3 | Financial KG | Correctness, Token Utilization Reduction |
| NovelHopQA (Gupta et al., 20 May 2025) | 1–4 | Narrative text | EM, Δ_H, Δ_L, RAG Accuracy |
| DocHop-QA (Park et al., 20 Aug 2025) | 2–4 | Multimodal scientific | F1 (task-dependent), BLEU, ROUGE |
| Complex-TR (Tan et al., 2023) | 1–3 | Temporal KG | SetAcc, Ans F1, Token F1 |
| MHTS (Lee et al., 29 Mar 2025) | 2–4 | Synthetic/Structured | D (difficulty), RAG win-rate |
| CofCA (Wu et al., 19 Feb 2024) | 2–4 | Counterfactual text | Acc_chain, Joint F1_RC |
| MINTQA (He et al., 22 Dec 2024) | 1–4 | Popular/tail/new KB | EM, F1, Retrieval MAP/MRR |

Benchmarks collectively probe text, knowledge graphs, tables, images, and multimodal layouts, often in open-ended, non-hyperlinked settings.

5. Common Failure Modes and Analytical Insights

Multi-hop QA models exhibit distinctive failure patterns as task complexity increases. NovelHopQA (Gupta et al., 20 May 2025) systematically documents several:

  • Missing Final-Hop Integration: Most models reconstruct initial hops but omit the decisive last clue.
  • Coreference/Entity Confusion: Pronoun or name ambiguity across distant evidence leads to swap errors.
  • Incomplete Evidence Combination: Models aggregate only a subset of required facts, omitting crucial steps.
  • Contextual Drift: In long contexts, models may revert to irrelevant early details rather than sustaining focus on the reasoning chain.

Such failure modes intensify with greater hop count ($H \geq 3$) or extended context windows ($\geq$ 96k tokens). GRADE (Lee et al., 23 Aug 2025) observes a strong increase in error rates along the main diagonal cells of its 2D difficulty matrix; retrieval bottlenecks dominate when the minimum query–chunk similarity in $D_r$ drops (i.e., $D_r$ is large). CofCA (Wu et al., 19 Feb 2024) quantifies "chain inflation": correct final answers often bypass the intended intermediate steps.

A plausible implication is that both retrieval bottlenecks and systematic reasoning errors remain unsolved, particularly in long-context, multimodal, and knowledge-edited settings.

6. Adaptive Methods and Future Directions

Contemporary frameworks increasingly employ dynamic reasoning and retrieval approaches. BELLE (Zhang et al., 17 May 2025) introduces bi-level multi-agent debate, matching question types (Inference, Comparison, Temporal, Null) to corresponding operator sets (Chain-of-Thought, single-step, iterative-step, sub-step, adaptive-step), yielding up to +7.6 F1 on MuSiQue. AMKOR (Coleman et al., 9 Feb 2025) dynamically fuses parametric and external knowledge, employing probabilistic beam reasoning and multi-granular losses to optimize both local reasoning and global answer accuracy; error propagation is mitigated by tracking multiple candidate reasoning trajectories.
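
As a rough illustration of the bi-level routing idea, the sketch below maps predicted question types to candidate operator sets; the specific type-to-operator assignments are placeholders for exposition, not the mapping reported in the BELLE paper.

```python
# Illustrative question-type -> operator routing in the spirit of BELLE's
# bi-level design. The assignments below are assumed, not taken from the paper.
from typing import Dict, List

OPERATOR_BANK: Dict[str, List[str]] = {
    "Inference":  ["chain_of_thought", "sub_step"],
    "Comparison": ["single_step", "sub_step"],
    "Temporal":   ["iterative_step", "chain_of_thought"],
    "Null":       ["adaptive_step"],
}

def route(question_type: str) -> List[str]:
    """Return candidate operator names for a predicted question type."""
    return OPERATOR_BANK.get(question_type, ["adaptive_step"])
```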

Benchmarks like MINTQA (He et al., 22 Dec 2024) evaluate not only answering accuracy but strategic decision-making—when to decompose, when to retrieve, how to adaptively interleave decomposition and retrieval, and how decomposition errors cascade. Best-performing systems jointly train decomposition and retrieval, enforce explicit chain-of-thought prompting, and calibrate confidence for dynamic retrieval. Multimodal benchmarks (DocHop-QA (Park et al., 20 Aug 2025), MuMuQA (Reddy et al., 2021)) highlight that naive fusion can degrade performance; layout and cross-modal understanding are essential.

Pseudo-instruction tuning in Complex-TR (Tan et al., 2023) augments real examples with temporally shifted, fictional entities to improve multi-hop temporal reasoning and future-adaptation, yielding +13 F1 in multi-hop settings.
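
A toy sketch of this augmentation idea, assuming facts are stored as (subject, relation, object, year) records; the fictional entity names, question template, and year-shift scheme are illustrative assumptions, not Complex-TR's actual recipe.

```python
# Toy pseudo-instruction augmentation: copy a real time-stamped fact, swap in a
# fictional entity, and shift the year so the model cannot rely on memorization.
import random

FICTIONAL_ENTITIES = ["Aurelia Kade", "Tovim Corp", "Republic of Zorvan"]  # made-up names

def augment(fact: dict, year_shift: int = 5, rng=random.Random(0)) -> dict:
    """fact = {'subject': str, 'relation': str, 'object': str, 'year': int}"""
    return {
        "subject": rng.choice(FICTIONAL_ENTITIES),   # fictional entity replaces the real one
        "relation": fact["relation"],
        "object": fact["object"],
        "year": fact["year"] + year_shift,           # temporally shifted fact
    }

def to_question(fact: dict) -> str:
    """Render an augmented fact as a simple temporal question (template is assumed)."""
    return f"In {fact['year']}, which entity had the relation '{fact['relation']}' with {fact['object']}?"
```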

7. Benchmark Utility, Limitations, and Prospects

Multi-hop QA benchmarks serve as diagnostic tools for measuring LLMs’ compositional reasoning, retrieval robustness, cross-modal integration, and chain fidelity. They expose inflated scores on memorization-friendly datasets, demonstrate underperformance in counterfactual and novel contexts, and catalyze methodologically diverse developments in adaptive reasoning.

Limitations persist: Most datasets lack supervision for longer reasoning chains, rely on fixed hop templates, or underrepresent real-world multimodal complexity. Retrieval remains a central bottleneck, particularly under semantic drift or heterogeneous document pools. Many evaluation metrics (EM/F1) overstate performance where multi-answer or chain fidelity is required; set-level or chain-level accuracy is preferable. Manual validation and domain adaptation (financial, scientific, temporal) are ongoing challenges.

Current prospects include scalable benchmark construction via synthetic pipelines (MHTS, GRADE), explicit difficulty calibration, chain-aware and multimodal evaluation, and expansion to broader domains, languages, and more intricate reasoning patterns. Through public release of code, data, and difficulty metrics, these benchmarks define standardized, reproducible challenges for future research in reasoning-centric and retrieval-augmented QA.