HotpotQA: Multi-Hop QA Benchmark

Updated 17 December 2025
  • HotpotQA is a large-scale dataset that requires multi-hop reasoning over multiple supporting documents with explicit sentence-level supervision.
  • It standardizes evaluation with metrics like EM and F1 for both answers and supporting facts, highlighting performance gaps to human-level QA.
  • HotpotQA has spurred innovations in retrieval-augmented generation, graph-based reasoning, and modular architectures, influencing future QA system designs.

HotpotQA is a large-scale, diverse benchmark for multi-hop question answering that catalyzed substantial advances in explainable, compositional QA over unstructured text corpora. It serves as both a dataset and a standardized task environment for evaluating complex multi-hop reasoning, combining answer supervision with supporting-fact supervision, and it has broadly influenced methodology in retrieval, reasoning, and explainability for QA systems.

1. Dataset Design, Composition, and Supervision

HotpotQA consists of 112,779 English Wikipedia-based question–answer pairs specifically constructed to require multi-hop reasoning over multiple supporting documents (Yang et al., 2018). Each example includes:

  • A question, answer (text span or yes/no), and indices of the supporting paragraphs and their gold supporting sentences.
  • Distractor context: in the “distractor” setting, each question is paired with its two gold paragraphs plus eight TF-IDF–retrieved distractors, yielding 10 context paragraphs per instance.
  • Fullwiki context: in the “fullwiki” setting, the QA system must retrieve relevant paragraphs from the full English Wikipedia (∼5 million articles).

Crucially, sentence-level supporting fact labels are available, enabling models to receive strong supervision for the reasoning chain as well as the final answer. On average, questions require 2–3 supporting sentences.
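
For concreteness, here is a minimal sketch of loading and inspecting one example, assuming the Hugging Face hub copy of the dataset (the `hotpot_qa` card; its field layout differs slightly from the raw JSON release on hotpotqa.github.io):

```python
from datasets import load_dataset

# Distractor configuration: each example carries 10 context paragraphs.
ds = load_dataset("hotpot_qa", "distractor", split="validation")
ex = ds[0]

print(ex["question"])           # natural-language question
print(ex["answer"])             # span text or "yes"/"no"
print(ex["type"], ex["level"])  # coarse type and difficulty labels

# Gold supporting facts: parallel lists of paragraph titles and
# sentence indices within those paragraphs.
for title, sent_id in zip(ex["supporting_facts"]["title"],
                          ex["supporting_facts"]["sent_id"]):
    print("support:", title, sent_id)

# Context: 2 gold paragraphs + 8 TF-IDF distractors, each a title
# plus a list of sentences.
for title, sentences in zip(ex["context"]["title"],
                            ex["context"]["sentences"]):
    print(title, "-", len(sentences), "sentences")
```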

HotpotQA is divided into several subsets: train-easy (mostly single-hop), train-medium and train-hard (multi-hop), dev, test-distractor, and test-fullwiki, with the three training splits merged for standard model development.

2. Multi-hop Reasoning Paradigms and Question Taxonomy

HotpotQA explicitly targets diverse reasoning types:

  • Bridge entity questions (∼42%): inference over a chain, e.g., “Which team does the 2015 Diamond Head Classic MVP play for?”.
  • Comparison questions (∼27%): cross-document comparison, e.g., comparing properties of two entities.
  • Intersection (∼15%), property transfer (∼6%), and “Other” questions account for the remainder. There is a small fraction of single-hop (∼6%) and unanswerable (∼2%) cases.

Each question is annotated such that answering requires non-trivial compositional inference, often merging information from two or more paragraphs.
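
Only the coarse bridge/comparison label ships with the public data; the finer categories above come from the paper's manual annotation. A small sketch tallying the shipped labels (same Hugging Face hub copy assumed as above):

```python
from collections import Counter

from datasets import load_dataset

ds = load_dataset("hotpot_qa", "distractor", split="validation")

# Tally the coarse question-type labels exposed by the hub copy.
type_counts = Counter(ds["type"])
total = sum(type_counts.values())
for qtype, n in type_counts.most_common():
    print(f"{qtype}: {n} ({n / total:.1%})")
```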

3. Evaluation Metrics and Baseline Performance

HotpotQA uses robust metrics for both answer and evidence prediction:

  • Answer EM and F1: exact match and token-level overlap between predicted and gold answer spans.
  • Supporting-Fact EM and F1: set-level comparison of predicted and annotated supporting sentences.
  • Joint EM/F1: both answer and full support set must match for the instance to count as correct.
The baseline BiDAF++ model achieves 59% answer F1 in the distractor setting, with joint F1 of ∼41%; the fullwiki setting is substantially harder (∼34% answer F1) (Yang et al., 2018).
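
The joint metric is worth spelling out: the official `hotpot_evaluate_v1.py` script multiplies answer and supporting-fact precisions and recalls before forming F1. Below is a condensed re-implementation sketch (normalization is simplified relative to the official script, and yes/no edge cases are omitted):

```python
import re
import string
from collections import Counter

def normalize(s: str) -> str:
    """SQuAD-style: lowercase, strip punctuation and articles, fix whitespace."""
    s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def token_f1(pred: str, gold: str):
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    p, r = overlap / len(pred_toks), overlap / len(gold_toks)
    return p, r, 2 * p * r / (p + r)

def support_f1(pred_facts, gold_facts):
    """Facts are sets of (paragraph_title, sentence_index) pairs."""
    tp = len(pred_facts & gold_facts)
    p = tp / len(pred_facts) if pred_facts else 0.0
    r = tp / len(gold_facts) if gold_facts else 0.0
    return p, r, (2 * p * r / (p + r)) if p + r else 0.0

# Joint F1: multiply precisions and recalls across the two subtasks.
ans_p, ans_r, ans_f1 = token_f1("Arthur's Magazine", "Arthur's Magazine")
sp_p, sp_r, sp_f1 = support_f1({("A", 0), ("B", 2)}, {("A", 0), ("B", 2)})
joint_p, joint_r = ans_p * sp_p, ans_r * sp_r
joint_f1 = 2 * joint_p * joint_r / (joint_p + joint_r) if joint_p + joint_r else 0.0
print(ans_f1, sp_f1, joint_f1)  # 1.0 1.0 1.0 on this toy input
```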

The original paper also reports human performance: answer F1 ≈ 91.4% and supporting-fact F1 ≈ 90%, leaving a significant gap between state-of-the-art systems and humans.

4. Advances Enabled by HotpotQA: Architectures and Reasoning Techniques

HotpotQA benchmarking catalyzed several lines of research:

Retrieval and Decomposition:

  • Multi-hop Dense Retrieval (MDR) (Xiong et al., 2020)—iterative dense-encoded beam search, query reformulation at each hop; achieves 62.3/75.3 EM/F1 on the fullwiki test set, with much faster retrieval than entity-graph approaches.
  • Memory-augmented sequential paragraph retrievers (GMF) (Shao et al., 2021)—model paragraphs as sequences with external memory and gating, reaching 63.6 EM / 76.5 F1.
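
Both retrievers share the same control flow: retrieve, append the retrieved passage to the query, re-encode, and retrieve again, keeping a beam of the best chains. A schematic of that loop with NumPy and a placeholder encoder (MDR's real system uses trained RoBERTa encoders and a FAISS index; every function here is illustrative):

```python
import numpy as np

def encode(text: str) -> np.ndarray:
    """Placeholder for a trained dense encoder (hash-seeded random unit vector)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

def dense_search(query_vec, passage_vecs, k):
    """Inner-product top-k; a FAISS index plays this role at Wikipedia scale."""
    scores = passage_vecs @ query_vec
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]

def multi_hop_retrieve(question, passages, hops=2, beam=2):
    passage_vecs = np.stack([encode(p) for p in passages])
    chains = [((), 0.0, question)]  # (passage ids, score, reformulated query)
    for _ in range(hops):
        candidates = []
        for ids, score, query in chains:
            for pid, s in dense_search(encode(query), passage_vecs, k=beam):
                if pid in ids:
                    continue  # no repeated passages within a chain
                # Query reformulation: condition the next hop on this passage.
                candidates.append(
                    (ids + (pid,), score + s, query + " " + passages[pid]))
        chains = sorted(candidates, key=lambda c: -c[1])[:beam]
    return [(ids, score) for ids, score, _ in chains]

passages = ["Paragraph about entity A.", "Paragraph linking A to B.",
            "Paragraph about entity B.", "Unrelated distractor."]
print(multi_hop_retrieve("Which team does the MVP play for?", passages))
```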

Explicit Reasoning Chains and Structured Models:

  • Reasoning Chains (Chen et al., 2019)—pointer-network extracts sentence sequences forming a reasoning trace without annotator-provided chains; strong performance (F1 = 74.11) and interpretability.
  • Dynamically Fused Graph Networks (DFGN) (Xiao et al., 2019)—build entity graphs with dynamic query-guided masking and multi-hop GAT, outputting explicit reasoning paths.
  • Graph-based reasoning with hierarchical graph attention (HGN, GATH) (He et al., 2023)—multi-level graphs with ordered update of paragraph, sentence, and entity node representations; further augmented by direct query–sentence edges and hierarchical attention propagation.
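
These graph methods share a common skeleton: paragraph, sentence, and entity nodes linked by containment edges, over which attention layers propagate query-conditioned information. A construction-only sketch (node naming and the stub entity extractor are mine; HGN/GATH add typed edges and learned GAT layers on top of a structure like this):

```python
import networkx as nx

def build_hierarchical_graph(query, paragraphs):
    """paragraphs: list of (title, sentences). Entities would normally come
    from an NER model; capitalized tokens are a crude stand-in here."""
    g = nx.Graph()
    g.add_node("Q", kind="query", text=query)
    for p_idx, (title, sentences) in enumerate(paragraphs):
        p = f"P{p_idx}"
        g.add_node(p, kind="paragraph", title=title)
        g.add_edge("Q", p)                       # query-paragraph edge
        for s_idx, sent in enumerate(sentences):
            s = f"{p}.S{s_idx}"
            g.add_node(s, kind="sentence", text=sent)
            g.add_edge(p, s)                     # paragraph-sentence containment
            g.add_edge("Q", s)                   # direct query-sentence edge
            for tok in sent.split():
                if tok[:1].isupper():            # stub entity detector
                    g.add_node(tok, kind="entity")
                    g.add_edge(s, tok)           # sentence-entity edge
    return g

g = build_hierarchical_graph(
    "Which magazine was started first?",
    [("Arthur's Magazine", ["Arthur's Magazine was started in 1844."]),
     ("First for Women", ["First for Women was started in 1989."])])
print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")
```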

Retrieval-Augmented Generation & Prompt Engineering:

  • StepChain GraphRAG (Ni et al., 3 Oct 2025)—decomposes each question into sub-questions, expands a knowledge graph on the fly during retrieval, and attaches explicit evidence chains to every hop (detailed in Section 6).

End-to-End Simplicity and Joint Training:

  • The Quark pipeline (Groeneveld et al., 2020)—three-stage system (sentence selection, span prediction, answer-conditioned support retrieval), demonstrates that strong performance is achievable without explicit graphs or modular decomposition, attaining 67.75 EM / 81.21 F1.
  • Two-in-One joint ranking (Luo et al., 2021)—a single RoBERTa encoder and MLP heads jointly score passages and supporting sentences, using consistency and similarity constraints for tight coupling, yielding up to 85.82 F1.
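
Architecturally this is light: one shared encoder, one head per granularity, trained together. A minimal PyTorch sketch of the shared-encoder/two-head shape (Hugging Face `transformers` API; head sizes, sentence-marker handling, and the consistency constraints are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class JointRanker(nn.Module):
    """Shared RoBERTa encoder with two MLP heads: passage relevance
    from the [CLS] position, sentence scores from sentence-start positions."""
    def __init__(self, model_name="roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        def mlp():
            return nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
        self.passage_head, self.sentence_head = mlp(), mlp()

    def forward(self, input_ids, attention_mask, sent_starts):
        out = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        passage_score = self.passage_head(out[:, 0]).squeeze(-1)      # [CLS]
        sentence_scores = self.sentence_head(out[:, sent_starts]).squeeze(-1)
        return passage_score, sentence_scores

tok = AutoTokenizer.from_pretrained("roberta-base")
model = JointRanker()
enc = tok("Which magazine was started first? </s> Arthur's Magazine ...",
          return_tensors="pt")
# sent_starts would normally index inserted sentence-marker tokens;
# position 0 is used here purely for illustration.
p, s = model(enc["input_ids"], enc["attention_mask"],
             sent_starts=torch.tensor([0]))
```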

Specialized Innovations:

  • Self-Assembling Modular Networks (Jiang et al., 2019)—dynamic layout controller composes a differentiable program of reasoning modules for each instance; interpretable compositions correlate with human-designed layouts.

5. Generation, Unsupervised, and Transfer Approaches

HotpotQA is also leveraged for multi-hop question generation (QG):

  • Multi-task and RL-augmented QG (Gupta et al., 2020)—jointly predicts supporting-fact coverage, yielding BLEU-4 gains of +4.15 and human-rated fact coverage of 83%.
  • Unsupervised multi-hop QA (Pan et al., 2020)—MQA-QG generates synthetic data via a sequence of atomic operators, T5-based questioners, and GPT-2 perplexity filtering; a SpanBERT reader achieves 68.6 F1 (83% of supervised performance), and few-shot QG pretraining yields a 43-point F1 boost in 50-label settings.

Prompt-based continual learning (PCL) (Deng et al., 2022) demonstrates that freezing pretrained single-hop QA backbones and conditionally adding soft prompts for multi-hop reasoning types preserves both multi-hop and single-hop performance, reaching 71.76 EM/84.39 F1 on answers and 49.27/76.56 joint EM/F1 in the distractor setting.
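
Mechanically, the frozen-backbone-plus-soft-prompt recipe amounts to prepending a small learned embedding matrix, selected per reasoning type, to the token embeddings. A minimal sketch of that mechanism (prompt length, the type-selection interface, and the backbone are placeholders; PCL's actual prompt composition and selection are more involved):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SoftPromptQA(nn.Module):
    def __init__(self, model_name="roberta-base", prompt_len=20, n_types=2):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        for p in self.backbone.parameters():   # freeze the single-hop backbone
            p.requires_grad = False
        hidden = self.backbone.config.hidden_size
        # One learnable soft prompt per reasoning type (e.g. bridge, comparison).
        self.prompts = nn.Parameter(torch.randn(n_types, prompt_len, hidden) * 0.02)

    def forward(self, input_ids, attention_mask, type_id):
        # Word embeddings only; the backbone adds positional embeddings itself.
        emb = self.backbone.get_input_embeddings()(input_ids)
        prompt = self.prompts[type_id].unsqueeze(0).expand(emb.size(0), -1, -1)
        inputs = torch.cat([prompt, emb], dim=1)
        prompt_mask = torch.ones(emb.size(0), prompt.size(1),
                                 dtype=attention_mask.dtype,
                                 device=attention_mask.device)
        mask = torch.cat([prompt_mask, attention_mask], dim=1)
        return self.backbone(inputs_embeds=inputs, attention_mask=mask)

tok = AutoTokenizer.from_pretrained("roberta-base")
model = SoftPromptQA()
enc = tok("Which magazine was started first?", return_tensors="pt")
out = model(enc["input_ids"], enc["attention_mask"], type_id=0)
```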

6. Explainability and Chain-of-Thought Traceability

HotpotQA's sentence-level supporting fact supervision encourages development of interpretable models:

  • Explicit reasoning chains (DFGN, StepChain GraphRAG, Chain Extractor, Modular NMN) allow per-instance reconstruction of multi-hop chains, facilitating both debugging and human-in-the-loop evaluation (Xiao et al., 2019; Ni et al., 3 Oct 2025; Chen et al., 2019; Jiang et al., 2019).
  • Empirical studies confirm that chains mined or produced by leading models are nearly as informative and confidence-inspiring to humans as gold supporting facts (Chen et al., 2019).
  • StepChain GraphRAG (Ni et al., 3 Oct 2025) formalizes a full chain-of-thought interface, with on-the-fly knowledge graph expansion, explicit evidence chains for each decomposed sub-question, and complete traceability of intermediate decisions.
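
Stripped of the specific models, the traceable pattern is a loop: decompose, retrieve (optionally expanding a working graph), answer the sub-question, and log every step. A schematic of that control loop (the `decompose`/`retrieve`/`read` callables are placeholders for the system's models; this shows the interface shape, not StepChain GraphRAG's implementation):

```python
from dataclasses import dataclass, field

@dataclass
class TraceStep:
    sub_question: str
    evidence: list            # retrieved sentences backing this hop
    answer: str

@dataclass
class Trace:
    question: str
    steps: list = field(default_factory=list)
    final_answer: str = ""

def answer_with_trace(question, decompose, retrieve, read):
    """decompose(question, so_far) -> next sub-question or None,
    retrieve(sub_q) -> evidence sentences (may expand a knowledge graph),
    read(sub_q, evidence) -> answer string."""
    trace = Trace(question=question)
    so_far = []
    while (sub_q := decompose(question, so_far)) is not None:
        evidence = retrieve(sub_q)
        ans = read(sub_q, evidence)
        trace.steps.append(TraceStep(sub_q, evidence, ans))
        so_far.append((sub_q, ans))
    trace.final_answer = so_far[-1][1] if so_far else ""
    return trace  # every intermediate decision stays inspectable
```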

7. Impact, Limitations, and Open Challenges

HotpotQA established the field standard for multi-hop QA, driving improvements in open-domain retrieval, explicit structured reasoning, decompositional QA, and explainability. Despite substantial gains—SOTA models now achieve 66–72 EM, 79–84 F1 on answers in distractor settings—there remains a considerable gap to human-level performance.

Central remaining challenges include:

  • Robust multi-hop retrieval from large corpora under adversarial distractors and at real-world scale, as evidenced by the dramatic performance drop from the distractor to the fullwiki setting (Yang et al., 2018).
  • Mitigating model hallucinations and error compounding in decompositional and graph-based frameworks (Ni et al., 3 Oct 2025).
  • Reasoning over more than two facts, handling arithmetic and symbolic comparison, and integrating joint retrieval–reasoning pipelines.
  • Broad transfer to other multi-hop benchmarks (MuSiQue, 2WikiMultiHopQA) and continual learning scenarios (Deng et al., 2022).

Proposed research directions include uncertainty-aware decomposition, backtracking, fact verification during graph construction, and continual evolution of prompt and retrieval modules. HotpotQA continues to serve as a primary testbed for innovations in multi-hop, knowledge-intensive QA.
