
MuSiQue: Multihop QA Benchmark

Updated 4 February 2026
  • MuSiQue is a family of multihop QA benchmarks that uses directed acyclic graphs to enforce step-by-step, connected reasoning.
  • The benchmark employs stringent compositional filtering and an eight-stage construction pipeline to prevent shortcuts and artifacts in question construction.
  • Evaluation results and compression studies illustrate significant human-model gaps and guide advancements in retrieval, prompting, and model design.

MuSiQue is a family of multihop question answering (QA) benchmarks designed to rigorously assess a model’s capability to perform true multi-step, connected reasoning under both closed-book and open-domain retrieval settings. By enforcing structural properties and applying stringent compositional filtering, MuSiQue sets a new standard for testing multi-hop reasoning in natural language processing, revealing shortcomings in both pre-trained models and recent prompting-based approaches.

1. Formal Definition, Dataset Structure, and Construction

MuSiQue’s foundation rests on constructing k-hop questions through the composition of single-hop QA pairs, captured as directed acyclic reasoning graphs (DAGs) G_Q = (V, E), where each node q_i is a sub-question and an edge (q_j → q_i) indicates that q_i’s answer is contingent on q_j’s answer. Each k-hop question requires a chain or more complex graph structure (chains, trees, diamonds; six patterns in total), with each sub-question’s answer unavailable in context unless the preceding reasoning has been carried out.

Composability and Connectedness: Two single-hop pairs (q_1, a_1) and (q_2, a_2) are composable if a_1 is a named entity present verbatim in q_2, a_2 is not present in q_1, and the supporting paragraphs are distinct and disambiguated. The MuSiQue condition enforces that no model M, even a strong neural QA model, can answer any hop without information from its predecessors. This is formalized by requiring, for each edge (q_j → q_i),

M(q_i with a_j masked; C) ≠ a_i,   M(q_i; ∅) ≠ a_i

ensuring no one-hop shortcut or context artifact suffices (Trivedi et al., 2021).
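The masking condition can be sketched as a filter over candidate edges. This is a minimal sketch, not the authors' implementation: `model_answer` is a hypothetical stand-in for a strong single-hop QA model, and entity masking is simplified to string replacement.

```python
def mask_entity(text: str, entity: str) -> str:
    """Replace the predecessor's answer entity with a placeholder."""
    return text.replace(entity, "[MASK]")

def passes_disconnection_filter(model_answer, q_i: str, a_i: str,
                                a_j: str, context: str) -> bool:
    """Keep the edge (q_j -> q_i) only if q_i is unanswerable without
    a_j: neither masking a_j in the context nor dropping the context
    entirely should let the model recover a_i."""
    masked_ok = model_answer(q_i, mask_entity(context, a_j)) != a_i
    no_context_ok = model_answer(q_i, "") != a_i
    return masked_ok and no_context_ok
```

A model that only answers when the bridging entity is visible passes the filter; a shortcut model that guesses the answer regardless of context causes the edge to be rejected.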

Pipeline: The construction proceeds in eight algorithmic stages: filtering “bad” single-hop questions, extracting composable pairs, applying strong disconnection filtering, assembling higher-hop DAGs, partitioning splits to avoid leakage (no overlap in sub-questions, answers, paragraphs across splits), generating contexts with hard BM25 distractors, crowdsourcing final question composition for fluency and connectedness, and generating paired unanswerable variants (for MuSiQue-Full).
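The early pipeline stages hinge on the composability test defined above. A minimal sketch, with named-entity matching simplified to verbatim substring checks:

```python
def composable(q1: str, a1: str, q2: str, a2: str,
               para1_id: str, para2_id: str) -> bool:
    """Check whether (q1, a1) and (q2, a2) can be composed into a
    2-hop question: a1 must appear verbatim in q2 (so hop 2 depends
    on hop 1), a2 must not leak into q1, and the supporting
    paragraphs must be distinct."""
    return (a1 in q2) and (a2 not in q1) and (para1_id != para2_id)
```

In the full pipeline this check is followed by the disconnection filter and split partitioning; the sketch covers only the structural precondition.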

Variants:

  • MuSiQue-Ans: 24,814 multihop questions (2–4 hops), with detailed sub-question decomposition and answer spans; ~21,000 unique single-hop subquestions; 7,676 unique supporting paragraphs.
  • MuSiQue-Full: Adds an unanswerable variant for every answerable question (total 50,000), enforcing detection of answerability, answer, and supporting paragraphs in tandem.

Controlled properties include enforced connectedness, mixed-hop distribution, hard distractors, no leakage of information between splits, and a split of answerable/unanswerable contrasts (Trivedi et al., 2021).
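Each item bundles the question, its sub-question decomposition, and annotated supporting paragraphs. A minimal parsing sketch; the field names follow the publicly released JSONL format as I understand it and should be verified against your copy of the data:

```python
import json

def load_example(line: str) -> dict:
    """Parse one MuSiQue JSONL record into the fields most analyses
    need (field names are assumptions; check against the release)."""
    ex = json.loads(line)
    return {
        "question": ex["question"],
        "answer": ex.get("answer"),  # may be absent for unanswerable items
        "hops": [step["question"] for step in ex.get("question_decomposition", [])],
        "supporting": [p["idx"] for p in ex["paragraphs"] if p.get("is_supporting")],
        "answerable": ex.get("answerable", True),  # MuSiQue-Full only
    }
```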

2. Evaluation Protocols and Baseline Results

Evaluation is based on exact match (EM) and token-level F1 for answers, with paragraph-level EM/F1 for supporting paragraph retrieval. For unanswerable cases (MuSiQue-Full), composite metrics consider both answerability and support retrieval.
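The answer metrics can be sketched with the usual SQuAD-style normalization (lowercasing, article and punctuation removal); this is a generic sketch, not the official MuSiQue scorer:

```python
import re
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 over the bag of normalized tokens."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```

Paragraph-level EM/F1 applies the same logic to the predicted vs. gold sets of supporting-paragraph indices.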

Results on MuSiQue-Ans indicate a large gap between human and model performance:

| Dataset      | Human Answer F1 | Best Model F1  | Human–Model Gap |
|--------------|-----------------|----------------|-----------------|
| HotpotQA-20K | 84.5            | 74.9 (SA)      | 9.6             |
| 2Wiki-20K    | 83.2            | 79.8 (EX)      | 3.4             |
| MuSiQue-Ans  | 78.0            | 47.3 (EX(SA))  | 30.7            |

Disconnected-reasoning diagnostics (“cheatability”): MuSiQue-Ans restricts artifact-exploiting models to ~37.8% (answer) and ~63.4% (support), roughly half the levels seen on prior benchmarks. Naive splits or removal of the disconnection filtering produce a trivial task (model F1 up to 87.3, 1-Para F1 85.1).

Splits, metrics, and human annotation are fully documented; the ~30-point human–model F1 gap, and the severe collapse when models are shown only the context or only the question, confirm that connected reasoning is genuinely required (Trivedi et al., 2021).

3. Applications in LLM Compression and Model Design

MuSiQue exposes distinct sensitivities of reasoning vs. knowledge memorization for large reasoning model (LRM) compression:

  • Quantization, which keeps parameter count but reduces bit precision (to 2.51, 1.73, or 1.58 bits), is highly effective: e.g., DeepSeek-R1 (FP16, 671B params) degrades only from EM 17.0 / F1 27.51 to EM 14.0 / F1 22.34 under the most aggressive quantization.
  • Distillation/pruning inflicts sharply steeper drops: R1-Distill-Llama-70B drops to EM 13.0, F1 21.8; Distill-Qwen-32B collapses to near zero (EM 1.0). Aggressive pruning leads to catastrophic failure beyond 30% sparsity.

Empirical finding: parameter count is far more critical for retaining the parametric knowledge needed by MuSiQue than for reasoning per se—quantization preserves knowledge; size reduction destroys it. Output conciseness is key: short outputs (“shortest 30%”) achieve up to 30% EM and 42.8% F1, while verbose outputs (“longest 30%”) deteriorate to 3.3% EM and 10.0% F1 (Zhang et al., 2 Apr 2025).
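The length analysis can be reproduced with a simple percentile split. A sketch under the assumption that `outputs` pairs each generated answer with its per-example EM score:

```python
def em_by_length_bucket(outputs, frac=0.3):
    """Split (text, em_score) pairs into the shortest and longest
    `frac` by token length and compare mean EM across the buckets."""
    ranked = sorted(outputs, key=lambda o: len(o[0].split()))
    k = max(1, int(len(ranked) * frac))
    mean = lambda xs: sum(em for _, em in xs) / len(xs)
    return {"shortest": mean(ranked[:k]), "longest": mean(ranked[-k:])}
```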

Compression recommendations:

  • Favor quantization over distillation/pruning for closed-book multihop tasks.
  • Limit pruning to ≤30% sparsity.
  • Use the largest possible student for distillation.
  • Enforce answer brevity for higher closed-book accuracy.

4. Advances in Retrieval and Multi-Hop QA Systems

MuSiQue and MuSiQue-Ans have catalyzed innovation in retriever and reader design for multi-hop tasks:

  • Beam Retrieval (Zhang et al., 2023): Introduces an end-to-end beam-based retriever with joint optimization and multiple hypothesis chains per hop, showing a +44.6% EM gain (53.5→79.3) and +20.2 Answer F1 points (49.0→69.2) over prior art.
  • Success factors include supervision across all hops, beam expansion to cover multiple chains (rescuing retrieval from dead ends), and a two-head classifier architecture. Beam-size consistency and robust training are essential; shuffling passage order further improves support/answer F1 by 1–2 points.

This framework delivers new state-of-the-art retrieval and answer extraction, especially at 3–4 hops where error propagation and non-local reasoning are most challenging. End-to-end, multi-hop supervision directly tackles error accumulation and amplifies margin between real vs. spurious reasoning chains.
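The core idea, keeping several candidate passage chains alive per hop rather than committing greedily, can be sketched as follows. `score_chain` is a hypothetical scorer (in Beam Retrieval, a jointly trained encoder) mapping a chain of passage ids to a relevance score:

```python
import heapq

def beam_retrieve(passages, score_chain, hops: int, beam_size: int = 4):
    """Keep the `beam_size` highest-scoring passage chains at each
    hop, so a wrong first hop can still be recovered later."""
    beams = [((), 0.0)]
    for _ in range(hops):
        candidates = []
        for chain, _ in beams:
            for pid in passages:
                if pid in chain:
                    continue
                new_chain = chain + (pid,)
                candidates.append((new_chain, score_chain(new_chain)))
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[1])
    return beams[0][0]  # best chain after the final hop
```

With beam size 1 this collapses to greedy retrieval and can be trapped by a locally attractive but wrong first hop; a wider beam recovers the globally best chain.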

5. Reasoning-Oriented Prompting and LLM Techniques

A suite of prompting and control paradigms has been evaluated on MuSiQue, revealing its diagnostic power for complex reasoning:

  • Self-prompted Chain-of-Thought (SP-CoT) (Wang et al., 2023): An automated regime for in-context CoT exemplar selection and chain generation, SP-CoT improves plain zero-shot EM/F1 several-fold (14.5/22.6 vs. 3.1/7.3) and outperforms Auto-CoT and manual CoT. Intermediate-answer recall reaches ~50%, reflecting direct, transparent reasoning. Even 13B-parameter LLMs can reach ~10% EM with SP-CoT.
  • FSM-based prompting (Wang et al., 2024): Models the multi-hop process as a finite state machine with explicit decomposition, search, judge-if-continue, and summarize states. FSM-based prompting (FSM1) doubles standard zero-shot F1 (to 38.4 on GPT-3.5, 41.2 on Qwen-72B) and guarantees 100% output format compliance. Major error sources—reasoning loss, decomposition, formatting, hallucination—are mitigated by explicit state enforcement and turn-wise correction.
  • Output length and structure are pivotal: concise, minimal chains correlate with higher accuracy; verbose chains introduce error.
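The FSM control loop described above can be sketched as an explicit state machine. The four callables are hypothetical stand-ins for LLM calls with state-specific prompts, not the paper's implementation:

```python
from enum import Enum, auto

class State(Enum):
    DECOMPOSE = auto()   # produce the next sub-question
    SEARCH = auto()      # retrieve/answer the sub-question
    JUDGE = auto()       # decide: continue or finish
    SUMMARIZE = auto()   # compose the final answer

def fsm_qa(question, decompose, search, judge, summarize):
    """Drive one multi-hop QA episode through the four states,
    looping DECOMPOSE -> SEARCH -> JUDGE until JUDGE signals done."""
    state, facts, sub_q = State.DECOMPOSE, [], None
    while True:
        if state is State.DECOMPOSE:
            sub_q = decompose(question, facts)
            state = State.SEARCH
        elif state is State.SEARCH:
            facts.append(search(sub_q))
            state = State.JUDGE
        elif state is State.JUDGE:
            state = State.SUMMARIZE if judge(question, facts) else State.DECOMPOSE
        else:  # State.SUMMARIZE
            return summarize(question, facts)
```

Because every transition is explicit, output-format compliance and turn-wise correction can be enforced at each state boundary, which is the mechanism the FSM results above attribute their error reductions to.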

6. Comparative Dataset Properties and Benchmarking Impact

MuSiQue distinguishes itself from previous benchmarks (HotpotQA, 2WikiMultihopQA) by construction:

  • Larger human–model performance gaps and lower disconnected-reasoning scores.
  • Controlled, artifact-resistant splits: no train/test overlap in sub-questions, answers, or paragraphs.
  • Stringent filtering and distractors to preclude exploitable artifacts.
  • Variable reasoning-graph topologies and answer types (span, yes/no, comparison) with short answer lengths (typically 1–2 tokens).
  • Higher challenge reflected in model ablations: removing disconnection filtering or using naive splits trivializes the dataset, underscoring the necessity of design rigor (Trivedi et al., 2021, Zhang et al., 2023).
  • Closed-book vs. retrieval settings: Retrieval-augmented settings yield far higher EM/F1, highlighting the closed-book difficulty and the reliance on models’ parametric world knowledge.

7. Practical Insights, Open Directions, and Recommendations

MuSiQue’s controlled design and depth render it a touchstone for developing, diagnosing, and compressing multi-hop reasoning systems:

  • For closed-book settings, preserving parameter count is essential; quantization minimally compromises reasoning or knowledge.
  • Enforcing output brevity—via prompt design, early stopping, or explicit penalties—substantially improves EM/F1.
  • Retrievers and step-execution methods must be evaluated in settings where every hop is indispensable.
  • The dataset’s robustness against shortcutting and artifacts makes it suitable for ongoing study of LLM prompting, retrieval, and compression protocols.

A plausible implication is that future advances in reasoning-focused architectures, compression with knowledge preservation, and explicit multi-hop control logic can be reliably assessed and guided using MuSiQue as a primary benchmark. The dataset’s stringent construction and comprehensive annotation support both research rigor and systematic ablation analysis across QA paradigms (Zhang et al., 2 Apr 2025, Zhang et al., 2023, Trivedi et al., 2021, Wang et al., 2023, Wang et al., 2024).
