MuSiQue: Multihop QA Benchmark
- MuSiQue is a family of multihop QA benchmarks that uses directed acyclic graphs to enforce step-by-step, connected reasoning.
- The benchmark employs stringent compositional filtering and an eight-stage construction pipeline to preclude shortcuts and annotation artifacts.
- Evaluation results and compression studies illustrate significant human-model gaps and guide advancements in retrieval, prompting, and model design.
MuSiQue is a family of multihop question answering (QA) benchmarks designed to rigorously assess a model’s capability to perform true multi-step, connected reasoning under both closed-book and open-domain retrieval settings. By enforcing structural properties and applying stringent compositional filtering, MuSiQue sets a new standard for testing multi-hop reasoning in natural language processing, revealing shortcomings in both pre-trained models and recent prompting-based approaches.
1. Formal Definition, Dataset Structure, and Construction
MuSiQue’s foundation rests on constructing n-hop questions through the composition of single-hop QA pairs, captured as directed acyclic reasoning graphs (DAGs) whose nodes are sub-questions; an edge q_i → q_j indicates that q_j’s answer is contingent on q_i’s answer. Each n-hop question requires a chain or more complex graph structure (chains, trees, diamonds; six patterns in total), with each sub-question’s answer unavailable in context unless the preceding reasoning is carried out.
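The DAG structure can be sketched minimally in Python. The placeholder convention (`#1` standing in for the first hop's answer) mirrors MuSiQue's decomposition format; the class, field names, and example questions are illustrative assumptions, not dataset internals:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a MuSiQue-style reasoning DAG: nodes are single-hop
# sub-questions; a placeholder like "#1" in a question marks where a
# predecessor's answer must be substituted before the hop can be posed.
# All names and strings here are illustrative, not taken from the dataset.

@dataclass
class SubQuestion:
    qid: str
    text: str                                    # may contain "#<qid>" placeholders
    answer: str
    parents: list = field(default_factory=list)  # qids whose answers this hop needs

def resolve(subqs: dict, qid: str) -> str:
    """Substitute each predecessor's answer into this sub-question's text."""
    sq = subqs[qid]
    text = sq.text
    for p in sq.parents:
        text = text.replace(f"#{p}", subqs[p].answer)
    return text

# A 2-hop chain: hop 2 cannot even be posed until hop 1 is answered.
subqs = {
    "1": SubQuestion("1", "Who directed Inception?", "Christopher Nolan"),
    "2": SubQuestion("2", "In what year was #1 born?", "1970", parents=["1"]),
}
print(resolve(subqs, "2"))  # -> In what year was Christopher Nolan born?
```

The substitution step is exactly what makes the reasoning "connected": without hop 1's answer, hop 2 is underspecified.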
Composability and Connectedness: Two single-hop pairs (q_1, a_1) and (q_2, a_2) are composable if a_1 is a named entity present verbatim in q_2, a_2 is not present in q_2, and their supporting paragraphs are distinct and disambiguated. The MuSiQue condition enforces that no model, even a strong neural QA model, can answer any hop without information from its predecessors. This is formalized by requiring, for each edge q_i → q_j, that q_j remain unanswerable to a strong single-hop model when q_i’s answer is withheld, ensuring no one-hop shortcut or context artifact suffices (Trivedi et al., 2021).
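The surface-level composability check can be sketched as a simple predicate; the example strings are illustrative, and the full pipeline additionally requires a_1 to be a named entity and applies model-based disconnection filtering, which this sketch omits:

```python
def composable(q1: str, a1: str, q2: str, a2: str) -> bool:
    """Surface-level composability: a1 must appear verbatim in q2
    (so q2 depends on q1's answer), and a2 must not appear in q2
    (so q2's own answer is not leaked). Named-entity checks and
    model-based disconnection filtering are omitted here."""
    return a1 in q2 and a2 not in q2

# q2 contains q1's answer verbatim and does not leak its own answer:
assert composable(
    "Who directed Inception?", "Christopher Nolan",
    "When was Christopher Nolan born?", "1970",
)
```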
Pipeline: The construction proceeds in eight algorithmic stages: filtering “bad” single-hop questions, extracting composable pairs, applying strong disconnection filtering, assembling higher-hop DAGs, partitioning splits to avoid leakage (no overlap in sub-questions, answers, paragraphs across splits), generating contexts with hard BM25 distractors, crowdsourcing final question composition for fluency and connectedness, and generating paired unanswerable variants (for MuSiQue-Full).
Variants:
- MuSiQue-Ans: 24,814 multihop questions (2–4 hops), with detailed sub-question decomposition and answer spans; ~21,000 unique single-hop sub-questions; 7,676 unique supporting paragraphs.
- MuSiQue-Full: Adds an unanswerable variant for every answerable question (total 50,000), requiring models to predict answerability, the answer, and the supporting paragraphs in tandem.
Controlled properties include enforced connectedness, a mixed-hop distribution, hard distractors, no information leakage between splits, and paired answerable/unanswerable contrasts (Trivedi et al., 2021).
2. Evaluation Protocols and Baseline Results
Evaluation is based on exact match (EM) and token-level F1 for answers, with paragraph-level EM/F1 for supporting paragraph retrieval. For unanswerable cases (MuSiQue-Full), composite metrics consider both answerability and support retrieval.
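These metrics follow the standard extractive-QA recipe; a minimal sketch of answer-level EM and token-level F1, assuming SQuAD-style normalization (lowercasing, stripping punctuation and articles), which may differ in detail from the official evaluation script:

```python
import re
from collections import Counter

def normalize(s: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = re.sub(r"[^\w\s]", " ", s)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    p, g = normalize(pred).split(), normalize(gold).split()
    common = sum((Counter(p) & Counter(g)).values())  # multiset overlap
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Christopher Nolan", "the director Christopher Nolan"))  # -> 0.8
```

Paragraph-level EM/F1 applies the same overlap logic to the predicted set of supporting paragraph indices rather than to answer tokens.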
Results on MuSiQue-Ans indicate a large gap between human and model performance:
| Dataset | Human Answer F1 | Best Model F1 | Human–Model Gap |
|---|---|---|---|
| HotpotQA-20K | 84.5 | 74.9 (SA) | 9.6 |
| 2Wiki-20K | 83.2 | 79.8 (EX) | 3.4 |
| MuSiQue-Ans | 78.0 | 47.3 (EX(SA)) | 30.7 |
Disconnected reasoning diagnostics (“cheatability”): MuSiQue-Ans restricts artifact-exploiting models to ~37.8% (answer) and ~63.4% (support), about half that of prior benchmarks. Naive splits or removal of disconnection filtering produces trivial tasks (model F1 up to 87.3, 1-Para F1 85.1).
Splits, metrics, and human annotation protocols are fully documented; models trail humans by ~30 F1 points and collapse severely when presented with only the context or only the question (Trivedi et al., 2021).
3. Applications in LLM Compression and Model Design
MuSiQue exposes distinct sensitivities of reasoning vs. knowledge memorization for large reasoning model (LRM) compression:
- Quantization, which keeps parameter count while reducing bit precision (to 2.51, 1.73, or 1.58 bits), is highly effective: DeepSeek-R1 (FP16, 671B params) degrades only from EM 17.0 / F1 27.51 to EM 14.0 / F1 22.34 under the most aggressive quantization.
- Distillation and pruning inflict sharply steeper drops: R1-Distill-Llama-70B falls to EM 13.0, F1 21.8; Distill-Qwen-32B collapses to near zero (EM 1.0). Aggressive pruning leads to catastrophic failure beyond 30% sparsity.
Empirical finding: parameter count is far more critical for retaining the parametric knowledge needed by MuSiQue than for reasoning per se—quantization preserves knowledge; size reduction destroys it. Output conciseness is key: short outputs (“shortest 30%”) achieve up to 30% EM and 42.8% F1, while verbose outputs (“longest 30%”) deteriorate to 3.3% EM and 10.0% F1 (Zhang et al., 2 Apr 2025).
Compression recommendations:
- Favor quantization over distillation/pruning for closed-book multihop tasks.
- Limit pruning to ≤30% sparsity.
- Use the largest possible student for distillation.
- Enforce answer brevity for higher closed-book accuracy.
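The brevity recommendation can be operationalized in several ways; one minimal sketch is to sample several candidate answers and keep the shortest, following the observation above that concise outputs correlate with higher closed-book EM/F1. Here `generate` is a hypothetical stand-in for any LLM call; no specific API is implied:

```python
# Brevity heuristic sketch: sample n candidates, keep the shortest by token
# count. `generate` is a hypothetical callable mapping a question to a string.
def shortest_answer(generate, question: str, n: int = 5) -> str:
    candidates = [generate(question) for _ in range(n)]
    return min(candidates, key=lambda c: len(c.split()))
```

In practice the same effect can also be achieved earlier in the pipeline, via prompt instructions, a low max-token limit, or length penalties during decoding.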
4. Advances in Retrieval and Multi-Hop QA Systems
MuSiQue and MuSiQue-Ans have catalyzed innovation in retriever and reader design for multi-hop tasks:
- Beam Retrieval (Zhang et al., 2023): Introduces end-to-end beam-based retriever with joint optimization and multiple hypothesis chains per hop, showing a +44.6% EM gain (53.5→79.3) and +20.2 Answer F1 points (49.0→69.2) over prior art.
- Success factors include supervision across all hops, beam expansion to cover multiple chains (rescuing the search from dead ends), and a two-head classifier architecture. Beam-size consistency and robust training are essential; shuffling passage order further improves support/answer F1 by 1–2 points.
This framework delivers new state-of-the-art retrieval and answer extraction, especially at 3–4 hops, where error propagation and non-local reasoning are most challenging. End-to-end, multi-hop supervision directly tackles error accumulation and widens the margin between real and spurious reasoning chains.
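The core idea of beam search over passage chains can be sketched as follows; `score` stands in for the learned chain scorer, and the uniform expansion over all passages is a simplification of the actual retriever:

```python
# Hedged sketch of beam retrieval: keep the top-B partial passage chains per
# hop instead of committing greedily to one passage at each step. `score`
# is a hypothetical stand-in for a learned scorer over (chain, passage).
def beam_retrieve(passages, score, hops: int, beam: int = 4):
    chains = [((), 0.0)]  # (partial chain of passages, cumulative score)
    for _ in range(hops):
        expanded = []
        for chain, s in chains:
            for p in passages:
                if p in chain:            # a passage appears at most once
                    continue
                expanded.append((chain + (p,), s + score(chain, p)))
        # keep the top-B hypotheses, rescuing chains greedy search would drop
        chains = sorted(expanded, key=lambda cs: cs[1], reverse=True)[:beam]
    return chains[0][0]
```

With beam = 1 this degenerates to greedy hop-by-hop retrieval, which is exactly where dead ends arise: a locally best first passage can lead to a poor overall chain that a wider beam avoids.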
5. Reasoning-Oriented Prompting and LLM Techniques
A suite of prompting and control paradigms has been evaluated on MuSiQue, revealing its diagnostic power for complex reasoning:
- Self-prompted Chain-of-Thought (SP-CoT) (Wang et al., 2023): An automated regime for in-context CoT selection and chain generation, SP-CoT lifts plain zero-shot EM/F1 from 3.1/7.3% to 14.5/22.6% and outperforms Auto-CoT and manual CoT. Intermediate-answer recall reaches ~50%, reflecting directness and transparency in reasoning. Even 13B-parameter LLMs can reach ~10% EM with SP-CoT.
- FSM-based prompting (Wang et al., 2024): Models the multi-hop process as a finite state machine with explicit decomposition, search, judge-if-continue, and summarize states. FSM-based prompting (FSM1) doubles standard zero-shot F1 (to 38.4 on GPT-3.5, 41.2 on Qwen-72B) and guarantees 100% output format compliance. Major error sources—reasoning loss, decomposition, formatting, hallucination—are mitigated by explicit state enforcement and turn-wise correction.
- Output length and structure are pivotal: concise, minimal chains correlate with higher accuracy; verbose chains introduce error.
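The FSM-style control loop can be sketched as an explicit state machine with decompose, search, judge, and summarize states, each mapped to one model or retriever call. Here `llm` and `retrieve` are hypothetical stand-ins; the state set follows the description above, not any particular implementation:

```python
# Hedged sketch of FSM-based prompting: explicit states instead of one
# free-form generation, with a hard iteration cap to guard against loops.
# `llm` and `retrieve` are hypothetical callables (prompt -> text).
def fsm_answer(question, llm, retrieve, max_hops=4):
    state, facts, subq = "DECOMPOSE", [], None
    for _ in range(max_hops * 3):  # cap mitigates reasoning-loop errors
        if state == "DECOMPOSE":
            subq = llm(f"Next sub-question for: {question}\nKnown: {facts}")
            state = "SEARCH"
        elif state == "SEARCH":
            facts.append(retrieve(subq))
            state = "JUDGE"
        elif state == "JUDGE":
            done = llm(f"Can '{question}' be answered from {facts}? yes/no")
            state = "SUMMARIZE" if done.strip().lower() == "yes" else "DECOMPOSE"
        else:  # SUMMARIZE: produce the final answer from gathered facts only
            return llm(f"Answer '{question}' using only: {facts}")
    return None  # budget exhausted without reaching SUMMARIZE
```

Because each state emits exactly one constrained call, output-format compliance and turn-wise correction fall out of the structure rather than relying on the model to self-organize a long chain.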
6. Comparative Dataset Properties and Benchmarking Impact
MuSiQue distinguishes itself from previous benchmarks (HotpotQA, 2WikiMultihopQA) by construction:
- Larger human–model performance gaps and lower disconnected-reasoning scores.
- Controlled, artifact-resistant splits: no train/test overlap in sub-questions, answers, or paragraphs.
- Stringent filtering and distractors to preclude exploitable artifacts.
- Variable reasoning-graph topologies and answer types (span, yes/no, comparison) with short answer lengths (typically 1–2 tokens).
- Higher challenge reflected in model ablations: removing disconnection filtering or using naive splits trivializes the dataset, underscoring the necessity of design rigor (Trivedi et al., 2021, Zhang et al., 2023).
- Closed-book vs. retrieval settings: Retrieval-augmented settings yield far higher EM/F1, highlighting the closed-book difficulty and the reliance on models’ parametric world knowledge.
7. Practical Insights, Open Directions, and Recommendations
MuSiQue’s controlled design and depth render it a touchstone for developing, diagnosing, and compressing multi-hop reasoning systems:
- For closed-book settings, preserving parameter count is essential; quantization minimally compromises reasoning or knowledge.
- Enforcing output brevity—via prompt design, early stopping, or explicit penalties—substantially improves EM/F1.
- Retrievers and step-execution methods must be evaluated in settings where every hop is indispensable.
- The dataset’s robustness against shortcutting and artifacts makes it suitable for ongoing study of LLM prompting, retrieval, and compression protocols.
A plausible implication is that future advances in reasoning-focused architectures, compression with knowledge preservation, and explicit multi-hop control logic can be reliably assessed and guided using MuSiQue as a primary benchmark. The dataset’s stringent construction and comprehensive annotation support both research rigor and systematic ablation analysis across QA paradigms (Zhang et al., 2 Apr 2025, Zhang et al., 2023, Trivedi et al., 2021, Wang et al., 2023, Wang et al., 2024).