MuSiQue Condition in Multihop QA
- The MuSiQue condition is a formal set of constraints that ensures genuine multihop reasoning by requiring subquestions that depend on one another and on the context.
- The eight-stage construction pipeline employs techniques like disconnection filtering, adversarial testing, and crowdsourced recomposition to enforce the condition.
- Empirical evaluations show large performance gaps between models that rely on shortcut strategies and models that perform fully connected reasoning, indicating that the condition is hard to circumvent.
The MuSiQue condition is a formally defined set of constraints that ensures genuine, connected multihop reasoning in question answering (QA) benchmarks. Introduced in the context of the MuSiQue dataset, the condition enforces that the answer to a multihop question cannot be determined unless a QA system executes all required reasoning steps, thereby excluding shortcuts exploitable by disconnected or single-hop models (Trivedi et al., 2021).
1. Formal Definition and Motivation
The MuSiQue condition addresses a crucial limitation in prior multihop QA benchmarks, where systems could often bypass genuine reasoning by leveraging dataset-specific shortcuts. Formally, a multihop question $Q$ in MuSiQue is represented as a directed acyclic graph (DAG) $G_Q$ whose nodes are single-hop questions $q_i$ (with respective answers $a_i$) and whose edges $(q_j, q_i)$ indicate that $q_i$ depends critically on knowing $a_j$.
The MuSiQue condition is expressed as:
$$\forall\, (q_j, q_i) \in \mathrm{edges}(G_Q):\quad M(q_i^{m_j};\, C) \neq a_i \;\wedge\; M(q_i;\, \phi) \neq a_i$$
Here, $M$ is a strong QA model; $q_i^{m_j}$ is $q_i$ with all surface mentions of $a_j$ masked; $C$ denotes the full set of context paragraphs; and $\phi$ denotes the empty context. The condition requires that:
- No model can answer any subquestion $q_i$ if its predecessor’s answer $a_j$ is masked from $q_i$ and context $C$.
- No subquestion can be solved purely from its wording in the absence of context.
This conjunction ensures that the subquestions are interdependent and contextually grounded, preventing models from exploiting disconnected “shortcut” reasoning.
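The two requirements can be illustrated with a minimal sketch. It assumes exact string match in place of the F1 thresholds used in practice, a hypothetical `[MASK]` token, and two toy stand-in models (`grounded_model`, `shortcut_model`); none of these names come from the original work.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

MASK = "[MASK]"  # hypothetical mask token, not taken from the source

# A model maps (question, context) to a predicted answer string.
# context=None stands for the empty context.
Model = Callable[[str, Optional[List[str]]], str]


def mask_mentions(question: str, answer: str) -> str:
    """Replace every surface mention of `answer` in `question` (case-insensitive)."""
    return re.sub(re.escape(answer), MASK, question, flags=re.IGNORECASE)


def satisfies_condition(
    questions: Dict[int, str],
    answers: Dict[int, str],
    edges: List[Tuple[int, int]],  # (j, i): question i depends on answer j
    context: List[str],
    model: Model,
) -> bool:
    """Check both requirements for one model, using exact match (a simplification)."""
    # Requirement 1: masking the predecessor's answer must break the subquestion.
    for j, i in edges:
        masked_q = mask_mentions(questions[i], answers[j])
        if model(masked_q, context) == answers[i]:
            return False
    # Requirement 2: no subquestion is answerable without any context.
    for i, q in questions.items():
        if model(q, None) == answers[i]:
            return False
    return True


# Toy two-hop example (illustrative data only).
QUESTIONS = {1: "Who wrote Dune?", 2: "Where was Frank Herbert born?"}
ANSWERS = {1: "Frank Herbert", 2: "Tacoma"}
EDGES = [(1, 2)]
CONTEXT = ["Dune was written by Frank Herbert.", "Frank Herbert was born in Tacoma."]


def grounded_model(question: str, context: Optional[List[str]]) -> str:
    """Toy model that needs both the unmasked question and a context."""
    if context is None or MASK in question:
        return ""
    return ANSWERS[1] if question == QUESTIONS[1] else ANSWERS[2]


def shortcut_model(question: str, context: Optional[List[str]]) -> str:
    """Toy model that guesses from a surface cue and ignores the bridge entity."""
    return "Tacoma" if "born" in question else ""
```

With these toy models, `satisfies_condition(QUESTIONS, ANSWERS, EDGES, CONTEXT, grounded_model)` returns `True`, while the same call with `shortcut_model` returns `False`, because the shortcut model still produces the tail answer after the bridge entity is masked.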
2. Construction Pipeline Enforcing the MuSiQue Condition
The MuSiQue condition is rigorously enforced through an eight-stage pipeline:
- Curating Valid Single-Hop QA: Over 500,000 triples sourced from SQuAD, Natural Questions, MLQA, T-REx, and Zero-Shot RE are filtered to retain only high-quality non-trivial questions that can be recovered (F1 > 0) by at least one of five large pretrained QA models.
- Enumerating Composable Pairs: Pairs $(q_1, p_1, a_1)$ and $(q_2, p_2, a_2)$ are composable if $a_1$ is a named entity present in $q_2$, $a_2$ does not occur in $q_1$, and $p_1 \neq p_2$, with robust entity cross-checks using SpaCy types, Wikipedia search, and Wikification models.
- Disconnection Filtering: Longformer-based QA models are trained to detect and discard subquestions answerable without predecessors’ answers. A head is rejected if the model's F1 on $q_1$ alone exceeds a fixed threshold; a tail is rejected if $a_2$ is recovered with answer F1 and support F1 above fixed thresholds even after masking $a_1$ and with distractors present.
- Building Larger DAGs: Filtered pairs are composed into DAGs of up to four hops, maintaining bounds on question length, token counts, and limiting the reuse of components to maintain diversity.
- Minimizing Train/Test Leakage: Multihop questions overlap if they share any subquestion, answer, or support paragraph; train/dev/test splits are greedily constructed to minimize such overlap and prevent trivial generalization via memorization.
- Constructing Contexts: Each $k$-hop question is paired with $k$ gold paragraphs and $20-k$ distractors retrieved from the retained single-hop contexts, using masked queries, to increase the difficulty and reduce domain shift exploitation.
- Crowdsourced Recomposition: Annotators compose natural-language multihop questions corresponding to each DAG, ensuring bridge entities are co-referential and all reasoning hops are invoked implicitly and necessarily.
- Unanswerable Contrast Pairs: For MuSiQue-Full, each answerable question is paired with an unanswerable variant by removing a subquestion’s answer string from all paragraphs, thus requiring models to discern answerability as part of the main task.
By construction, each MuSiQue-Ans instance satisfies the MuSiQue condition with respect to the strong trained models used in the pipeline.
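Two of the stages above, pair composability and context construction, can be sketched as follows. This is a simplified illustration under stated assumptions: the composability test uses plain substring matching instead of entity typing and linking, the requirement that the two supporting paragraphs differ is assumed, and distractors are drawn at random rather than retrieved with masked queries. The names `SingleHop`, `composable`, and `build_context` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SingleHop:
    question: str
    paragraph: str
    answer: str


def composable(head: SingleHop, tail: SingleHop) -> bool:
    """Surface-level composability test; entity typing and linking are omitted."""
    return (
        # The head answer acts as the bridge entity inside the tail question.
        head.answer.lower() in tail.question.lower()
        # The tail answer must not already appear in the head question.
        and tail.answer.lower() not in head.question.lower()
        # Assumed: the two hops rely on different supporting paragraphs.
        and head.paragraph != tail.paragraph
    )


def build_context(gold: List[str], pool: List[str], total: int = 20, seed: int = 0) -> List[str]:
    """Return `total` paragraphs: every gold paragraph plus sampled distractors.

    Random sampling is a stand-in for retrieval with masked queries.
    Raises ValueError if the pool has too few non-gold paragraphs.
    """
    candidates = [p for p in pool if p not in gold]
    rng = random.Random(seed)
    distractors = rng.sample(candidates, total - len(gold))
    context = gold + distractors
    rng.shuffle(context)  # avoid positional cues about which paragraphs are gold
    return context


# Illustrative data only.
head = SingleHop("Who wrote Dune?", "Dune was written by Frank Herbert.", "Frank Herbert")
tail = SingleHop("Where was Frank Herbert born?", "Frank Herbert was born in Tacoma.", "Tacoma")
pool = [f"Distractor paragraph {n}." for n in range(30)]
context = build_context([head.paragraph, tail.paragraph], pool)
```

For a 2-hop question, `build_context` returns 2 gold paragraphs and 18 distractors, matching the $20-k$ rule for $k = 2$.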
3. Theoretical Implications and Connected Reasoning
The central aim of the MuSiQue condition is the enforcement of connected reasoning. Only by executing every inferential hop in the DAG can a model arrive at the correct final answer. Masking strategies and context requirements are explicitly designed to block disconnected, purely lexical, or memorization-based approaches. If a single predecessor’s information is masked from a subquestion, a compliant model cannot recover the subanswer. Similarly, no subquestion can be answered without the corresponding supporting context.
This formalism constrains possible system behaviors, leading to tasks that probe decomposition, bridge-entity resolution, and multi-step alignment, which are key desiderata for robust multihop QA evaluation.
4. Empirical Evaluation and Benchmark Difficulty
The empirical findings demonstrate that datasets satisfying the MuSiQue condition are substantially more challenging than prior multihop benchmarks:
- A single-paragraph baseline achieves ~65 F1 on HotpotQA (20K) but only ~32 F1 on MuSiQue-Ans.
- A question-only model obtains 19 F1 on HotpotQA, 27 F1 on 2WikiMultihopQA, and only ~5 F1 on MuSiQue-Ans, reflecting strong contextualization requirements.
- The DiRe “cheatability” metric is ~69 F1 on HotpotQA but drops to ~38 F1 on MuSiQue-Ans, indicating that shortcutting the composition process is far less effective.
- The human-machine gap is roughly three times that of HotpotQA: 78.0 vs. 49.8 F1 on MuSiQue (28.2 points), compared to 84.5 vs. 74.9 F1 on HotpotQA (9.6 points).
For MuSiQue-Full, requiring simultaneous answerability judgment and answer extraction further reduces F1 by 14–17 points (answer+selective-support) and 33–44 points (support+answerability). “Cheating” models collapse near zero.
The comparison is summarized in the following table:
| Dataset | Single-Paragraph F1 | Question-Only F1 | DiRe F1 | Human vs. Machine F1 |
|---|---|---|---|---|
| HotpotQA (20K) | ~65 | 19 | ~69 | 84.5 vs. 74.9 |
| 2WikiMultihop | — | 27 | — | 83.2 vs. 79.5 |
| MuSiQue-Ans | ~32 | ~5 | ~38 | 78.0 vs. 49.8 |
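The human-machine gaps implied by the last column can be checked with a few lines of arithmetic. The scores are copied from the table; the variable names are illustrative.

```python
# (human F1, machine F1) per dataset, taken from the table above.
scores = {
    "HotpotQA (20K)": (84.5, 74.9),
    "2WikiMultihop": (83.2, 79.5),
    "MuSiQue-Ans": (78.0, 49.8),
}

# Gap in F1 points, rounded to one decimal to absorb floating-point noise.
gaps = {name: round(human - machine, 1) for name, (human, machine) in scores.items()}

# How much larger the MuSiQue-Ans gap is than the HotpotQA gap.
ratio = gaps["MuSiQue-Ans"] / gaps["HotpotQA (20K)"]

for name, gap in gaps.items():
    print(f"{name}: {gap} F1 points")
print(f"MuSiQue-Ans / HotpotQA gap ratio: {ratio:.2f}")
```

The gaps come out to 9.6, 3.7, and 28.2 F1 points respectively, so the MuSiQue-Ans gap is about 2.9 times the HotpotQA gap.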
5. Relation to Dataset Construction and Anti-Shortcut Mechanisms
The MuSiQue condition is realized through several innovations not commonly found in past QA benchmarks:
- Explicit composability criterion requiring bridge entities.
- Stringent disconnection filtering by adversarial training and model diagnostics.
- Anti-leakage train/test splits that block memorization.
- Contrastive unanswerable pairs that differ by minimal context changes while altering answerability.
Together, these mechanisms ensure that the only viable path to good performance is genuine, connected multihop reasoning: approaches based on direct answering, memorization, or surface-level cues fail to satisfy the MuSiQue criteria. This contrasts with prior benchmarks, which admit shortcut solutions by design or through data artifacts.
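The contrastive unanswerable pairs can be illustrated with a small sketch. It assumes one possible reading of the construction, namely dropping every paragraph that mentions the chosen subquestion's answer; the actual procedure may differ (for example, by substituting new distractors). The function names `make_unanswerable` and `contrast_pair` are hypothetical.

```python
from typing import Dict, List, Union

Instance = Dict[str, Union[str, List[str], bool]]


def make_unanswerable(context: List[str], sub_answer: str) -> List[str]:
    """Drop every paragraph that mentions `sub_answer` (case-insensitive)."""
    needle = sub_answer.lower()
    return [p for p in context if needle not in p.lower()]


def contrast_pair(question: str, context: List[str], sub_answer: str) -> List[Instance]:
    """Return the answerable instance and its unanswerable counterpart."""
    return [
        {"question": question, "context": context, "answerable": True},
        {
            "question": question,
            "context": make_unanswerable(context, sub_answer),
            "answerable": False,
        },
    ]


# Illustrative data only.
paragraphs = [
    "Dune was written by Frank Herbert.",
    "Frank Herbert was born in Tacoma.",
    "Dune is a 1965 science fiction novel.",
]
pair = contrast_pair("Where was the author of Dune born?", paragraphs, "Frank Herbert")
```

Here the two instances share the question and differ only in context: the unanswerable variant keeps just the paragraph that never mentions the bridge entity, so the reasoning chain cannot be completed.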
6. Significance and Influence
The introduction of the MuSiQue condition and its enforcement pipeline marks a shift in multihop QA evaluation toward strictly compositional and contextually entangled questions. By making shortcut approaches fundamentally incompatible with dataset structure and targets, MuSiQue benchmarks advance the rigor of multihop QA research and provide a new standard for “cheat-resistant” task construction.
A plausible implication is that techniques developed to meet the MuSiQue condition may also bolster robustness, interpretability, and truly compositional reasoning in emerging QA systems, informing future benchmark and system design (Trivedi et al., 2021).