
MuSiQue Condition in Multihop QA

Updated 17 January 2026
  • MuSiQue condition is a formal set of constraints that ensures genuine multihop reasoning by requiring interdependent subquestions and context.
  • The eight-stage construction pipeline employs techniques like disconnection filtering, adversarial testing, and crowdsourced recomposition to enforce the condition.
  • Empirical evaluations show significant performance gaps between models using shortcut strategies and those performing full connected reasoning, highlighting its robustness.

The MuSiQue condition is a formally defined set of constraints that ensures genuine, connected multihop reasoning in question answering (QA) benchmarks. Introduced in the context of the MuSiQue dataset, the condition enforces that the answer to a multihop question cannot be determined unless a QA system executes all required reasoning steps, thereby excluding shortcuts exploitable by disconnected or single-hop models (Trivedi et al., 2021).

1. Formal Definition and Motivation

The MuSiQue condition addresses a crucial limitation in prior multihop QA benchmarks, where systems could often bypass genuine reasoning by leveraging dataset-specific shortcuts. Formally, multihop QA in MuSiQue is represented as a directed acyclic graph (DAG) $G_Q$ with nodes $\{q_1, \dots, q_n\}$ as single-hop questions (with respective answers $a_i$) and edges $(q_j \to q_i)$ indicating that $q_i$ depends critically on knowing $a_j$.

The MuSiQue condition is expressed as:

$$\forall\,(q_j \to q_i):\; M(\#_i^j; C) \neq a_i \quad\wedge\quad \forall\,q_i:\; M(q_i; \varnothing) \neq a_i$$

Here, $M$ is a strong QA model; $\#_i^j$ is $q_i$ with all surface mentions of $a_j$ masked; $C$ denotes the full set of context paragraphs. The condition requires that:

  • The model $M$ cannot answer any subquestion $q_i$ when its predecessor’s answer $a_j$ is masked from $q_i$, even with the full context $C$ available.
  • No subquestion $q_i$ can be answered from its wording alone, in the absence of context.

This conjunction ensures that the subquestions are interdependent and contextually grounded, preventing models from exploiting disconnected “shortcut” reasoning.
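
The two clauses of the condition can be checked mechanically for a given question graph. The following is a minimal sketch, not the authors' implementation: the `Model` callable, the `[MASK]` token, the use of exact string match in place of an F1 threshold, and the toy questions are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model interface: (question, context paragraphs) -> predicted answer.
Model = Callable[[str, List[str]], str]

MASK = "[MASK]"  # assumed mask token, not taken from the source


def mask_answer(question: str, answer: str) -> str:
    """Replace every surface mention of `answer` in `question` with the mask token."""
    return question.replace(answer, MASK)


def satisfies_condition(
    questions: Dict[str, str],
    answers: Dict[str, str],
    edges: List[Tuple[str, str]],  # (j, i) means q_j -> q_i
    context: List[str],
    model: Model,
) -> bool:
    # Clause 1: for every edge (q_j -> q_i), M must fail on q_i with a_j masked,
    # even though the full context C is supplied.
    for j, i in edges:
        if model(mask_answer(questions[i], answers[j]), context) == answers[i]:
            return False
    # Clause 2: M must fail on every subquestion when given no context at all.
    for i, question in questions.items():
        if model(question, []) == answers[i]:
            return False
    return True


# Toy data (hypothetical).
questions = {"q1": "Who founded Acme?", "q2": "Where was Jane Roe born?"}
answers = {"q1": "Jane Roe", "q2": "Springfield"}
edges = [("q1", "q2")]
context = ["Acme was founded by Jane Roe.", "Jane Roe was born in Springfield."]


def connected_model(question: str, paragraphs: List[str]) -> str:
    """Answers only unmasked questions, and only when some context is given."""
    table = {"Who founded Acme?": "Jane Roe", "Where was Jane Roe born?": "Springfield"}
    return table.get(question, "") if paragraphs else ""


def shortcut_model(question: str, paragraphs: List[str]) -> str:
    """Always guesses the final answer, regardless of question or context."""
    return "Springfield"


ok_connected = satisfies_condition(questions, answers, edges, context, connected_model)
ok_shortcut = satisfies_condition(questions, answers, edges, context, shortcut_model)
```

In this toy setup the graph passes the check against `connected_model` and fails against `shortcut_model`, because the latter recovers $a_2$ even when $a_1$ is masked.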

2. Construction Pipeline Enforcing the MuSiQue Condition

The MuSiQue condition is rigorously enforced through an eight-stage pipeline:

  1. Curating Valid Single-Hop QA: Over 500,000 $(q, p, a)$ triples sourced from SQuAD, Natural Questions, MLQA, T-REx, and Zero-Shot RE are filtered to retain only high-quality, non-trivial questions whose answers can be recovered (F1 > 0) by at least one of five large pretrained QA models.
  2. Enumerating Composable Pairs: Pairs $(q_1, p_1, a_1)$ and $(q_2, p_2, a_2)$ are composable if $a_1$ is a named entity present in $q_2$, $a_2$ does not occur in $q_1$, and $p_1 \neq p_2$; entity matches are cross-checked using spaCy entity types, Wikipedia search, and Wikification models.
  3. Disconnection Filtering: Longformer-based QA models are trained to detect and discard subquestions answerable without predecessors’ answers. A head $q_1$ is rejected if the model's F1 on $q_1$ alone is $\ge 0.5$; a tail $q_2$ is rejected if $a_2$ is recovered with answer F1 $\ge 0.75$ and support F1 $\ge 1.0$ even after masking and with distractors.
  4. Building Larger DAGs: Filtered pairs are composed into DAGs of up to four hops, with bounds on question length and token counts and limits on the reuse of components to maintain diversity.
  5. Minimizing Train/Test Leakage: Multihop questions overlap if they share any subquestion, answer, or support paragraph; train/dev/test splits are greedily constructed to minimize such overlap and prevent trivial generalization via memorization.
  6. Constructing Contexts: Each $k$-hop question is paired with $k$ gold paragraphs and $20-k$ distractors retrieved from the retained single-hop contexts, using masked queries, to increase the difficulty and reduce domain shift exploitation.
  7. Crowdsourced Recomposition: Annotators compose natural-language multihop questions corresponding to each DAG, ensuring bridge entities are co-referential and all reasoning hops are invoked implicitly and necessarily.
  8. Unanswerable Contrast Pairs: For MuSiQue-Full, each answerable question is paired with an unanswerable variant by removing a subquestion’s answer string from all paragraphs, thus requiring models to discern answerability as part of the main task.
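
Stages 2 and 3 of the pipeline above reduce to a few concrete predicates. The sketch below is illustrative only: the `SingleHop` record, the function names, and the plain case-sensitive substring matching are assumptions. The named-entity cross-checks and the Longformer models that would produce the F1 scores are omitted, and only the thresholds stated in the list are encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SingleHop:
    """Hypothetical container for one (q, p, a) triple."""
    question: str
    paragraph: str
    answer: str


def is_composable(head: SingleHop, tail: SingleHop) -> bool:
    """Stage 2: a1 appears in q2, a2 does not appear in q1, and p1 != p2.

    The named-entity requirement on a1 is not checked here.
    """
    return (
        head.answer in tail.question
        and tail.answer not in head.question
        and head.paragraph != tail.paragraph
    )


def reject_head(f1_alone: float) -> bool:
    """Stage 3: discard a head q1 if the filter model's F1 on q1 alone is >= 0.5."""
    return f1_alone >= 0.5


def reject_tail(answer_f1: float, support_f1: float) -> bool:
    """Stage 3: discard a tail q2 if, after masking and with distractors,
    answer F1 >= 0.75 and support F1 >= 1.0."""
    return answer_f1 >= 0.75 and support_f1 >= 1.0


# Toy triples (hypothetical).
head = SingleHop("Who founded Acme?", "Acme was founded by Jane Roe.", "Jane Roe")
tail = SingleHop("Where was Jane Roe born?", "Jane Roe was born in Springfield.", "Springfield")

forward_ok = is_composable(head, tail)
reverse_ok = is_composable(tail, head)
```

Here `forward_ok` holds because the head's answer is the bridge entity in the tail question, while the reversed pairing is rejected.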

Each MuSiQue-Ans instance by construction satisfies the MuSiQue condition for strong trained models.
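
The overlap test from stage 5 and the context size from stage 6 can likewise be written down directly. This is a sketch under assumptions: the `MultihopQuestion` record and both function names are hypothetical, and the greedy split construction itself is not shown.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class MultihopQuestion:
    """Hypothetical record of the components a multihop question is built from."""
    subquestions: FrozenSet[str]
    answers: FrozenSet[str]
    paragraphs: FrozenSet[str]  # gold support paragraphs


def overlaps(a: MultihopQuestion, b: MultihopQuestion) -> bool:
    """Stage 5: two questions overlap if they share any subquestion, answer, or paragraph."""
    return bool(
        a.subquestions & b.subquestions
        or a.answers & b.answers
        or a.paragraphs & b.paragraphs
    )


def num_distractors(k: int, total: int = 20) -> int:
    """Stage 6: a k-hop question gets k gold paragraphs and 20 - k distractors."""
    if not 0 < k <= total:
        raise ValueError("k must be between 1 and the total context size")
    return total - k


# Toy questions (hypothetical identifiers).
x = MultihopQuestion(frozenset({"s1", "s2"}), frozenset({"Jane Roe"}), frozenset({"p1", "p2"}))
y = MultihopQuestion(frozenset({"s3", "s4"}), frozenset({"John Doe"}), frozenset({"p2", "p5"}))
z = MultihopQuestion(frozenset({"s5"}), frozenset({"Paris"}), frozenset({"p9"}))

xy_overlap = overlaps(x, y)  # they share paragraph "p2"
xz_overlap = overlaps(x, z)  # nothing shared
```

A split builder would place `x` and `y` in the same partition where possible, since sharing even one support paragraph counts as overlap under this definition.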

3. Theoretical Implications and Connected Reasoning

The central aim of the MuSiQue condition is the enforcement of connected reasoning. Only by executing every inferential hop in the DAG $G_Q$ can a model arrive at the correct final answer. Masking strategies and context requirements are explicitly designed to block disconnected, purely lexical, or memorization-based approaches. If a single predecessor’s information is masked from a subquestion, a compliant model cannot recover the subanswer. Similarly, no subquestion can be answered without the corresponding supporting context.

This formalism constrains possible system behaviors, leading to tasks that probe decomposition, bridge-entity resolution, and multi-step alignment, which are key desiderata for robust multihop QA evaluation.
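
The hop-by-hop execution described in this section can be pictured as answering subquestions in topological order and substituting each answer into its successors. The sketch below uses Python's standard `graphlib`; the `#<node>` placeholder convention, the `answer_dag` helper, and the lookup-table answerer are assumptions for illustration rather than part of the source.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set


def answer_dag(
    questions: Dict[str, str],
    predecessors: Dict[str, Set[str]],
    answer_one: Callable[[str], str],
) -> Dict[str, str]:
    """Answer every node of the question DAG, predecessors first.

    `questions[node]` may contain placeholders of the form '#<pred>' that are
    replaced by the predecessor's answer before the node is answered.
    """
    answers: Dict[str, str] = {}
    for node in TopologicalSorter(predecessors).static_order():
        question = questions[node]
        for pred in predecessors.get(node, set()):
            question = question.replace(f"#{pred}", answers[pred])
        answers[node] = answer_one(question)
    return answers


# Toy two-hop example (hypothetical).
questions = {"1": "Who founded Acme?", "2": "Where was #1 born?"}
predecessors = {"1": set(), "2": {"1"}}
table = {"Who founded Acme?": "Jane Roe", "Where was Jane Roe born?": "Springfield"}

result = answer_dag(questions, predecessors, lambda q: table.get(q, ""))
```

Node "2" can only be resolved after node "1", because its question text is incomplete until the bridge entity is filled in.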

4. Empirical Evaluation and Benchmark Difficulty

The empirical findings demonstrate that datasets satisfying the MuSiQue condition are substantially more challenging than prior multihop benchmarks:

  • A single-paragraph baseline achieves ~65 F1 on HotpotQA (20K) but only ~32 F1 on MuSiQue-Ans.
  • A question-only model obtains 19 F1 on HotpotQA, 27 F1 on 2WikiMultihopQA, and only ~5 F1 on MuSiQue-Ans, reflecting strong contextualization requirements.
  • The DiRe “cheatability” metric is ~69 F1 on HotpotQA but drops to ~38 F1 on MuSiQue-Ans, indicating that shortcutting the composition process is far less effective.
  • The human–machine gap widens to three times the size found in comparable datasets: 78.0 vs. 49.8 F1 on MuSiQue, compared to 84.5 vs. 74.9 F1 on HotpotQA.
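
The "three times" figure in the last bullet follows from the reported scores. A short check, using only the numbers quoted above (the variable names are illustrative):

```python
def gap(human_f1: float, machine_f1: float) -> float:
    """Human-machine gap in F1 points."""
    return human_f1 - machine_f1


musique_gap = gap(78.0, 49.8)   # about 28.2 points
hotpot_gap = gap(84.5, 74.9)    # about 9.6 points
ratio = musique_gap / hotpot_gap  # a little under 3
```

The MuSiQue gap of about 28.2 points is roughly 2.9 times the HotpotQA gap of about 9.6 points.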

For MuSiQue-Full, requiring simultaneous answerability judgment and answer extraction further reduces F1 by 14–17 points (answer+selective-support) and 33–44 points (support+answerability). “Cheating” models collapse near zero.

A summary table from the comparison:

| Dataset | Single-Para F1 | Question-Only F1 | DiRe F1 | Human vs. Machine F1 |
|---|---|---|---|---|
| HotpotQA (20K) | ~65 | 19 | ~69 | 84.5 vs. 74.9 |
| 2WikiMultihop | - | 27 | - | 83.2 vs. 79.5 |
| MuSiQue-Ans | ~32 | 5 | ~38 | 78.0 vs. 49.8 |

5. Relation to Dataset Construction and Anti-Shortcut Mechanisms

The MuSiQue condition is realized through several innovations not commonly found in past QA benchmarks:

  • Explicit composability criterion requiring bridge entities.
  • Stringent disconnection filtering by adversarial training and model diagnostics.
  • Anti-leakage train/test splits that block memorization.
  • Contrastive unanswerable pairs that differ by minimal context changes while altering answerability.

These mechanisms together ensure that the only viable path to good performance is genuine, connected multihop reasoning, because direct-answer, memorization-based, and purely surface-level approaches all fail under the MuSiQue criteria. This contrasts with prior benchmarks, which admit shortcut solutions by design or through data artifacts.

6. Significance and Influence

The introduction of the MuSiQue condition and its enforcement pipeline marks a shift in multihop QA evaluation toward strictly compositional and contextually entangled questions. By making shortcut approaches fundamentally incompatible with dataset structure and targets, MuSiQue benchmarks advance the rigor of multihop QA research and provide a new standard for “cheat-resistant” task construction.

A plausible implication is that techniques developed to meet the MuSiQue condition may also bolster robustness, interpretability, and truly compositional reasoning in emerging QA systems, informing future benchmark and system design (Trivedi et al., 2021).
