
MuSiQue: Multihop Questions via Single-hop Question Composition (2108.00573v3)

Published 2 Aug 2021 in cs.CL and cs.AI

Abstract: Multihop reasoning remains an elusive goal as existing multihop benchmarks are known to be largely solvable via shortcuts. Can we create a question answering (QA) dataset that, by construction, requires proper multihop reasoning? To this end, we introduce a bottom-up approach that systematically selects composable pairs of single-hop questions that are connected, i.e., where one reasoning step critically relies on information from another. This bottom-up methodology lets us explore a vast space of questions and add stringent filters as well as other mechanisms targeting connected reasoning. It provides fine-grained control over the construction process and the properties of the resulting k-hop questions. We use this methodology to create MuSiQue-Ans, a new multihop QA dataset with 25K 2-4 hop questions. Relative to existing datasets, MuSiQue-Ans is more difficult overall (3x increase in human-machine gap), and harder to cheat via disconnected reasoning (e.g., a single-hop model has a 30 point drop in F1). We further add unanswerable contrast questions to produce a more stringent dataset, MuSiQue-Full. We hope our datasets will help the NLP community develop models that perform genuine multihop reasoning.

Overview of MuSiQue: Multihop Questions via Single-hop Question Composition

The paper "MuSiQue: Multihop Questions via Single-hop Question Composition" introduces a bottom-up method for building question answering (QA) datasets that require genuine multihop reasoning. The resulting dataset, MuSiQue-Ans, contains 25K questions of 2-4 hops, each composed from connected single-hop questions so that one reasoning step relies on information from another.

Problem Statement

Existing multihop QA benchmarks are known to be largely solvable via shortcuts, so a model can score well without performing multihop reasoning. MuSiQue addresses this by systematically constructing questions whose reasoning steps depend on one another, which makes such shortcuts much harder to exploit.

Methodology

The authors propose a bottom-up strategy to create multihop questions through the composition of single-hop questions. This involves:

  1. Composable Pair Identification: Single-hop questions are paired when one question relies on information from the other, typically through a shared entity. Chaining such pairs yields multihop questions whose reasoning structure is a DAG (Directed Acyclic Graph).
  2. Ensuring Connected Reasoning: A filtering process ensures that each compositional link between questions cannot be bypassed, adhering to a condition termed the MuSiQue condition.
  3. Dataset Construction Pipeline:
    • Filtering of single-hop questions based on various criteria.
    • Composition of these into multihop questions with 2-4 hops.
    • Reduction of train-test leakage to prevent models from simply memorizing answers.
    • Addition of distractor contexts to make evaluation more challenging.
    • Human-aided refinement and validation via a crowdsourced approach.
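
The composition and filtering steps above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `SingleHop` record, the substring test for a shared entity, and the `answer_without_bridge` callable (a stand-in for whatever model the filter would query) are all assumptions made for illustration, and the sketch covers only the 2-hop case.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, List, Optional


@dataclass(frozen=True)
class SingleHop:
    """Hypothetical record for one single-hop question and its answer."""
    question: str
    answer: str


def compose_pair(first: SingleHop, second: SingleHop) -> Optional[dict]:
    """Build a 2-hop candidate when `first`'s answer is mentioned in `second`'s question."""
    if not first.answer or first.answer == second.answer:
        return None
    if first.answer not in second.question:
        return None
    # Replace the bridge entity with a reference to the first hop.
    template = second.question.replace(
        first.answer, f"[answer to: {first.question}]"
    )
    return {"hops": (first, second), "template": template, "answer": second.answer}


def composable_pairs(pool: List[SingleHop]) -> List[dict]:
    """Try every ordered pair in the pool and keep the ones that compose."""
    candidates = []
    for first, second in permutations(pool, 2):
        candidate = compose_pair(first, second)
        if candidate is not None:
            candidates.append(candidate)
    return candidates


def is_connected(candidate: dict, answer_without_bridge: Callable[[str], str]) -> bool:
    """Crude connectedness check (an assumption, not the paper's exact condition).

    The bridge entity is masked in the second hop; if the supplied model still
    produces the gold answer, the first hop can be bypassed and the candidate
    is rejected.
    """
    first, second = candidate["hops"]
    masked_question = second.question.replace(first.answer, "[MASK]")
    return answer_without_bridge(masked_question) != second.answer


if __name__ == "__main__":
    pool = [
        SingleHop("Who wrote Hamlet?", "William Shakespeare"),
        SingleHop("Where was William Shakespeare born?", "Stratford-upon-Avon"),
        SingleHop("What is the capital of France?", "Paris"),
    ]
    for candidate in composable_pairs(pool):
        print(candidate["template"], "->", candidate["answer"])
```

In this sketch the composed question keeps a visible reference to the first hop. A real pipeline would also need the other steps listed above: filtering the single-hop questions, reducing train-test leakage, adding distractor contexts, and human validation.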

Results

MuSiQue presents notable improvements over existing datasets in several aspects:

  • Increased Difficulty: The gap between human and machine performance is larger than on existing datasets (a 3x increase, according to the abstract).
  • Reduced Cheatability: The dataset is harder to solve via disconnected reasoning, as evidenced by lower scores from partial-input models (e.g., a single-hop model has a 30 point drop in F1) and lower scores under DiRe (disconnected reasoning) probes.
  • Challenge Dataset: A more stringent variant, MuSiQue-Full, adds unanswerable contrast questions to further test the robustness of model reasoning.
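
The F1 figures reported for answer quality can be read with the following sketch of token-level answer F1. It assumes the common SQuAD-style convention (lowercasing, stripping punctuation and English articles, then scoring token overlap); the summary does not say which evaluation script the authors use, so the function names and normalization details here are illustrative.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between prediction and gold."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this convention, a prediction of "the city of Paris" against the gold answer "Paris" has precision 1/3 and recall 1, giving an F1 of 0.5. A dataset-level score is the mean of such per-question values, usually reported on a 0-100 scale.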

Implications and Future Directions

The MuSiQue dataset offers a more reliable testbed for multihop reasoning models. By limiting reasoning shortcuts and requiring connected reasoning, it raises the bar for evaluating multihop QA systems. Future work may extend similar construction methods to other settings, such as open-domain QA or multimodal datasets.

This paper invites further investigation into decomposition-based models and could foster the development of AI systems capable of tackling real-world, multifaceted problems through rigorous reasoning processes.

Authors (4)
  1. Harsh Trivedi (29 papers)
  2. Niranjan Balasubramanian (53 papers)
  3. Tushar Khot (53 papers)
  4. Ashish Sabharwal (84 papers)
Citations (166)