Evaluating LLM Reasoning: The MuSR Dataset
The research introduces MuSR, a dataset designed to evaluate LLMs on multistep soft reasoning tasks. The paper recognizes the limitations of existing benchmarks, which have not kept pace with the capabilities of modern LLMs such as GPT-4. By blending natural language narratives with complex, real-world reasoning tasks, the dataset provides a much-needed update for evaluating the reasoning abilities of LLMs, particularly in situations that demand both commonsense knowledge and rigorous logical deduction.
Dataset Construction
MuSR is generated using a distinctive neurosymbolic synthetic-to-natural generation algorithm that constructs scenarios, such as murder mysteries up to a thousand words long, capable of challenging even today's leading LLMs. The procedure builds reasoning trees whose leaf-level facts are woven into compelling narratives, so that solving an instance requires multistep commonsense inference to recover the conclusions left implicit. This structured approach addresses gaps in prior benchmarks, which were either solvable by rule-based systems or lacked the natural complexity MuSR offers.
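To make the reasoning-tree idea concrete, the sketch below shows one plausible way such a tree could be represented: intermediate deductions have child facts that entail them, and only the leaf facts are surfaced to the narrative writer. This is an illustrative sketch, not the authors' implementation; the class name ReasoningNode, the method leaf_facts, and the toy mystery content are all hypothetical.

```python
# Illustrative sketch only: these names are hypothetical and do not come
# from the MuSR codebase.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReasoningNode:
    """A deduction whose children are the facts that jointly entail it."""
    statement: str
    children: List["ReasoningNode"] = field(default_factory=list)

    def leaf_facts(self) -> List[str]:
        """Collect leaf-level facts; only these would be embedded in the
        narrative, leaving the intermediate deductions for the solver."""
        if not self.children:
            return [self.statement]
        facts: List[str] = []
        for child in self.children:
            facts.extend(child.leaf_facts())
        return facts


# Toy example mirroring a murder-mystery deduction.
root = ReasoningNode(
    "The gardener is the murderer.",
    children=[
        ReasoningNode(
            "The gardener had a motive.",
            children=[
                ReasoningNode("The victim was about to fire the gardener."),
                ReasoningNode("The gardener needs the job to pay off a debt."),
            ],
        ),
        ReasoningNode(
            "The gardener had the opportunity.",
            children=[
                ReasoningNode("The gardener was seen near the study at 9 pm."),
                ReasoningNode("The study door was left unlocked that evening."),
            ],
        ),
    ],
)

# Only the leaf facts would be handed to an LLM to weave into a long story;
# the intermediate conclusions stay implicit and must be inferred.
for fact in root.leaf_facts():
    print("-", fact)
```

In this framing, the symbolic tree guarantees that a valid deductive path to the answer exists, while the LLM-written narrative supplies the natural-language complexity.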
Strong Numerical Results
MuSR comprises 750 examples split across three domains: murder mysteries, object placement, and team assignment. The authors evaluated several models on these examples, including GPT-4, Llama 2, and Vicuna. Notably, while GPT-4 outperformed the other models, reaching 80.4% accuracy on murder mysteries, it still fell short of human participants, who scored between 88.2% and 100%. These results underscore the enduring gap between machine and human reasoning in complex narrative settings.
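As a rough illustration of how such accuracy figures are computed, the snippet below scores a model's answers against gold labels over a set of examples. The functions and field names here (ask_model, narrative, choices, answer) are placeholder assumptions for the sketch, not the dataset's or paper's actual API.

```python
# Hedged sketch of a scoring loop; the example schema and `ask_model`
# callable are hypothetical placeholders, not part of the MuSR release.
from typing import Callable, Dict, List


def accuracy(examples: List[Dict], ask_model: Callable[[str, List[str]], str]) -> float:
    """Fraction of examples where the model picks the gold answer choice."""
    correct = 0
    for ex in examples:
        prompt = (
            f"{ex['narrative']}\n\n"
            f"Question: {ex['question']}\n"
            f"Choices: {', '.join(ex['choices'])}"
        )
        prediction = ask_model(prompt, ex["choices"])
        correct += int(prediction == ex["answer"])
    return correct / len(examples)


if __name__ == "__main__":
    # Tiny fabricated example so the sketch runs end to end.
    toy = [{
        "narrative": "A short mystery narrative...",
        "question": "Who is the most likely murderer?",
        "choices": ["the gardener", "the butler"],
        "answer": "the gardener",
    }]
    always_first = lambda prompt, choices: choices[0]
    print(f"accuracy = {accuracy(toy, always_first):.1%}")  # 100.0% on this toy case
```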
Implications and Future Research
The theoretical and practical implications of this research are substantial. The complexity and realistic grounding of the MuSR dataset make it a robust benchmark for future LLM development. Practically, its design highlights the areas where LLMs need improvement, particularly in synthesizing information from narrative text and applying logical reasoning over extended chains of inference.
The framework proposed by the authors can also influence neurosymbolic approaches beyond LLM evaluation. By separating narrative creation from the underlying logical structure, MuSR allows test cases to evolve continuously, keeping the benchmark challenging as LLMs progress and countering the tendency of existing datasets to become obsolete as model capabilities improve.
Looking forward, the neurosymbolic generation method can be scaled in complexity to produce even more challenging reasoning tasks. Continued iterations on this dataset are likely to stimulate innovations both in LLM architectures and in the methodologies used for model training and evaluation. As AI systems develop, datasets like MuSR will be invaluable in measuring, and ultimately bridging, the divide between machine comprehension and human understanding.