Premise Order Matters in Reasoning with Large Language Models (2402.08939v3)

Published 14 Feb 2024 in cs.AI and cs.CL

Abstract: LLMs have accomplished remarkable reasoning performance in various domains. However, in the domain of reasoning tasks, we discover a frailty: LLMs are surprisingly brittle to the ordering of the premises, despite the fact that such ordering does not alter the underlying task. In particular, we observe that LLMs achieve the best performance when the premise order aligns with the context required in intermediate reasoning steps. For example, in deductive reasoning tasks, presenting the premises in the same order as the ground truth proof in the prompt (as opposed to random ordering) drastically increases the model's accuracy. We first examine the effect of premise ordering on deductive reasoning on a variety of LLMs, and our evaluation shows that permuting the premise order can cause a performance drop of over 30%. In addition, we release the benchmark R-GSM, based on GSM8K, to examine the ordering effect for mathematical problem-solving, and we again observe a significant drop in accuracy, relative to the original GSM8K benchmark.


Summary

  • The paper shows that LLM accuracy on deductive reasoning can drop by over 30% when premises are permuted away from the order in which they are used in the ground-truth proof.
  • Systematic experiments on GPT-4-turbo, GPT-3.5-turbo, PaLM 2-L, and Gemini Pro demonstrate this sensitivity in both deductive and mathematical reasoning tasks.
  • The results motivate future research on more robust LLMs whose reasoning does not degrade when premises are presented out of the order needed for the solution.

Impact of Premise Order on LLM Reasoning Performance

Introduction

The reasoning capabilities of LLMs have been extensively explored and celebrated for reaching, and in some cases surpassing, human-level performance on a variety of tasks. Despite these advances, the robustness of LLMs, particularly with respect to the ordering of premises in reasoning tasks, remains underexplored. This paper uncovers a notable vulnerability: LLM performance on deductive reasoning tasks is significantly influenced by the order in which premises are presented, even though this order does not affect the validity of the underlying logical task. Through systematic experiments on a range of state-of-the-art (SoTA) LLMs, including GPT-4-turbo, GPT-3.5-turbo, PaLM 2-L, and Gemini Pro, the study demonstrates that this premise ordering effect can cause an accuracy drop of more than 30%.

Empirical Evaluation

Logical Reasoning

The research shows a stark performance degradation on deductive reasoning tasks that stems solely from permuting the order of the premises. Specifically, LLMs excel when the premise order aligns with the sequence required by the intermediate reasoning steps, mirroring the structure of the ground-truth proof. To isolate and measure this effect, the study introduces a benchmark adapted from SimpleLogic, consisting of strictly formulated propositional logic problems. A key finding is that performance is highest when the premises are arranged in this forward order, indicating a preference for linear, step-by-step processing akin to human reasoning patterns.
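To make the ordering manipulation concrete, here is a minimal sketch (in Python, not the authors' code) of presenting the premises of a SimpleLogic-style problem in forward, backward, or random order relative to the ground-truth proof. The rule strings, the fact, and the order_premises helper are illustrative assumptions, not artifacts from the paper.

```python
import random

def order_premises(proof_ordered_rules, ordering="forward", seed=0):
    """Return the rules in the requested order relative to the ground-truth proof."""
    rules = list(proof_ordered_rules)
    if ordering == "forward":    # same order in which the proof applies the rules
        return rules
    if ordering == "backward":   # reversed proof order
        return rules[::-1]
    if ordering == "random":     # an arbitrary permutation
        random.Random(seed).shuffle(rules)
        return rules
    raise ValueError(f"unknown ordering: {ordering}")

# Hypothetical SimpleLogic-style problem: derive "alice is nice" from the rules.
proof_rules = [
    "If alice is kind, then alice is generous.",
    "If alice is generous, then alice is nice.",
]
facts = ["alice is kind."]

for ordering in ("forward", "backward", "random"):
    premises = facts + order_premises(proof_rules, ordering)
    prompt = " ".join(premises) + " Question: is alice nice?"
    print(f"[{ordering}] {prompt}")
```

An evaluation along these lines would send each prompt variant to the model with the same question and compare accuracy across orderings.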

Mathematical Reasoning

To extend the investigation of premise order effects beyond logical reasoning, the study constructs the R-GSM benchmark, a derivative of the GSM8K benchmark focusing on mathematical problem-solving. Similar to findings in logical reasoning, LLM performance suffers when problem descriptions are reordered, underscoring the impact of premise order on reasoning tasks that require step-by-step calculations.
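As an illustration of the kind of reordering R-GSM studies (not necessarily how the released benchmark was constructed, since reordered problems must remain well-posed), the hypothetical snippet below shuffles the descriptive sentences of a word problem while keeping the final question in place; the example problem and the reorder_problem helper are assumptions for demonstration.

```python
import random
import re

def reorder_problem(problem: str, seed: int = 0) -> str:
    """Shuffle the descriptive sentences of a word problem, keeping the
    final question sentence in its original position."""
    # Split into sentences, keeping the end-of-sentence punctuation attached.
    sentences = [s.strip() for s in re.findall(r"[^.?!]+[.?!]", problem)]
    *body, question = sentences
    random.Random(seed).shuffle(body)
    return " ".join(body + [question])

original = (
    "A store sold 12 apples in the morning. "
    "It sold twice as many apples in the afternoon. "
    "Each apple costs 2 dollars. "
    "How much money did the store make in total?"
)
print(reorder_problem(original))
```

In the reordered variant, the facts needed for the first calculation may no longer appear first, which is exactly the condition under which the paper reports degraded accuracy.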

Analysis

The research offers an in-depth analysis of several dimensions of the premise ordering effect, including the impact of distracting (irrelevant) premises, which orderings different LLMs prefer, and the magnitude of the performance decline across models. Error analysis reveals a common pattern: with less-preferred premise orders, LLMs are prone to fact hallucination and wrong refutation errors, where the model incorrectly concludes that the goal cannot be proven. This pattern suggests that LLMs tend to process premises sequentially and struggle with tasks that require non-linear reasoning.

Future Directions and Implications

This study illuminates a critical vulnerability in current LLM reasoning capabilities, emphasizing the need for more robust models that ensure consistent performance across varied premise orders. The results advocate for future research directions aimed at understanding the underlying causes of this limitation and developing strategies to mitigate its effects. Potential avenues could involve experimenting with different training paradigms, augmenting model architectures, or integrating specialized reasoning modules that better emulate complex human reasoning processes.

Conclusion

By methodically dissecting the impacts of premise order on LLM reasoning, this paper contributes a crucial perspective to the discussion on model robustness and reasoning capabilities. It presents a compelling case for the reconsideration of current approaches in developing and evaluating LLMs, pressing the need for models that can navigate the intricacies of logical reasoning with greater fidelity to the invariances inherent in these tasks.
