Premise Order Matters in Reasoning with Large Language Models (2402.08939v3)

Published 14 Feb 2024 in cs.AI and cs.CL

Abstract: LLMs have accomplished remarkable reasoning performance in various domains. However, in the domain of reasoning tasks, we discover a frailty: LLMs are surprisingly brittle to the ordering of the premises, despite the fact that such ordering does not alter the underlying task. In particular, we observe that LLMs achieve the best performance when the premise order aligns with the context required in intermediate reasoning steps. For example, in deductive reasoning tasks, presenting the premises in the same order as the ground truth proof in the prompt (as opposed to random ordering) drastically increases the model's accuracy. We first examine the effect of premise ordering on deductive reasoning on a variety of LLMs, and our evaluation shows that permuting the premise order can cause a performance drop of over 30%. In addition, we release the benchmark R-GSM, based on GSM8K, to examine the ordering effect for mathematical problem-solving, and we again observe a significant drop in accuracy, relative to the original GSM8K benchmark.


Summary

  • The paper shows that LLM accuracy on deductive reasoning can drop by over 30% when premises are permuted away from the order in which they are used in the ground-truth proof.
  • Systematic experiments on GPT-4-turbo, GPT-3.5-turbo, PaLM 2-L, and Gemini Pro demonstrate this sensitivity in both deductive and mathematical reasoning tasks.
  • The results motivate future research on more robust LLMs whose reasoning does not degrade when premises are presented out of the order needed for the solution.

Impact of Premise Order on LLM Reasoning Performance

Introduction

The reasoning capabilities of LLMs have been extensively explored and celebrated for reaching, and in some cases surpassing, human-level performance on a variety of tasks. Despite these advances, the robustness of LLMs, particularly with respect to the ordering of premises in reasoning tasks, remains underexplored. This paper uncovers a notable vulnerability: LLM performance on deductive reasoning tasks is significantly influenced by the order in which premises are presented, even though this order does not affect the validity of the underlying logical task. Through systematic experiments on a range of state-of-the-art (SoTA) LLMs, including GPT-4-turbo, GPT-3.5-turbo, PaLM 2-L, and Gemini Pro, the study demonstrates that this premise ordering effect can cause an accuracy drop of more than 30%.

Empirical Evaluation

Logical Reasoning

The research shows a stark performance degradation on deductive reasoning tasks that stems solely from permuting the order of the premises. Specifically, LLMs excel when the premise order aligns with the sequence required by the intermediate reasoning steps, mirroring the structure of the ground-truth proof. To isolate and measure this effect, the study introduces a benchmark adapted from SimpleLogic, consisting of strictly formulated propositional logic problems. A key finding is that performance is highest when the premises are arranged in this forward order, indicating a preference for linear, step-by-step processing akin to human reasoning patterns.
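To make the ordering manipulation concrete, here is a minimal sketch (in Python, not the authors' code) of presenting the premises of a SimpleLogic-style problem in forward, backward, or random order relative to the ground-truth proof. The rule strings, the fact, and the order_premises helper are illustrative assumptions, not artifacts from the paper.

```python
import random

def order_premises(proof_ordered_rules, ordering="forward", seed=0):
    """Return the rules in the requested order relative to the ground-truth proof."""
    rules = list(proof_ordered_rules)
    if ordering == "forward":    # same order in which the proof applies the rules
        return rules
    if ordering == "backward":   # reversed proof order
        return rules[::-1]
    if ordering == "random":     # an arbitrary permutation
        random.Random(seed).shuffle(rules)
        return rules
    raise ValueError(f"unknown ordering: {ordering}")

# Hypothetical SimpleLogic-style problem: derive "alice is nice" from the rules.
proof_rules = [
    "If alice is kind, then alice is generous.",
    "If alice is generous, then alice is nice.",
]
facts = ["alice is kind."]

for ordering in ("forward", "backward", "random"):
    premises = facts + order_premises(proof_rules, ordering)
    prompt = " ".join(premises) + " Question: is alice nice?"
    print(f"[{ordering}] {prompt}")
```

An evaluation along these lines would send each prompt variant to the model with the same question and compare accuracy across orderings.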

Mathematical Reasoning

To extend the investigation of premise order effects beyond logical reasoning, the study constructs the R-GSM benchmark, a derivative of the GSM8K benchmark focusing on mathematical problem-solving. Similar to findings in logical reasoning, LLM performance suffers when problem descriptions are reordered, underscoring the impact of premise order on reasoning tasks that require step-by-step calculations.
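As an illustration of the kind of reordering R-GSM studies (not necessarily how the released benchmark was constructed, since reordered problems must remain well-posed), the hypothetical snippet below shuffles the descriptive sentences of a word problem while keeping the final question in place; the example problem and the reorder_problem helper are assumptions for demonstration.

```python
import random
import re

def reorder_problem(problem: str, seed: int = 0) -> str:
    """Shuffle the descriptive sentences of a word problem, keeping the
    final question sentence in its original position."""
    # Split into sentences, keeping the end-of-sentence punctuation attached.
    sentences = [s.strip() for s in re.findall(r"[^.?!]+[.?!]", problem)]
    *body, question = sentences
    random.Random(seed).shuffle(body)
    return " ".join(body + [question])

original = (
    "A store sold 12 apples in the morning. "
    "It sold twice as many apples in the afternoon. "
    "Each apple costs 2 dollars. "
    "How much money did the store make in total?"
)
print(reorder_problem(original))
```

In the reordered variant, the facts needed for the first calculation may no longer appear first, which is exactly the condition under which the paper reports degraded accuracy.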

Analysis

The research offers an in-depth analysis of several dimensions of the premise ordering effect, including the impact of distracting (irrelevant) premises, which orderings different LLMs prefer, and the magnitude of the performance decline across models. Error analysis reveals a common pattern: with less-preferred premise orders, LLMs are prone to fact hallucination and wrong refutation errors, where the model incorrectly concludes that the goal cannot be proven. This pattern suggests that LLMs tend to process premises sequentially and struggle with tasks that require non-linear reasoning.

Future Directions and Implications

This study illuminates a critical vulnerability in current LLM reasoning capabilities, emphasizing the need for more robust models that ensure consistent performance across varied premise orders. The results advocate for future research directions aimed at understanding the underlying causes of this limitation and developing strategies to mitigate its effects. Potential avenues could involve experimenting with different training paradigms, augmenting model architectures, or integrating specialized reasoning modules that better emulate complex human reasoning processes.

Conclusion

By methodically dissecting the impacts of premise order on LLM reasoning, this paper contributes a crucial perspective to the discussion on model robustness and reasoning capabilities. It presents a compelling case for the reconsideration of current approaches in developing and evaluating LLMs, pressing the need for models that can navigate the intricacies of logical reasoning with greater fidelity to the invariances inherent in these tasks.
