Premise Order Matters in Reasoning with Large Language Models (2402.08939v3)
Abstract: LLMs have achieved remarkable reasoning performance across a variety of domains. However, we identify a surprising brittleness: LLMs are sensitive to the ordering of premises, even though such ordering does not alter the underlying task. In particular, we observe that LLMs achieve the best performance when the premise order aligns with the context required in intermediate reasoning steps. For example, in deductive reasoning tasks, presenting the premises in the prompt in the same order as the ground-truth proof (as opposed to a random ordering) drastically increases the model's accuracy. We first examine the effect of premise ordering on deductive reasoning across a variety of LLMs, and our evaluation shows that permuting the premise order can cause a performance drop of over 30%. In addition, we release R-GSM, a benchmark based on GSM8K, to examine the ordering effect in mathematical problem-solving, where we again observe a significant drop in accuracy relative to the original GSM8K benchmark.
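The brittleness the abstract describes can be made concrete with a minimal sketch: the same set of deductive premises yields logically equivalent prompts under every permutation, yet the paper reports that model accuracy varies sharply across them. The premises, names, and helper below are illustrative assumptions, not taken from the paper's actual benchmark.

```python
import itertools

# Hypothetical modus-ponens chain; the specific facts are invented for
# illustration and are not drawn from the paper's evaluation data.
premises = [
    "If Alice is wet, then Alice is cold.",
    "If Alice is outside, then Alice is wet.",
    "Alice is outside.",
]
question = "Is Alice cold?"

def build_prompt(order):
    """Assemble a prompt with the premises listed in the given order."""
    lines = [premises[i] for i in order]
    return "\n".join(lines + [f"Question: {question}"])

# "Forward" order mirrors the ground-truth proof: the grounding fact first,
# then each rule in the order it is applied during the derivation.
forward_prompt = build_prompt([2, 1, 0])

# All 3! = 6 permutations describe the identical reasoning task, yet the
# paper finds that evaluating an LLM across such permutations can change
# accuracy by over 30%.
all_prompts = {build_prompt(p) for p in itertools.permutations(range(3))}
print(len(all_prompts))  # 6 distinct prompts, one underlying task
```

In an actual experiment, each prompt in `all_prompts` would be sent to the model under study and accuracy aggregated per ordering; the code above only constructs the permuted prompts.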