Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning (2405.06680v4)
Abstract: Human cognition exhibits systematic compositionality, the algebraic ability to generate infinite novel combinations from finite learned components, which is the key to understanding and reasoning about complex logic. In this work, we investigate the compositionality of LLMs in mathematical reasoning. Specifically, we construct a new dataset, MathTrap, by introducing carefully designed logical traps into the problem descriptions of MATH and GSM8K. Since problems with logical flaws are quite rare in the real world, these represent "unseen" cases to LLMs. Solving them requires the models to systematically compose (1) the mathematical knowledge involved in the original problems with (2) knowledge related to the introduced traps. Our experiments show that while LLMs possess both components of the requisite knowledge, they do not spontaneously combine them to handle these novel cases. We explore several methods to mitigate this deficiency, such as natural language prompts, few-shot demonstrations, and fine-tuning. Additionally, we test the recently released OpenAI o1 model and find that human-like "slow thinking" helps improve the compositionality of LLMs. Overall, systematic compositionality remains an open challenge for LLMs.
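The trap construction and evaluation described above can be illustrated with a minimal sketch. Note that the example problems and the keyword-based check below are hypothetical illustrations under assumed conventions, not the paper's actual dataset items or grading procedure:

```python
# A MathTrap-style pair: an original MATH/GSM8K-like problem and a "trap"
# variant whose premise is logically impossible, so the question has no answer.

ORIGINAL = ("A right triangle has legs of length 3 and 4. "
            "What is the length of its hypotenuse?")

# Trap variant: sides 1, 2, and 5 violate the triangle inequality (1 + 2 < 5),
# so no such triangle exists and its area is undefined.
TRAP = ("A triangle has sides of length 1, 2, and 5. "
        "What is its area?")

def flags_trap(model_answer: str) -> bool:
    """Crude heuristic: does a response notice the flawed premise?"""
    cues = ("impossible", "cannot form", "no such triangle",
            "triangle inequality", "contradiction")
    text = model_answer.lower()
    return any(cue in text for cue in cues)

# A model that spontaneously composes both pieces of knowledge should
# recognize the flaw rather than compute an answer for the trap variant.
good = "Since 1 + 2 < 5, the triangle inequality fails: no such triangle exists."
bad = "Using Heron's formula, the area is about 1.5."

print(flags_trap(good))  # True: the flawed premise is recognized
print(flags_trap(bad))   # False: the model answered as if the problem were valid
```

In practice, grading whether a model "flags the trap" would need a more robust judge than keyword matching (e.g., a human or model-based evaluator), but the sketch captures the core contrast between answering the original problem and detecting the injected flaw.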