Large Language Models Struggle with Unreasonability in Math Problems
Abstract: LLMs have shown remarkable success on a wide range of math and reasoning benchmarks. However, we observe that they often struggle when faced with unreasonable math problems. Instead of recognizing these issues, models frequently proceed as if the problem were well-posed, producing incorrect answers or falling into overthinking and verbose self-correction. To systematically investigate this overlooked vulnerability, we propose the Unreasonable Math Problems (UMP) benchmark, designed to evaluate LLMs' ability to detect and respond to unreasonable math problem statements. Based on extensive experiments covering 19 LLMs, we find that even state-of-the-art general models like GPT-4o achieve a score of only 0.6 on UMP. While reasoning models such as DeepSeek-R1 demonstrate a higher sensitivity to unreasonable inputs, this often comes at the cost of generating overly long and meaningless responses that fail to converge. We further explore prompting and fine-tuning methods, which offer partial improvements but also introduce trade-offs, shedding light on both the potential and limitations of LLMs in this challenging setting.