ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning (2410.19056v1)
Abstract: Existing math datasets evaluate the reasoning abilities of LLMs using either the final answer or the intermediate reasoning steps derived from static examples. However, the former approach fails to surface a model's use of shortcuts and faulty reasoning, while the latter struggles to accommodate alternative solutions. In this work, we use symbolic programs as a means of automated evaluation: we test whether a model can consistently produce correct final answers across varied inputs to the program. We begin by extracting programs for popular math datasets (GSM8K and MATH) using GPT-4o. Executable programs that are verified against the original input-output pairs are found to encapsulate the reasoning required to solve the original text questions. We then prompt GPT-4o to generate new questions from alternative input-output pairs based on the extracted programs. We apply the resulting datasets to evaluate a collection of LLMs. In our experiments, we observe significant accuracy drops under the proposed evaluation compared with the original static examples, suggesting that math reasoning in state-of-the-art LLMs is fragile.
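The evaluation loop the abstract describes (extract a program, verify it against the original question's gold answer, then re-sample inputs to seed new questions) can be summarized in a short sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`verify_program`, `perturbed_pairs`, `solve`) and the multiplicative perturbation strategy are placeholders, and in the actual pipeline GPT-4o both extracts the program and rewrites the question text for each new input-output pair.

```python
# Minimal sketch of the verify-then-perturb loop described in the abstract.
# All names here (verify_program, perturbed_pairs, solve) are hypothetical
# placeholders, not the paper's API; the perturbation strategy is an assumption.

import random


def verify_program(program_src: str, original_inputs: dict, gold_answer) -> bool:
    """Execute an extracted program and check it reproduces the gold answer."""
    namespace: dict = {}
    exec(program_src, namespace)  # program is assumed to define solve(**inputs)
    return namespace["solve"](**original_inputs) == gold_answer


def perturbed_pairs(program_src: str, original_inputs: dict, n: int = 5):
    """Generate alternative input-output pairs by re-sampling numeric inputs."""
    namespace: dict = {}
    exec(program_src, namespace)
    solve = namespace["solve"]
    for _ in range(n):
        # Assumed perturbation strategy: scale each numeric input.
        new_inputs = {k: v * random.randint(2, 9) for k, v in original_inputs.items()}
        yield new_inputs, solve(**new_inputs)


# Example: a program that might be extracted for a simple GSM8K-style question.
program = """
def solve(apples, price):
    return apples * price
"""
assert verify_program(program, {"apples": 3, "price": 4}, 12)
for inputs, answer in perturbed_pairs(program, {"apples": 3, "price": 4}):
    # Each (inputs, answer) pair seeds a rephrased question for the model under test.
    print(inputs, answer)
```

Only programs that pass the verification step are used to seed new questions; that gating is what lets a final-answer check on the perturbed variants stand in for a check of the underlying reasoning.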