ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning (2410.19056v1)

Published 24 Oct 2024 in cs.AI

Abstract: Existing math datasets evaluate the reasoning abilities of LLMs by using either the final answer or intermediate reasoning steps derived from static examples. However, the former approach fails to surface a model's use of shortcuts and incorrect reasoning, while the latter poses challenges in accommodating alternative solutions. In this work, we use symbolic programs as a means of automatically evaluating whether a model can consistently produce correct final answers across various inputs to the program. We begin by extracting programs for popular math datasets (GSM8K and MATH) using GPT-4o. Executable programs verified against the original input-output pairs are found to encapsulate the proper reasoning required to solve the original text questions. We then prompt GPT-4o to generate new questions using alternative input-output pairs based on the extracted programs. We apply the resulting datasets to evaluate a collection of LLMs. In our experiments, we observe significant accuracy drops under our proposed evaluation compared with the original static examples, suggesting the fragility of math reasoning in state-of-the-art LLMs.


Summary

  • The paper introduces extractable symbolic programs to assess LLM mathematical reasoning, bypassing limitations of static evaluations.
  • It generates new test cases from GSM8K and MATH datasets, revealing substantial accuracy drops in advanced models like GPT-4-turbo.
  • The evaluation framework exposes models’ reliance on superficial cues, prompting a call for deeper, semantically robust reasoning in AI systems.

An Evaluation Framework for Mathematical Reasoning in LLMs Using Symbolic Programs

The paper "ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning" presents a structured approach to assessing the mathematical reasoning abilities of LLMs, using symbolic programs. This paper focuses on addressing limitations in existing evaluation methods that either rely solely on final answer accuracy or struggle to accommodate diverse solution paths. By automatically extracting symbolic, executable programs from traditional math datasets, the authors propose a new method for evaluating whether LLMs can consistently solve mathematical problems across varied inputs derived from these programs.

Overview of Methodology

The authors introduce an approach aimed at bypassing the limitations of static dataset evaluations. Two primary datasets, GSM8K and MATH, serve as the basis for extracting symbolic programs using a state-of-the-art model, GPT-4o. These symbolic programs encapsulate the steps necessary to solve the mathematical problems presented, allowing new, diverse test cases to be created by varying the inputs fed into the programs.
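
To make the extraction step concrete, below is a minimal sketch of what an extracted symbolic program for a GSM8K-style word problem might look like. The example question, function name, and parameters are illustrative assumptions, not taken from the paper.

```python
# Hypothetical example of a symbolic program of the kind ReasonAgain extracts,
# for a GSM8K-style question such as: "A baker makes 3 batches of 12 muffins
# and sells 20 of them. How many muffins are left?"
def muffins_left(batches: int, muffins_per_batch: int, sold: int) -> int:
    # The reasoning steps are encoded as executable arithmetic over the inputs.
    total = batches * muffins_per_batch
    return total - sold

# The program is first verified against the original input-output pair.
assert muffins_left(3, 12, 20) == 16

# New test cases then come from simply varying the inputs.
perturbed_inputs = [(5, 10, 23), (4, 8, 7)]
expected_answers = [muffins_left(*args) for args in perturbed_inputs]
print(expected_answers)  # [27, 25]
```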

The evaluation leverages these test cases to examine the robustness of LLMs' reasoning capabilities. In essence, if a model genuinely understands the reasoning process necessary for a given problem, it should perform consistently well on both original and perturbed questions generated through these symbolic programs.
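
A minimal sketch of that consistency check is shown below. Here `program` stands for an extracted symbolic program, while `rephrase_question` and `query_llm` are hypothetical placeholders for the question-regeneration step and the model under evaluation; they are assumptions for illustration, not the paper's exact interface.

```python
def evaluate_consistency(program, perturbed_inputs, template,
                         rephrase_question, query_llm):
    """Score a model across perturbed variants of one question.

    `program` is an extracted symbolic program; `rephrase_question` and
    `query_llm` are placeholder callables standing in for the LLM calls.
    """
    correct = 0
    for args in perturbed_inputs:
        gold = program(*args)                         # ground truth from the program
        question = rephrase_question(template, args)  # new natural-language question
        prediction = query_llm(question)              # answer from the model under test
        correct += int(prediction == gold)
    return correct / len(perturbed_inputs)
```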

Key Findings and Results

Upon evaluating a collection of LLMs, the paper reports significant accuracy declines when models are subjected to the proposed evaluation framework compared with traditional static examples. Notably, even advanced models such as GPT-4-turbo exhibit substantial performance drops, correctly answering fewer than half of the newly generated test cases in a number of scenarios. In particular, models frequently fail on perturbed variants of questions they originally answered correctly, highlighting vulnerabilities in their reasoning processes.

The authors provide detailed accuracy metrics, distinguishing between initial performance on static datasets and the new accuracy observed after perturbing input-output pairs. These results serve to illustrate the models' reliance on superficial cues or shortcuts rather than in-depth reasoning, as evidenced by the significant decrease in performance in most cases.
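
One way such a comparison could be tabulated is sketched below; the result structure and the per-question averaging are assumptions chosen for illustration rather than the paper's exact metric.

```python
def accuracy_report(results):
    """Contrast static accuracy with accuracy on perturbed variants.

    `results` maps question ids to records like
    {"original_correct": bool, "perturbed_correct": [bool, ...]}.
    This structure is hypothetical, chosen only to illustrate the comparison.
    """
    n = len(results)
    static_acc = sum(r["original_correct"] for r in results.values()) / n
    perturbed_acc = sum(
        sum(r["perturbed_correct"]) / len(r["perturbed_correct"])
        for r in results.values()
    ) / n
    return {"static_accuracy": static_acc, "perturbed_accuracy": perturbed_acc}
```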

Implications and Future Directions

This paper's approach to evaluation has potential implications for both the development and assessment of LLMs within the sphere of mathematical reasoning. By demonstrating that current models lack robustness in reasoning tasks, the paper suggests that existing evaluation metrics may overstate the capabilities of these sophisticated models.

Looking ahead, this research paves the way for advancements in both AI model training and evaluation. It underscores the need for LLMs to achieve a deeper understanding and consistent application of reasoning processes across diverse scenarios. The evaluation method could guide future model improvements, encouraging researchers to pursue methodologies that prioritize interpretability and robust, semantically grounded reasoning over performance on fixed benchmarks.

Conclusion

In conclusion, "ReasonAgain" contributes a novel perspective to LLM evaluation methodologies by introducing symbolic programs as a means to assess mathematical reasoning. This work not only highlights critical weaknesses in current models but also offers a structured path toward overcoming these limitations with innovative evaluation techniques. Future research could extend this framework to other domains and model types, contributing to sustained advancement in artificial intelligence reasoning capabilities.
