ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning (2410.19056v1)

Published 24 Oct 2024 in cs.AI

Abstract: Existing math datasets evaluate the reasoning abilities of LLMs by using either the final answer or intermediate reasoning steps derived from static examples. However, the former approach fails to surface a model's use of shortcuts and incorrect reasoning, while the latter poses challenges in accommodating alternative solutions. In this work, we use symbolic programs as a means of automatically evaluating whether a model can consistently produce correct final answers across various inputs to the program. We begin by extracting programs for popular math datasets (GSM8K and MATH) using GPT-4o. Executable programs verified against the original input-output pairs are found to encapsulate the proper reasoning required to solve the original text questions. We then prompt GPT-4o to generate new questions using alternative input-output pairs based on the extracted programs. We apply the resulting datasets to evaluate a collection of LLMs. In our experiments, we observe significant accuracy drops under our proposed evaluation compared with the original static examples, suggesting the fragility of math reasoning in state-of-the-art LLMs.


Summary

  • The paper introduces extractable symbolic programs to assess LLM mathematical reasoning, bypassing limitations of static evaluations.
  • It generates new test cases from GSM8K and MATH datasets, revealing substantial accuracy drops in advanced models like GPT-4-turbo.
  • The evaluation framework exposes models’ reliance on superficial cues, prompting a call for deeper, semantically robust reasoning in AI systems.

An Evaluation Framework for Mathematical Reasoning in LLMs Using Symbolic Programs

The paper "ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning" presents a structured approach to assessing the mathematical reasoning abilities of LLMs, using symbolic programs. This paper focuses on addressing limitations in existing evaluation methods that either rely solely on final answer accuracy or struggle to accommodate diverse solution paths. By automatically extracting symbolic, executable programs from traditional math datasets, the authors propose a new method for evaluating whether LLMs can consistently solve mathematical problems across varied inputs derived from these programs.

Overview of Methodology

The authors introduce an approach aimed at bypassing the limitations of static dataset evaluations. Two primary datasets, GSM8K and MATH, serve as the basis for extracting symbolic programs using a state-of-the-art model, GPT-4o. These symbolic programs encapsulate the steps necessary to solve the mathematical problems presented, allowing new, diverse test cases to be created by varying the inputs fed into the programs.
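
To make the extraction step concrete, below is a minimal sketch of what an extracted symbolic program for a GSM8K-style word problem might look like. The example question, function name, and parameters are illustrative assumptions, not taken from the paper.

```python
# Hypothetical example of a symbolic program of the kind ReasonAgain extracts,
# for a GSM8K-style question such as: "A baker makes 3 batches of 12 muffins
# and sells 20 of them. How many muffins are left?"
def muffins_left(batches: int, muffins_per_batch: int, sold: int) -> int:
    # The reasoning steps are encoded as executable arithmetic over the inputs.
    total = batches * muffins_per_batch
    return total - sold

# The program is first verified against the original input-output pair.
assert muffins_left(3, 12, 20) == 16

# New test cases then come from simply varying the inputs.
perturbed_inputs = [(5, 10, 23), (4, 8, 7)]
expected_answers = [muffins_left(*args) for args in perturbed_inputs]
print(expected_answers)  # [27, 25]
```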

The evaluation leverages these test cases to examine the robustness of LLMs' reasoning capabilities. In essence, if a model genuinely understands the reasoning process necessary for a given problem, it should perform consistently well on both original and perturbed questions generated through these symbolic programs.
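
A minimal sketch of that consistency check is shown below. Here `program` stands for an extracted symbolic program, while `rephrase_question` and `query_llm` are hypothetical placeholders for the question-regeneration step and the model under evaluation; they are assumptions for illustration, not the paper's exact interface.

```python
def evaluate_consistency(program, perturbed_inputs, template,
                         rephrase_question, query_llm):
    """Score a model across perturbed variants of one question.

    `program` is an extracted symbolic program; `rephrase_question` and
    `query_llm` are placeholder callables standing in for the LLM calls.
    """
    correct = 0
    for args in perturbed_inputs:
        gold = program(*args)                         # ground truth from the program
        question = rephrase_question(template, args)  # new natural-language question
        prediction = query_llm(question)              # answer from the model under test
        correct += int(prediction == gold)
    return correct / len(perturbed_inputs)
```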

Key Findings and Results

Upon evaluating a collection of LLMs, the paper reports significant accuracy declines when models are subjected to the proposed evaluation framework compared with traditional static examples. Notably, even advanced models such as GPT-4-turbo exhibit substantial performance drops, correctly answering fewer than half of the newly generated test cases in a number of scenarios. In particular, models frequently fail on perturbed variants of questions they originally answered correctly, highlighting vulnerabilities in their reasoning processes.

The authors provide detailed accuracy metrics, distinguishing between initial performance on static datasets and the new accuracy observed after perturbing input-output pairs. These results serve to illustrate the models' reliance on superficial cues or shortcuts rather than in-depth reasoning, as evidenced by the significant decrease in performance in most cases.
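
One way such a comparison could be tabulated is sketched below; the result structure and the per-question averaging are assumptions chosen for illustration rather than the paper's exact metric.

```python
def accuracy_report(results):
    """Contrast static accuracy with accuracy on perturbed variants.

    `results` maps question ids to records like
    {"original_correct": bool, "perturbed_correct": [bool, ...]}.
    This structure is hypothetical, chosen only to illustrate the comparison.
    """
    n = len(results)
    static_acc = sum(r["original_correct"] for r in results.values()) / n
    perturbed_acc = sum(
        sum(r["perturbed_correct"]) / len(r["perturbed_correct"])
        for r in results.values()
    ) / n
    return {"static_accuracy": static_acc, "perturbed_accuracy": perturbed_acc}
```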

Implications and Future Directions

This paper's approach to evaluation has potential implications for both the development and assessment of LLMs within the sphere of mathematical reasoning. By demonstrating that current models lack robustness in reasoning tasks, the paper suggests that existing evaluation metrics may overstate the capabilities of these sophisticated models.

Looking ahead, this research paves the way for advancements in both AI model training and evaluation. It underscores the need for LLMs to achieve a deeper understanding and consistent application of reasoning processes across diverse scenarios. The evaluation method could guide future model improvements, encouraging researchers to pursue methodologies that prioritize interpretability and robust, semantically grounded reasoning over performance on fixed benchmarks.

Conclusion

In conclusion, "ReasonAgain" contributes a novel perspective to LLM evaluation methodologies by introducing symbolic programs as a means to assess mathematical reasoning. This work not only highlights critical weaknesses in current models but also offers a structured path toward overcoming these limitations with innovative evaluation techniques. Future research could extend this framework to other domains and model types, contributing to sustained advancement in artificial intelligence reasoning capabilities.
