Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction (2406.00755v1)
Abstract: The rapid advancement of LLMs in the realm of mathematical reasoning necessitates comprehensive evaluations to gauge progress and inspire future directions. Existing assessments predominantly focus on problem-solving from the examinee perspective, overlooking the dual examiner perspective of error identification and correction. From the examiner perspective, we define four evaluation tasks for error identification and correction, along with a new dataset annotated with error types and steps. We also design diverse prompts to thoroughly evaluate eleven representative LLMs. Our principal findings indicate that GPT-4 outperforms all other models, while the open-source model LLaMA-2-7B demonstrates abilities comparable to the closed-source models GPT-3.5 and Gemini Pro. Notably, calculation error proves to be the most challenging error type. Moreover, prompting LLMs with the error types can improve the average correction accuracy by 47.9%. These results reveal potential directions for developing the mathematical reasoning abilities of LLMs. Our code and dataset are available at https://github.com/LittleCirc1e/EIC.
- Xiaoyuan Li (6 papers)
- Wenjie Wang (150 papers)
- Moxin Li (13 papers)
- Junrong Guo (1 paper)
- Yang Zhang (1129 papers)
- Fuli Feng (143 papers)
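
The abstract's finding that supplying the annotated error type improves correction accuracy lends itself to a brief illustration. Below is a minimal Python sketch of how such error-type-hinted prompting might be composed. The prompt wording, the `build_correction_prompt` helper, and the error-type names other than "calculation error" are assumptions for illustration, not the paper's released templates; the actual prompts and taxonomy are in the linked repository.

```python
# Hypothetical prompt builder illustrating error-type-hinted correction.
# The taxonomy entries below (except "calculation error", which the abstract
# names) are assumed; see https://github.com/LittleCirc1e/EIC for the
# paper's actual error types and prompt templates.

ERROR_TYPES = [
    "calculation error",   # named in the abstract as the hardest type
    "counting error",      # assumed taxonomy entry
    "formula confusion",   # assumed taxonomy entry
    "missing step",        # assumed taxonomy entry
]

def build_correction_prompt(question: str, solution: str,
                            error_type: str | None = None) -> str:
    """Compose a correction prompt; optionally hint the annotated error type."""
    hint = f"The solution is known to contain a {error_type}.\n" if error_type else ""
    return (
        "You are a math examiner. Review the step-by-step solution below, "
        "identify the first erroneous step, and give a corrected solution.\n\n"
        f"Question: {question}\n"
        f"Solution:\n{solution}\n\n"
        f"{hint}"
        "Answer with the erroneous step number and the corrected solution."
    )

if __name__ == "__main__":
    q = "Tom has 3 bags with 4 apples each. How many apples does he have?"
    s = "Step 1: 3 bags x 4 apples = 7 apples.\nStep 2: The answer is 7."
    # Without the hint the model must both identify and correct the error;
    # with the hint ("calculation error") only correction remains.
    print(build_correction_prompt(q, s))
    print("---")
    print(build_correction_prompt(q, s, error_type="calculation error"))
```

The two printed prompts make the evaluation contrast concrete: the hinted variant narrows the task from joint identification-and-correction to correction alone, which is one plausible reading of why the reported average correction accuracy rises by 47.9%.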