
Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction (2406.00755v1)

Published 2 Jun 2024 in cs.CL, cs.AI, and cs.LG

Abstract: The rapid advancement of LLMs in the realm of mathematical reasoning necessitates comprehensive evaluations to gauge progress and inspire future directions. Existing assessments predominantly focus on problem-solving from the examinee perspective, overlooking the dual perspective of the examiner regarding error identification and correction. From the examiner perspective, we define four evaluation tasks for error identification and correction, along with a new dataset annotated with error types and steps. We also design diverse prompts to thoroughly evaluate eleven representative LLMs. Our principal findings indicate that GPT-4 outperforms all models, while the open-source model LLaMA-2-7B demonstrates abilities comparable to the closed-source models GPT-3.5 and Gemini Pro. Notably, calculation error proves the most challenging error type. Moreover, prompting LLMs with the error types can improve the average correction accuracy by 47.9%. These results reveal potential directions for developing the mathematical reasoning abilities of LLMs. Our code and dataset are available at https://github.com/LittleCirc1e/EIC.

Authors (6)
  1. Xiaoyuan Li (6 papers)
  2. Wenjie Wang (150 papers)
  3. Moxin Li (13 papers)
  4. Junrong Guo (1 paper)
  5. Yang Zhang (1129 papers)
  6. Fuli Feng (143 papers)
Citations (9)