GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving (2402.10104v2)

Published 15 Feb 2024 in cs.AI and cs.CL

Abstract: Recent advancements in LLMs and multi-modal models (MMs) have demonstrated their remarkable capabilities in problem-solving. Yet their proficiency in tackling geometry math problems, which necessitates an integrated understanding of both textual and visual information, has not been thoroughly evaluated. To address this gap, we introduce the GeoEval benchmark, a comprehensive collection that includes a main subset of 2,000 problems, a 750-problem subset focusing on backward reasoning, an augmented subset of 2,000 problems, and a hard subset of 300 problems. This benchmark facilitates a deeper investigation into the performance of LLMs and MMs in solving geometry math problems. Our evaluation of ten LLMs and MMs across these varied subsets reveals that the WizardMath model excels, achieving 55.67% accuracy on the main subset but only 6.00% on the hard subset. This highlights the critical need to test models against datasets on which they have not been pre-trained. Additionally, our findings indicate that GPT-series models perform better on problems they have themselves rephrased, suggesting a promising method for enhancing model capabilities.
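For readers who want to reproduce this style of evaluation, the sketch below scores a model's exact-match accuracy per subset, mirroring the four GeoEval subsets described in the abstract. The file names, the question/answer JSON schema, and the query_model function are hypothetical placeholders for illustration, not the paper's actual harness or data format.

```python
import json

# Hypothetical file layout mirroring the four GeoEval subsets (assumed names).
SUBSETS = {
    "main": "geoeval_main.json",           # 2,000 problems
    "backward": "geoeval_backward.json",   # 750 backward-reasoning problems
    "augmented": "geoeval_augmented.json", # 2,000 augmented problems
    "hard": "geoeval_hard.json",           # 300 hard problems
}

def query_model(question: str) -> str:
    """Placeholder: call the LLM/MM under test and return its final answer."""
    raise NotImplementedError

def evaluate(path: str) -> float:
    """Exact-match accuracy over one subset; assumes each record has
    'question' and 'answer' fields (an assumed schema, not GeoEval's)."""
    with open(path) as f:
        problems = json.load(f)
    correct = sum(
        query_model(p["question"]).strip() == str(p["answer"]).strip()
        for p in problems
    )
    return correct / len(problems)

if __name__ == "__main__":
    for name, path in SUBSETS.items():
        print(f"{name}: {evaluate(path):.2%}")
```

Reporting accuracy per subset, rather than one pooled number, is what exposes the gap the authors highlight: a model can reach 55.67% on the main subset while scoring only 6.00% on the hard subset.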

Authors (6)
  1. Jiaxin Zhang (105 papers)
  2. Zhongzhi Li (10 papers)
  3. Mingliang Zhang (17 papers)
  4. Fei Yin (36 papers)
  5. Chenglin Liu (3 papers)
  6. Yashar Moshfeghi (16 papers)
Citations (8)