Evaluating the Mathematical Reasoning Abilities of Modern Models on MathVista
The paper "MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts" presents a comprehensive benchmark, MathVista, to evaluate the mathematical reasoning capabilities of state-of-the-art models in visually rich environments. This endeavor aims to bridge the evident gap in existing evaluations that predominantly focus on textual mathematical reasoning, thereby overlooking the intrinsic visual nature of many mathematical problems.
MathVista is an extensive dataset comprising 6,141 examples drawn from 28 existing multimodal datasets, supplemented by three newly created datasets: IQTest, FunctionQA, and PaperQA. The new datasets fill gaps left by existing resources, targeting logical reasoning on puzzle test figures (IQTest), algebraic reasoning on function plots (FunctionQA), and scientific reasoning with academic paper figures (PaperQA).
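As a minimal sketch of how the released benchmark can be accessed, assuming MathVista is hosted on the Hugging Face Hub as AI4Math/MathVista (the split and field names here follow the public dataset card and may differ):

```python
# A minimal sketch, assuming MathVista is hosted on the Hugging Face Hub
# as "AI4Math/MathVista"; the split and field names follow the public
# dataset card and may differ from the paper's internal format.
from datasets import load_dataset

# "testmini" is the 1,000-example subset commonly used for evaluation.
data = load_dataset("AI4Math/MathVista", split="testmini")

example = data[0]
print(example["question"])        # the natural-language question
print(example["metadata"])        # source dataset, task, grade level, etc.
example["decoded_image"].show()   # the associated visual context (PIL image)
```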
The evaluation benchmark covers five primary tasks: Figure Question Answering (FQA), Geometry Problem Solving (GPS), Math Word Problem (MWP), Textbook Question Answering (TQA), and Visual Question Answering (VQA). The paper focuses on seven core types of mathematical reasoning: algebraic, arithmetic, geometry, logical, numeric commonsense, scientific, and statistical reasoning.
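To make the task-level breakdown concrete, the sketch below aggregates accuracy per task from a list of scored predictions; the record layout (a task label and a correctness flag per example) is a hypothetical format chosen for illustration, not the paper's evaluation code:

```python
# A sketch of per-task accuracy aggregation; the `results` record layout
# (one dict per example with a task label and a correctness flag) is a
# hypothetical format, not the paper's actual evaluation output.
from collections import defaultdict

results = [
    {"task": "GPS", "correct": True},
    {"task": "GPS", "correct": False},
    {"task": "MWP", "correct": True},
    # ... one record per benchmark example
]

totals = defaultdict(int)
hits = defaultdict(int)
for r in results:
    totals[r["task"]] += 1
    hits[r["task"]] += r["correct"]

for task in sorted(totals):
    print(f"{task}: {hits[task] / totals[task]:.1%} ({totals[task]} examples)")
```

The same grouping can be applied along the reasoning-type axis, since each MathVista example is annotated with the skills it exercises.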
A thorough analysis was conducted on 12 prominent foundation models, spanning large language models (LLMs) and large multimodal models (LMMs). GPT-4V, the multimodal version of GPT-4, demonstrated the strongest performance with an overall accuracy of 49.9%, surpassing Multimodal Bard at 34.8%. Despite this significant advance, GPT-4V still trails human performance (60.3%) by 10.4 percentage points, leaving considerable room for improvement.
Moreover, GPT-4V was particularly strong in algebraic and geometric reasoning, even surpassing human performance in some visual contexts, such as function plots and geometry diagrams. The analysis also revealed an emergent capability of GPT-4V: self-verification, in which the model refines its responses through internal consistency checks.
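As a rough sketch of that behavior (not the paper's implementation), self-verification can be phrased as a loop: generate an answer, ask the model to check it against the question and figure, and revise on an inconsistency. Here query_model is a hypothetical stand-in for any multimodal model API:

```python
# A rough sketch of a self-verification loop, NOT the paper's implementation.
# `query_model` is a hypothetical stand-in for whatever LMM API is in use.
def query_model(prompt: str, image_path: str) -> str:
    raise NotImplementedError("plug in your multimodal model call here")

def answer_with_self_verification(question: str, image_path: str,
                                  max_rounds: int = 3) -> str:
    answer = query_model(question, image_path)
    for _ in range(max_rounds):
        # Ask the model to check its own answer for internal consistency.
        check = query_model(
            f"Question: {question}\nProposed answer: {answer}\n"
            "Is this answer consistent with the figure and the question? "
            "Reply CONSISTENT or INCONSISTENT, then explain.",
            image_path,
        )
        if check.strip().upper().startswith("CONSISTENT"):
            return answer
        # Revise the answer using the critique from the check step.
        answer = query_model(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Critique: {check}\nGive a corrected answer.",
            image_path,
        )
    return answer
```

In the paper's analysis this behavior emerged from GPT-4V itself rather than from external scaffolding; the loop above merely makes the pattern explicit.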
From a broader perspective, the paper underscores the need to continue developing general-purpose AI capable of effective mathematical reasoning in visual contexts. The limitations observed in current models, such as difficulties with logical reasoning and with interpreting complex figures, point to concrete areas for future research and development.
In conclusion, MathVista stands as a pivotal contribution to the evaluation of AI models, offering a rigorous framework that highlights both the progress made and the challenges that lie ahead in mathematical reasoning within visual contexts. Achieving parity with human reasoning across diverse tasks and contexts remains an ambitious yet critical goal for the AI research community.