U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs (2412.03205v3)

Published 4 Dec 2024 in cs.CL and cs.AI

Abstract: The current evaluation of mathematical skills in LLMs is limited, as existing benchmarks are either relatively small, primarily focus on elementary and high-school problems, or lack diversity in topics. Additionally, the inclusion of visual elements in tasks remains largely under-explored. To address these gaps, we introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials. It is balanced across six core subjects, with 20% of multimodal problems. Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions. To this end, we release $\mu$-MATH, a dataset to evaluate the LLMs' capabilities in judging solutions. The evaluation of general domain, math-specific, and multimodal LLMs highlights the challenges presented by U-MATH. Our findings reveal that LLMs achieve a maximum accuracy of only 63% on text-based tasks, with even lower 45% on visual problems. The solution assessment proves challenging for LLMs, with the best LLM judge having an F1-score of 80% on $\mu$-MATH.

Summary

  • The paper introduces U-MATH, a benchmark comprising 1,100 unique university-level math problems that challenge LLMs with complex reasoning tasks.
  • It includes multimodal tasks (20% of problems are visual) and a companion Meta U-MATH (μ-MATH) dataset to rigorously assess LLM judging capabilities.
  • Results show LLMs achieving only up to 63% accuracy on text tasks and 45% on multimodal tasks, highlighting the need for more advanced evaluation methods.

An Evaluation of Mathematical Capabilities in LLMs through the U-MATH Benchmark

In the ongoing pursuit to effectively quantify the mathematical reasoning capabilities of LLMs, the paper introduces U-MATH, a benchmark explicitly designed to evaluate how well these models cope with university-level mathematical challenges. The paper identifies several limitations in current benchmarks, which are largely confined to elementary or high-school mathematics and do not adequately capture the depth or breadth of university coursework. Notably, LLMs such as GPT-4 have achieved remarkable success on existing benchmarks like GSM8K and MATH but encounter considerable hurdles with advanced academic material.

The U-MATH benchmark is composed of 1,100 unique, unpublished, open-ended problems sourced from real-world teaching materials. It spans six core subjects, with 20% of the problems being multimodal and requiring the synthesis of textual and visual information. A distinguishing feature of U-MATH is that its problems demand a deeper level of reasoning than earlier benchmarks typically require. In conjunction with U-MATH, the authors introduce the Meta U-MATH (μ-MATH) benchmark, which scrutinizes the evaluative capabilities of LLM judges using tasks derived from U-MATH problems.
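
To make the setup concrete, here is a minimal sketch of loading and partitioning the benchmark, assuming the datasets are distributed via the Hugging Face Hub; the dataset IDs (`toloka/u-math`, `toloka/mu-math`), the `test` split, and the `image` field name are assumptions for illustration, not details confirmed by this summary.

```python
# Minimal sketch: load U-MATH / mu-MATH and separate text-only from multimodal
# problems. Dataset IDs, split name, and field names are assumptions.
from datasets import load_dataset

u_math = load_dataset("toloka/u-math", split="test")    # assumed dataset ID and split
mu_math = load_dataset("toloka/mu-math", split="test")  # assumed dataset ID and split

visual = [ex for ex in u_math if ex.get("image") is not None]   # assumed field name
textual = [ex for ex in u_math if ex.get("image") is None]

print(f"U-MATH: {len(u_math)} problems total, "
      f"{len(visual)} multimodal (~{100 * len(visual) / len(u_math):.0f}%)")
print(f"mu-MATH: {len(mu_math)} judgment tasks")
```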

The results from testing various LLMs reveal sobering performance. Across general-purpose, math-specific, and multimodal models, the highest accuracy achieved was only 63% on text-based tasks and a mere 45% on multimodal tasks. The analysis further extends to the evaluation of LLMs as judges, where the best-performing judge model achieved an F1-score of 80%, underscoring the inherent challenges these models face when assessing complex free-form answers.
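
As a concrete illustration of the judge meta-evaluation, the sketch below scores binary judge verdicts against gold correctness labels with a macro-averaged F1, one plausible instantiation of the F1 metric cited above; the toy labels and variable names are purely illustrative.

```python
# Minimal sketch: score an LLM judge's binary verdicts against gold labels.
# The macro average weighs the "correct" and "incorrect" classes equally, so a
# judge that over-accepts or over-rejects solutions is penalized on both sides.
from sklearn.metrics import f1_score

gold_labels  = [1, 0, 1, 1, 0, 1]   # 1 = solution actually correct, 0 = incorrect
judge_labels = [1, 0, 0, 1, 0, 1]   # verdicts parsed from the judge model's output

print(f"macro F1: {f1_score(gold_labels, judge_labels, average='macro'):.3f}")
```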

While Qwen2.5-Math-72B led among open-source models and Gemini-1.5-pro-002 led among proprietary ones, larger model sizes did not necessarily translate into higher accuracy, particularly for the judging function within the μ-MATH framework. The open challenges in visual problem-solving also highlight an area ripe for further work on LLM multimodal integration.

This research has far-reaching implications. The inadequacies in model performance on advanced problems underscore the need for more sophisticated models or methodologies, particularly those that integrate tool augmentation or hybrid human-in-the-loop solutions. Furthermore, the meta-evaluation dataset adds a crucial dimension to future research on optimizing evaluation mechanisms themselves, marking a significant step toward more reliable and unbiased assessment of LLM capabilities.

In summary, the introduction of U-MATH and its accompanying meta-evaluation component offers a robust mechanism for testing nuanced mathematical reasoning in LLMs. The paper paves the way for future research on more specialized and reliable tools to bridge the observed performance gaps, particularly tools designed for university-level mathematical reasoning and integrated evaluation. The research calls for a concerted effort to advance LLM technology toward meeting and exceeding these complex cognitive benchmarks.