
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models (2410.07985v2)

Published 10 Oct 2024 in cs.CL

Abstract: Recent advancements in LLMs have led to significant breakthroughs in mathematical reasoning capabilities. However, existing benchmarks like GSM8K or MATH are now being solved with high accuracy (e.g., OpenAI o1 achieves 94.8% on MATH dataset), indicating their inadequacy for truly challenging these models. To bridge this gap, we propose a comprehensive and challenging benchmark specifically designed to assess LLMs' mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4428 competition-level problems with rigorous human annotation. These problems are meticulously categorized into over 33 sub-domains and span more than 10 distinct difficulty levels, enabling a holistic assessment of model performance in Olympiad-mathematical reasoning. Furthermore, we conducted an in-depth analysis based on this benchmark. Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, with 60.54% and 52.55% accuracy, highlighting significant challenges in Olympiad-level mathematical reasoning.

Omni-MATH: A Universal Olympiad Level Mathematical Benchmark for LLMs

The paper "Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for LLMs" introduces a sophisticated benchmark designed to rigorously evaluate the mathematical reasoning capabilities of LLMs at an Olympiad-level. This benchmark is positioned as a necessary progression from existing datasets, which are increasingly inadequate for assessing the upper echelons of LLM mathematical problem-solving abilities.

Key Contributions

  1. Benchmark Development: Omni-MATH presents a comprehensive set of 4,428 competition-level problems, all meticulously annotated by humans. These problems are categorized into more than 33 sub-domains and over 10 distinct difficulty levels. This stratification allows for a nuanced evaluation of LLMs, providing insight into their strengths and weaknesses across mathematical domains.
  2. Experimental Validation: Two advanced OpenAI models, o1-mini and o1-preview, were evaluated on the Omni-MATH benchmark, reaching accuracy rates of only 60.54% and 52.55%, respectively. These results indicate that even cutting-edge models face considerable challenges on Olympiad-level problems.
  3. Omni-Judge: Proposed as an open-source verification tool, Omni-Judge achieves over 91% consistency with GPT-4o evaluations and 86% agreement with human judgments. This verifier offers a robust and efficient mechanism for checking answers to Olympiad-level problems, validating model assessments against human standards; a minimal sketch of such a judging flow follows this list.
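
To make the judging flow concrete, the following Python sketch shows how an LLM-as-judge verifier in the spirit of Omni-Judge might grade a candidate answer against a human-annotated reference. It is a minimal illustration under assumptions, not the authors' implementation: the prompt wording and the `query_model` helper (a stand-in for whatever inference API is in use) are hypothetical.

```python
# Minimal LLM-as-judge sketch in the spirit of Omni-Judge (not the authors' code).
# `query_model` is a hypothetical callable that sends a prompt to some LLM
# inference endpoint and returns its text response.

JUDGE_PROMPT = """You are a strict grader for Olympiad mathematics.
Problem: {problem}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: EQUIVALENT or DIFFERENT."""


def judge_answer(problem: str, reference: str, candidate: str, query_model) -> bool:
    """Return True if the judge model deems the candidate equivalent to the reference."""
    verdict = query_model(
        JUDGE_PROMPT.format(problem=problem, reference=reference, candidate=candidate)
    )
    return verdict.strip().upper().startswith("EQUIVALENT")
```

In practice, such a verifier is validated by its agreement with GPT-4o evaluations and human graders, which is what the consistency figures reported above measure.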

Implications and Future Directions

The development of Omni-MATH is a pivotal advancement, establishing a new standard for LLM evaluation in the domain of complex mathematical reasoning. The distinct categorization of problem domains and difficulties facilitates a granular analysis of LLM capabilities, enriching our understanding of model performance. The identified deficiencies, particularly in discrete mathematics, underscore ongoing challenges that necessitate further research and development.

Key future directions include:

  • Enhanced Test-Time Scaling: The paper finds that the Best-of-N approach has limited efficacy at the Olympiad level, suggesting the need for more innovative scaling techniques that better leverage model capabilities during inference (a minimal Best-of-N sketch follows this list).
  • Focused Domain Training: The difficulties faced by LLMs, especially in discrete mathematics, suggest potential benefits from targeted domain-specific training regimes, which may enhance generalization across unseen problem types.
  • Robust Data Leakage Analysis: The paper detects only nominal leakage of benchmark problems into models' training data, with minimal impact on overall performance metrics. Continued vigilance in this area will ensure the integrity and applicability of the benchmark results.
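
As a companion to the test-time scaling point above, the following sketch shows the basic Best-of-N loop: sample several independent solutions and count the problem as solved if any candidate passes a verifier. It is a minimal illustration under assumptions; `generate_solution` and `judge_answer` are hypothetical stand-ins, not the paper's pipeline.

```python
# Minimal Best-of-N sampling sketch (illustrative, not the authors' pipeline).
# `generate_solution` samples one solution attempt from a model;
# `judge_answer` is a verifier such as the LLM-as-judge sketch above.

def best_of_n(problem: str, reference: str, n: int, generate_solution, judge_answer) -> bool:
    """Sample n independent attempts; succeed if any attempt passes the verifier."""
    for _ in range(n):
        candidate = generate_solution(problem)           # one sampled attempt
        if judge_answer(problem, reference, candidate):  # verifier accepts?
            return True
    return False
```

The paper's observation is that increasing N yields limited gains on Olympiad-level problems, which motivates the search for scaling strategies beyond this simple loop.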

In conclusion, Omni-MATH provides a comprehensive framework to critically evaluate LLMs' mathematical reasoning skills at a high level of complexity. The benchmarks set forth by this work are poised to guide the next wave of research and development in LLM mathematical capabilities, ensuring these models can meet real-world technological and educational demands. The ongoing iterations and enhancements to models on benchmarks like Omni-MATH will drive forward our understanding and capabilities in artificial intelligence for mathematical reasoning.

Authors (20)
  1. Bofei Gao (15 papers)
  2. Feifan Song (14 papers)
  3. Zhe Yang (60 papers)
  4. Zefan Cai (26 papers)
  5. Yibo Miao (24 papers)
  6. Qingxiu Dong (39 papers)
  7. Lei Li (1293 papers)
  8. Chenghao Ma (3 papers)
  9. Liang Chen (360 papers)
  10. Runxin Xu (30 papers)
  11. Zhengyang Tang (13 papers)
  12. Benyou Wang (109 papers)
  13. Daoguang Zan (24 papers)
  14. Shanghaoran Quan (12 papers)
  15. Ge Zhang (170 papers)
  16. Lei Sha (34 papers)
  17. Yichang Zhang (24 papers)
  18. Xuancheng Ren (59 papers)
  19. Tianyu Liu (177 papers)
  20. Baobao Chang (80 papers)
Citations (3)