Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models (2503.21380v2)

Published 27 Mar 2025 in cs.CL

Abstract: In recent years, the rapid development of large reasoning models has resulted in the saturation of existing benchmarks for evaluating mathematical reasoning, highlighting the urgent need for more challenging and rigorous evaluation frameworks. To address this gap, we introduce OlymMATH, a novel Olympiad-level mathematical benchmark, designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions. The problems are systematically organized into two distinct difficulty tiers: (1) AIME-level problems (easy) that establish a baseline for mathematical reasoning assessment, and (2) significantly more challenging problems (hard) designed to push the boundaries of current state-of-the-art models. In our benchmark, these problems span four core mathematical fields, each including a verifiable numerical solution to enable objective, rule-based evaluation. Empirical results underscore the significant challenge presented by OlymMATH, with state-of-the-art models including DeepSeek-R1, OpenAI's o3-mini and Gemini 2.5 Pro Exp demonstrating notably limited accuracy on the hard subset. Furthermore, the benchmark facilitates comprehensive bilingual assessment of mathematical reasoning abilities, a critical dimension that remains largely unaddressed in mainstream mathematical reasoning benchmarks. We release the benchmark, evaluation code, detailed results and a data visualization tool at https://github.com/RUCAIBox/OlymMATH.

OlymMATH Benchmark Overview

The OlymMATH benchmark, presented in "Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models" (Sun et al., 27 Mar 2025), addresses the saturation of existing mathematical reasoning benchmarks such as GSM8K and MATH when state-of-the-art (SOTA) LLMs are evaluated on them. While datasets such as AIME provide greater difficulty, they have limitations of their own, including small scale and diminishing challenge for top-tier models. OlymMATH aims to provide a more rigorous, challenging, and comprehensive evaluation framework, notably by incorporating parallel English and Chinese versions to facilitate bilingual assessment.

Benchmark Design and Curation Process

The construction of OlymMATH involved a meticulous, manual curation process to mitigate the risk of data contamination often associated with web-scraped datasets.

  • Source Material: The 200 problems were sourced exclusively from printed materials, including specialized mathematical magazines, textbooks, and official competition documents. This deliberate choice minimizes the likelihood that the problems were part of the training corpora of the LLMs being evaluated.
  • Expert Verification: Each problem underwent verification and annotation by domain experts to ensure correctness and suitability.
  • Scope and Subject Areas: The benchmark spans four core mathematical disciplines typically found in Olympiad-level competitions: Algebra, Geometry, Number Theory, and Combinatorics. These fields were selected for their potential to generate challenging problems with verifiable solutions.
  • Text-Based Representation: All problems are presented in a text-only format. Geometry problems originally requiring diagrams were either excluded or carefully reformulated to ensure all necessary geometric information could be fully conveyed through textual descriptions. An example of this textual representation for a geometry problem is provided in Figure 3 of the paper.
  • Standardized Answer Format: To enable objective, rule-based, and automated evaluation, the expected answer format is strictly limited to real numbers or intervals. This constraint excludes answers involving set operations, variables, complex numbers, or free-form text. The sympy library is mentioned as a tool for verifying numerical equivalence, accommodating different mathematical representations of the same numerical value (e.g., $\sqrt{2-\sqrt{3}}$ vs. $(\sqrt{6}-\sqrt{2})/2$); a minimal sketch of such a check appears after this list.
  • Handling Multi-Solution Problems: For problems that naturally yield multiple valid solutions, the formulation was adjusted. Instead of requiring the model to list all solutions, the problem asks for a specific summary statistic derived from the set of all possible solutions (e.g., their sum or the sum of their squares), thus maintaining the single numerical answer format. Figure 4 illustrates this approach.
  • Format Compatibility: The dataset structure is designed to be compatible with the format used by the popular MATH benchmark (Hendrycks et al., 2021 ), facilitating easier integration into existing evaluation pipelines. Figure 5 shows an example instance.
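
The exact verification code ships with the released benchmark; the following is only a rough sketch of how a sympy-based equivalence check might look, using the example pair mentioned above (the function name and tolerance are illustrative choices, not taken from the paper).

```python
import sympy as sp

def numerically_equivalent(pred: str, gold: str, tol: float = 1e-9) -> bool:
    """Return True if two answer strings denote the same real number,
    even when written in different algebraic forms."""
    a, b = sp.sympify(pred), sp.sympify(gold)
    # Try an exact symbolic check first ...
    if sp.simplify(a - b) == 0:
        return True
    # ... and fall back to a high-precision numeric comparison,
    # which catches cases symbolic simplification cannot decide.
    return bool(abs(sp.N(a - b, 30)) < tol)

# Two renderings of the same value from the example above:
print(numerically_equivalent("sqrt(2 - sqrt(3))", "(sqrt(6) - sqrt(2))/2"))  # True
```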

Difficulty Stratification: Easy vs. Hard

OlymMATH incorporates a deliberate stratification of problems based on difficulty, dividing the 200 problems equally into two tiers:

  • OlymMATH-EASY: This subset comprises 100 problems calibrated to be approximately at the difficulty level of the American Invitational Mathematics Examination (AIME). It serves as a baseline for assessing mainstream LLMs under standard prompting methods. Empirical results suggest its difficulty aligns with recent AIME competition datasets.
  • OlymMATH-HARD: This subset contains 100 significantly more challenging problems selected to probe the limits of current SOTA reasoning models, especially those employing advanced reasoning strategies (e.g., "slow thinking" paradigms). The goal is to provide finer-grained differentiation among top-performing models where easier benchmarks may show performance saturation.

Table 2 in the paper details the distribution of problems across the four mathematical fields within both the Easy and Hard subsets.

Bilingual Evaluation Framework

A distinctive feature of OlymMATH is its explicit support for bilingual evaluation.

  • Translation Process: The problems were originally sourced in Chinese. They underwent a structured translation process into English, utilizing a two-stage LLM-based pipeline (Claude 3.7 Sonnet for initial translation, followed by GPT-4o for refinement). Crucially, human experts performed final verification and polishing to ensure mathematical accuracy and linguistic fidelity in the translated versions.
  • Parallel Datasets: This process yielded two parallel datasets: OlymMATH-EN (English) and OlymMATH-ZH (Chinese), each containing the same 200 problems.
  • Cross-Lingual Assessment: This parallel structure enables direct comparison of LLM mathematical reasoning performance across English and Chinese, addressing a gap in existing benchmarks that are predominantly monolingual. The paper argues for the necessity of such multilingual benchmarks for a holistic understanding of model capabilities, hypothesizing that observed performance differences may relate to the language distribution in pre-training data.

Empirical Evaluation and Key Findings

The paper presents empirical results from evaluating several prominent open-source (DeepSeek-R1, Qwen2.5-32B-R1D, QwQ-32B) and closed-source (OpenAI's o3-mini (high)) models on OlymMATH using Pass@1 (accuracy with a single generation attempt) and Cons@10 (consistency/accuracy across 10 samples) metrics.
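
The summary does not spell out how these metrics are computed; the sketch below shows one plausible reading, assuming each problem's sampled answers have already been extracted as normalized strings and a rule-based checker (e.g., the sympy comparison above, keyed to the gold answer for each problem id) is available. Cons@10 is interpreted here as majority voting over 10 samples; the names and signatures are illustrative rather than the authors' code.

```python
from collections import Counter
from typing import Callable, Dict, List

def pass_at_1(samples: Dict[str, List[str]],
              is_correct: Callable[[str, str], bool]) -> float:
    """Pass@1: accuracy when only a single sampled answer per problem is scored
    (here, simply the first sample for each problem)."""
    return sum(is_correct(pid, answers[0]) for pid, answers in samples.items()) / len(samples)

def cons_at_k(samples: Dict[str, List[str]],
              is_correct: Callable[[str, str], bool],
              k: int = 10) -> float:
    """Cons@k: majority-vote the first k sampled answers per problem,
    then score the voted answer against the reference.
    (String-level voting; a fuller implementation would group equivalent forms.)"""
    hits = 0
    for pid, answers in samples.items():
        voted, _ = Counter(answers[:k]).most_common(1)[0]
        hits += is_correct(pid, voted)
    return hits / len(samples)
```

Key findings from this evaluation include: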

  • Significant Challenge: The results confirm the high difficulty of the benchmark, particularly the OlymMATH-HARD subset. The best-performing model tested, o3-mini (high), achieved only 30.3% Pass@1 on OlymMATH-EN-HARD and 27.2% on OlymMATH-ZH-HARD. DeepSeek-R1, another strong model, scored 21.2% and 16.1%, respectively, on these hard subsets (Tables 4 & 5). These scores are substantially lower than those reported for the same models on benchmarks like MATH-500 and AIME (Table 6), demonstrating OlymMATH's effectiveness in challenging current SOTA capabilities.
  • Improved Model Differentiation: The benchmark, especially the HARD subset, provides better resolution in distinguishing the capabilities of high-performing models. The performance gap between models like DeepSeek-R1 and Qwen2.5-32B-R1D is more pronounced on OlymMATH-HARD compared to MATH-500, where scores are closer to saturation points (Table 6).
  • Increased Reasoning Complexity: Analysis of response lengths generated by DeepSeek-R1 indicated significantly longer reasoning traces for OlymMATH problems compared to AIME. For instance, the average length for OlymMATH-EN-HARD solutions was approximately 42.8K characters (Table 7, Figure 6), quantitatively supporting the claim that these problems demand more extensive and complex reasoning steps.
  • Cross-Lingual Performance Gap: A consistent finding across evaluated models was superior performance on the English versions (OlymMATH-EN) compared to the Chinese versions (OlymMATH-ZH), suggesting potential biases stemming from the predominantly English nature of large-scale pre-training corpora.
  • Qualitative Insights on Reasoning: Qualitative analysis (Section 3.4, Figures 7-10) revealed instances where models might arrive at correct final answers through potentially unsound heuristics, assumptions (e.g., symmetry), or pattern matching ("empirical guessing") rather than rigorous mathematical derivation. The paper suggests that the increased complexity of OlymMATH-HARD problems makes them less susceptible to such shortcuts, often leading to incorrect answers when models attempt these strategies. This highlights the limitation of final-answer-only evaluation and motivates the need for process-based supervision and evaluation methods.

Conclusion

OlymMATH presents a challenging, manually curated, bilingual benchmark for evaluating advanced mathematical reasoning in LLMs at the Olympiad level. By sourcing problems from print materials, ensuring careful verification, stratifying difficulty, and providing parallel language versions, it addresses limitations of existing benchmarks. Empirical results demonstrate its ability to push the boundaries of current SOTA models and provide finer-grained differentiation, highlighting significant headroom for improvement in complex reasoning capabilities. The benchmark is released as part of the STILL project (https://github.com/RUCAIBox/Slow_Thinking_with_LLMs) to facilitate further research in this area.

Authors (8)
  1. Haoxiang Sun
  2. Yingqian Min
  3. Zhipeng Chen
  4. Wayne Xin Zhao
  5. Zheng Liu
  6. Zhongyuan Wang
  7. Lei Fang
  8. Ji-Rong Wen