OlymMATH Benchmark Overview
The OlymMATH benchmark, presented in "Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for LLMs" (Sun et al., 27 Mar 2025), is introduced to address the saturation observed in existing mathematical reasoning benchmarks such as GSM8K and MATH when evaluated against state-of-the-art (SOTA) LLMs. While datasets such as AIME offer greater difficulty, they suffer from limitations, including small scale and diminishing challenge for top-tier models. OlymMATH aims to provide a more rigorous, challenging, and comprehensive evaluation framework, notably by incorporating parallel English and Chinese versions to facilitate bilingual assessment.
Benchmark Design and Curation Process
The construction of OlymMATH involved a meticulous, manual curation process to mitigate the risk of data contamination often associated with web-scraped datasets.
- Source Material: The 200 problems were sourced exclusively from printed materials, including specialized mathematical magazines, textbooks, and official competition documents. This deliberate choice minimizes the likelihood that the problems were part of the training corpora of the LLMs being evaluated.
- Expert Verification: Each problem underwent verification and annotation by domain experts to ensure correctness and suitability.
- Scope and Subject Areas: The benchmark spans four core mathematical disciplines typically found in Olympiad-level competitions: Algebra, Geometry, Number Theory, and Combinatorics. These fields were selected for their potential to generate challenging problems with verifiable solutions.
- Text-Based Representation: All problems are presented in a text-only format. Geometry problems originally requiring diagrams were either excluded or carefully reformulated to ensure all necessary geometric information could be fully conveyed through textual descriptions. An example of this textual representation for a geometry problem is provided in Figure 3 of the paper.
- Standardized Answer Format: To enable objective, rule-based, and automated evaluation, the expected answer format is strictly limited to real numbers or intervals. This constraint excludes answers involving set operations, variables, complex numbers, or free-form text. The sympy library is mentioned as the tool for verifying numerical equivalence, so that different mathematical representations of the same numerical value are accepted (a minimal sketch of such a check is given after this list).
- Handling Multi-Solution Problems: For problems that naturally yield multiple valid solutions, the formulation was adjusted. Instead of requiring the model to list all solutions, the problem asks for a specific summary statistic derived from the set of all possible solutions (e.g., their sum or the sum of their squares), thus maintaining the single numerical answer format. Figure 4 illustrates this approach.
- Format Compatibility: The dataset structure is designed to be compatible with the format used by the popular MATH benchmark (Hendrycks et al., 2021), facilitating easier integration into existing evaluation pipelines. Figure 5 shows an example instance.
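The paper names sympy as the equivalence checker but does not reproduce the verification code, so the following is only a minimal sketch of how such a rule-based check could look. The function name answers_match and the numeric tolerance are illustrative assumptions; the interval-valued case is omitted for brevity.

```python
# Sketch of rule-based answer checking with sympy (not the authors' released code):
# two answer strings count as equivalent if their symbolic difference simplifies to
# zero, with a numeric fallback for expressions simplify() cannot close exactly.
from sympy import simplify, sympify

def answers_match(predicted: str, reference: str, tol: float = 1e-9) -> bool:
    """Return True if two answer strings denote the same real number."""
    try:
        pred, ref = sympify(predicted), sympify(reference)
    except Exception:
        return False  # unparsable model output is treated as incorrect
    diff = simplify(pred - ref)
    if diff == 0:
        return True
    try:
        return abs(float(diff.evalf())) < tol  # numeric fallback
    except (TypeError, ValueError):
        return False

# Different representations of the same value are accepted:
print(answers_match("sqrt(2)/2", "1/sqrt(2)"))  # True
print(answers_match("0.5", "1/2"))              # True
print(answers_match("pi", "3.14"))              # False
```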
Difficulty Stratification: Easy vs. Hard
OlymMATH incorporates a deliberate stratification of problems based on difficulty, dividing the 200 problems equally into two tiers:
- OlymMATH-EASY: This subset comprises 100 problems calibrated to be approximately at the difficulty level of the American Invitational Mathematics Examination (AIME). It serves as a baseline measure, challenging mainstream LLMs using standard prompting methods. Empirical results suggest its difficulty aligns with recent AIME competition datasets.
- OlymMATH-HARD: This subset contains 100 significantly more challenging problems selected to probe the limits of current SOTA reasoning models, especially those employing advanced reasoning strategies (e.g., "slow thinking" paradigms). The goal is to provide finer-grained differentiation among top-performing models where easier benchmarks may show performance saturation.
Table 2 in the paper details the distribution of problems across the four mathematical fields within both the Easy and Hard subsets.
Bilingual Evaluation Framework
A distinctive feature of OlymMATH is its explicit support for bilingual evaluation.
- Translation Process: The problems were originally sourced in Chinese. They underwent a structured translation process into English, utilizing a two-stage LLM-based pipeline (Claude Sonnet 3.7 for initial translation, followed by GPT-4o for refinement). Crucially, human experts performed final verification and polishing to ensure mathematical accuracy and linguistic fidelity in the translated versions (an illustrative sketch of such a pipeline follows this list).
- Parallel Datasets: This process yielded two parallel datasets: OlymMATH-EN (English) and OlymMATH-ZH (Chinese), each containing the same 200 problems.
- Cross-Lingual Assessment: This parallel structure enables direct comparison of LLM mathematical reasoning performance across English and Chinese, addressing a gap in existing benchmarks that are predominantly monolingual. The paper argues for the necessity of such multilingual benchmarks for a holistic understanding of model capabilities, hypothesizing that observed performance differences may relate to the language distribution in pre-training data.
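The paper describes the translate-then-refine pipeline but does not publish its prompts or configuration; the sketch below shows what such a two-stage pipeline might look like using the Anthropic and OpenAI Python SDKs. The prompt wording, model identifiers, and the translate_problem helper are assumptions introduced here for illustration.

```python
# Illustrative two-stage translate-then-refine pipeline in the spirit of the one
# described in the paper. Prompts and model IDs below are assumptions, not the
# authors' released configuration; per the paper, human experts verify the result.
import anthropic
from openai import OpenAI

def translate_problem(problem_zh: str) -> str:
    # Stage 1: initial Chinese-to-English translation (assumed Claude model ID).
    claude = anthropic.Anthropic()
    draft = claude.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=2048,
        messages=[{"role": "user", "content":
                   "Translate this olympiad math problem into English, "
                   "preserving all mathematical notation exactly:\n\n" + problem_zh}],
    ).content[0].text

    # Stage 2: refinement pass for fluency and mathematical fidelity.
    refined = OpenAI().chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
                   "Polish this translation of a math problem for clarity while "
                   "keeping the mathematics unchanged:\n\n" + draft}],
    ).choices[0].message.content
    return refined  # a human expert pass would still follow, per the paper
```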
Empirical Evaluation and Key Findings
The paper presents empirical results from evaluating several prominent open-source (DeepSeek-R1, Qwen2.5-32B-R1D, QwQ-32B) and closed-source (OpenAI's o3-mini (high)) models on OlymMATH, using Pass@1 (accuracy from a single generation attempt) and Cons@10 (majority-vote consistency across 10 sampled generations) as metrics.
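To make the two metrics concrete, the following is a minimal sketch (not the paper's evaluation harness) of how Pass@1 and Cons@10 can be computed, assuming each problem has a list of sampled answer strings and a matching function such as the sympy-based check sketched above; the exact-string majority vote is a simplification of answer normalization.

```python
# Sketch of the two reported metrics: Pass@1 scores a single sampled answer per
# problem, while Cons@k majority-votes over k sampled answers and scores the
# consensus answer against the reference.
from collections import Counter
from typing import Callable, List

def pass_at_1(samples: List[List[str]], refs: List[str],
              match: Callable[[str, str], bool]) -> float:
    """Fraction of problems whose first sampled answer matches the reference."""
    return sum(match(s[0], r) for s, r in zip(samples, refs)) / len(refs)

def cons_at_k(samples: List[List[str]], refs: List[str],
              match: Callable[[str, str], bool], k: int = 10) -> float:
    """Fraction of problems whose majority answer among k samples is correct."""
    correct = 0
    for sample_list, ref in zip(samples, refs):
        majority, _ = Counter(sample_list[:k]).most_common(1)[0]
        correct += match(majority, ref)
    return correct / len(refs)
```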
- Significant Challenge: The results confirm the high difficulty of the benchmark, particularly the OlymMATH-HARD subset. The best-performing model tested, o3-mini (high), achieved only 30.3% Pass@1 on OlymMATH-EN-HARD and 27.2% on OlymMATH-ZH-HARD. DeepSeek-R1, another strong model, scored 21.2% and 16.1%, respectively, on these hard subsets (Tables 4 & 5). These scores are substantially lower than those reported for the same models on benchmarks like MATH-500 and AIME (Table 6), demonstrating OlymMATH's effectiveness in challenging current SOTA capabilities.
- Improved Model Differentiation: The benchmark, especially the HARD subset, provides better resolution in distinguishing the capabilities of high-performing models. The performance gap between models like DeepSeek-R1 and Qwen2.5-32B-R1D is more pronounced on OlymMATH-HARD compared to MATH-500, where scores are closer to saturation points (Table 6).
- Increased Reasoning Complexity: Analysis of response lengths generated by DeepSeek-R1 indicated significantly longer reasoning traces for OlymMATH problems compared to AIME. For instance, the average length for OlymMATH-EN-HARD solutions was approximately 42.8K characters (Table 7, Figure 6), quantitatively supporting the claim that these problems demand more extensive and complex reasoning steps.
- Cross-Lingual Performance Gap: A consistent finding across evaluated models was superior performance on the English versions (OlymMATH-EN) compared to the Chinese versions (OlymMATH-ZH), suggesting potential biases stemming from the predominantly English nature of large-scale pre-training corpora.
- Qualitative Insights on Reasoning: Qualitative analysis (Section 3.4, Figures 7-10) revealed instances where models might arrive at correct final answers through potentially unsound heuristics, assumptions (e.g., symmetry), or pattern matching ("empirical guessing") rather than rigorous mathematical derivation. The paper suggests that the increased complexity of OlymMATH-HARD problems makes them less susceptible to such shortcuts, often leading to incorrect answers when models attempt these strategies. This highlights the limitation of final-answer-only evaluation and motivates the need for process-based supervision and evaluation methods.
Conclusion
OlymMATH presents a challenging, manually curated, bilingual benchmark for evaluating advanced mathematical reasoning in LLMs at the Olympiad level. By sourcing problems from print materials, ensuring careful verification, stratifying difficulty, and providing parallel language versions, it addresses limitations of existing benchmarks. Empirical results demonstrate its ability to push the boundaries of current SOTA models and provide finer-grained differentiation, highlighting significant headroom for improvement in complex reasoning capabilities. The benchmark is released as part of the STILL project (https://github.com/RUCAIBox/Slow_Thinking_with_LLMs) to facilitate further research in this area.