RIMO-N: Olympiad Math Benchmark
- RIMO-N is a benchmark that rewrites IMO problems into a unique integer answer format for noise-free evaluation of LLM mathematical reasoning.
- It applies deterministic transformation rules to ensure unambiguous grading while preserving Olympiad-level complexity across algebra, geometry, number theory, and combinatorics.
- Empirical results reveal a 30–50 point drop in accuracy compared to standard benchmarks, highlighting the challenge of advanced mathematical reasoning for current LLMs.
RIMO-N is a benchmark suite that recasts real International Mathematical Olympiad (IMO) and shortlist problems into unique-integer answer format, yielding a noise-free, deterministic metric for evaluating LLM mathematical reasoning at Olympiad difficulty. Developed as the first track of the RIMO benchmark, RIMO-N addresses critical shortcomings in prior mathematical LLM benchmarks by simultaneously enforcing IMO-level complexity and absolute reproducibility of evaluation, thus constituting a central standard for the assessment of advanced machine reasoning on mathematics (Chen et al., 9 Sep 2025).
1. Motivation and Benchmarking Rationale
RIMO-N was introduced in response to saturation on high-school-level math datasets such as GSM8K and MATH, where leading LLMs routinely surpass 90% accuracy. Earlier attempts to extend evaluation to Olympiad material, using benchmarks derived from genuine IMO problems, suffered from two confounding factors:
- Grading noise from answers in multiple formats (fractions, radicals, intervals, open-form proofs).
- Reliance on model-based judges or normalizer scripts, introducing subjectivity and reproducibility issues.
RIMO-N circumvents these pitfalls by systematically rewriting each source problem so that it admits exactly one integer answer. Thus, correctness reduces to O(1) string-matching, removing any dependency on learned or heuristic grading mechanisms or external symbolic computational engines. The integrity of the original logical challenge is preserved: only the final marked outcome is altered to enforce unicity and integrality, never the combinatorial or analytic structure of the problem (Chen et al., 9 Sep 2025).
2. Problem Selection, Curation, and Remaking Methodology
The RIMO-N corpus is based on a comprehensive sweep of all IMO contest and shortlist problems from 1959 through 2023. The construction pipeline involves several deterministic rules:
- Every contest and shortlist problem is included subject to the existence of at least two matching integer solutions from independent sources (jury, AoPS Wiki, YouTube, ParSe); disagreements trigger manual review.
- Each problem is rewritten so that the final query is “Compute the unique integer satisfying …”, with original hypotheses tightened if required to ensure unicity and answer format uniformity.
- Typical transformations include: recasting existence or concurrency claims as integer counts, transforming set classification into indexed summations, and reformulating open enumerations into precise tally or aggregation questions.
The protocol always maintains full logical equivalence and IMO difficulty while excluding ambiguous answers and non-integer correct responses. Where earlier records might be ambiguous, contest items are admitted only after confirming consensus on an integer solution (Chen et al., 9 Sep 2025).
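The consensus rule in the first bullet above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `consensus_answer`, the `min_sources` parameter, and the source labels in the usage note are hypothetical, and it assumes the rule means "accept only when at least two sources report the same integer and no source disagrees".

```python
def consensus_answer(reported, min_sources=2):
    """Return the agreed integer answer, or None when manual review is needed.

    `reported` maps a source label to the integer answer that source gives.
    Assumed rule: accept only if at least `min_sources` sources report an
    answer and all of them agree; anything else is flagged for review.
    """
    values = set(reported.values())
    if len(reported) < min_sources or len(values) != 1:
        return None  # too few sources, or a disagreement: manual review
    return values.pop()
```

For example, `consensus_answer({"jury": 7, "aops_wiki": 7})` returns `7`, while a single source or two conflicting sources return `None`.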
3. Dataset Structure, Problem Distribution, and Statistical Analysis
RIMO-N contains a total of 335 problems, distributed across algebra (96), geometry (95), number theory (86), and combinatorics (58), spanning five clear difficulty tiers from accessible shortlist items to hardest contest challenges. 236 problems derive from the shortlist and 99 from actual contest papers. Answer values are distributed to inhibit guessing:
- 96 problems have answers in the binary set {0, 1}, reflecting transformed true/false claims or intersection counts.
- The remaining problems’ solutions span two- and three-digit integers, negating strategies based solely on output bias.
This structure ensures the testbed embodies IMO-level heterogeneity both in topic and answer pattern, inhibiting overfitting to narrow methods or statistical priors (Chen et al., 9 Sep 2025).
| Topic | Problem Count | Example Transform |
|---|---|---|
| Algebra | 96 | Sums over tuples |
| Geometry | 95 | Intersection count |
| Number Theory | 86 | Integer classification |
| Combinatorics | 58 | Aggregated counts |
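As a quick consistency check on the figures in this section, the snippet below recomputes the totals and the share of binary-answer problems. The variable names are illustrative. The only inference beyond the stated counts is the bound in the last comment: a strategy that only ever outputs one of the two binary values can be right on at most the 96 binary-answer problems.

```python
# Counts as stated in this section.
topic_counts = {"algebra": 96, "geometry": 95, "number theory": 86, "combinatorics": 58}
source_counts = {"shortlist": 236, "contest": 99}
binary_answer_problems = 96

total = sum(topic_counts.values())
# The topic split and the shortlist/contest split must describe the same suite.
assert total == sum(source_counts.values())

# Share of problems whose answer is binary. This is also an upper bound on
# pass@1 for any strategy that only ever emits one of the two binary values.
binary_share = binary_answer_problems / total
print(f"total={total}, binary share={binary_share:.1%}")
```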
4. Evaluation Protocol and Metric
Every model is assessed by greedy decoding (temperature 0). The candidate model output is parsed and compared by exact string match against the unique correct integer. The evaluation metric is pass@1 accuracy over the entire 335-problem suite. This protocol enforces:
- Zero dependency on learned or engineered judges.
- Deterministic, O(1) cost per sample.
- Full reproducibility and unambiguity, with no symbolic normalization or canonicalization (Chen et al., 9 Sep 2025).
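A grader of this kind might look like the sketch below. The source does not specify how the integer is parsed out of the model output, so `extract_final_integer` (which takes the last integer literal in the text) is an assumption made purely for illustration; the comparison itself is a plain string match with no normalization, as described above.

```python
import re


def extract_final_integer(output):
    """Hypothetical parser: return the last integer literal in the output, or None."""
    matches = re.findall(r"-?\d+", output)
    return matches[-1] if matches else None


def is_correct(output, gold):
    """Exact string match between the parsed answer and the gold integer."""
    predicted = extract_final_integer(output)
    return predicted is not None and predicted == str(gold)


def pass_at_1(outputs, golds):
    """Fraction of problems whose single greedy output matches the gold answer."""
    if not golds or len(outputs) != len(golds):
        raise ValueError("outputs and golds must be non-empty and of equal length")
    return sum(is_correct(o, g) for o, g in zip(outputs, golds)) / len(golds)
```

Because no canonicalization is applied, an output such as `007` does not match a gold answer of `7` under this sketch; whether the actual harness behaves the same way depends on its parser.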
5. Model Performance and Results
The benchmarking campaign covers ten leading LLMs, including Qwen3-8B, GPT-4o, Gemini-2.5-flash, and DeepSeek-R1-671B. The results show a sharp accuracy drop relative to prior benchmarks (accuracy in percent):
| Model | GSM8K | MATH | RIMO-N |
|---|---|---|---|
| Qwen3-8B | 93.00 % | 70.90 % | 36.72 % |
| GPT-4o-2024-08-06 | 95.80 % | 64.88 % | 33.43 % |
| Gemini-2.5-flash | 97.04 % | 91.31 % | 58.81 % |
| DeepSeek-R1-671B | 96.13 % | 90.45 % | 62.96 % |
All models exhibit a drop of roughly 30–50 points in pass@1 on RIMO-N relative to their GSM8K and MATH scores. The gap persists independent of scale: for example, Gemini-2.5-flash nearly matches DeepSeek-R1-671B, while GPT-4o lags notably. Distilled checkpoints can underperform their smaller counterparts, with results contingent on training data quality and objectives. Models optimized for explicit chain-of-thought or self-refinement robustly surpass vanilla versions of equivalent size (Chen et al., 9 Sep 2025).
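The per-model gaps can be recomputed directly from the four rows of the table above. The sketch below does only that arithmetic; the `scores` dictionary copies the table values, and the helper name `drops` is illustrative.

```python
# (GSM8K, MATH, RIMO-N) accuracy in percent, copied from the table above.
scores = {
    "Qwen3-8B": (93.00, 70.90, 36.72),
    "GPT-4o-2024-08-06": (95.80, 64.88, 33.43),
    "Gemini-2.5-flash": (97.04, 91.31, 58.81),
    "DeepSeek-R1-671B": (96.13, 90.45, 62.96),
}


def drops(table):
    """Map each model to (GSM8K - RIMO-N, MATH - RIMO-N) in percentage points."""
    return {
        model: (round(gsm8k - rimo, 2), round(math_acc - rimo, 2))
        for model, (gsm8k, math_acc, rimo) in table.items()
    }


for model, (vs_gsm8k, vs_math) in drops(scores).items():
    print(f"{model}: -{vs_gsm8k} vs GSM8K, -{vs_math} vs MATH")
```

For these four rows the gap relative to MATH falls between about 27 and 34 points, and the gap relative to GSM8K between about 33 and 62 points.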
6. Representative Problem Transformations
RIMO-N problems are faithful encodings of canonical IMO constructs. Illustrative examples include:
- Concurrency statements recast as intersection counts: "How many common intersection points do these three altitudes have?"
- Classification over integer triples reframed into summation: the triple ranges over all integer solutions of a given equation, a quantity is defined for each solution, and the task is to compute the resulting sum.
- Negative existence proofs into zero counts: "How many triples of diagonals in a regular pentagon concur at a single point?" (Chen et al., 9 Sep 2025).
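The zero-count example above can be checked numerically. The sketch below is an independent sanity check, not part of the benchmark: it treats each diagonal of a regular pentagon as a full line (an assumption, stricter than treating it as a segment) and uses the standard fact that three lines a·x + b·y + c = 0 are concurrent or mutually parallel exactly when the determinant of their coefficient rows vanishes. All function names and the tolerance `tol` are illustrative.

```python
import math
from itertools import combinations


def pentagon_vertices():
    """Vertices of a regular pentagon inscribed in the unit circle."""
    return [(math.cos(2 * math.pi * k / 5), math.sin(2 * math.pi * k / 5)) for k in range(5)]


def line_through(p, q):
    """Coefficients (a, b, c) of the line a*x + b*y + c = 0 through p and q."""
    (x1, y1), (x2, y2) = p, q
    return (y1 - y2, x2 - x1, x1 * y2 - x2 * y1)


def det3(r1, r2, r3):
    """Determinant of the 3x3 matrix with rows r1, r2, r3."""
    (a1, b1, c1), (a2, b2, c2), (a3, b3, c3) = r1, r2, r3
    return a1 * (b2 * c3 - b3 * c2) - b1 * (a2 * c3 - a3 * c2) + c1 * (a2 * b3 - a3 * b2)


def count_concurrent_diagonal_triples(tol=1e-9):
    """Count triples of pentagon diagonals (as lines) with a vanishing determinant."""
    v = pentagon_vertices()
    # A diagonal joins two non-adjacent vertices: index gap 2 or 3 (mod 5).
    diagonals = [
        line_through(v[i], v[j])
        for i, j in combinations(range(5), 2)
        if (j - i) % 5 in (2, 3)
    ]
    return sum(1 for triple in combinations(diagonals, 3) if abs(det3(*triple)) < tol)


print(count_concurrent_diagonal_triples())
```

The count agrees with the geometric argument: each vertex lies on only two diagonals and each interior crossing involves exactly two, so no triple concurs.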
7. Significance, Insights, and Implications
RIMO-N provides the first mathematically rigorous, fully noise-free Olympiad-grade benchmark suite, facilitating high-resolution evaluation of mathematical reasoning. The transition from ambiguous answer formats to the unique-integer regime enables reproducibility and direct comparisons across models and epochs. The marked performance drop across all modern LLMs, despite strong results on GSM8K or MATH, exposes a persistent gap between large-model algebraic manipulation and genuine Olympiad-level problem solving. Restricting evaluation to the binary-answer subset artificially inflates scores by 8–30 points, confirming that answer sparsity (a wide range of possible integer answers) is a nontrivial aspect of the challenge. A plausible implication is that further progress on chain-of-thought-driven training objectives, enhanced retrieval, and explicit mathematical reasoning pipelines is required to approach true Olympiad competence (Chen et al., 9 Sep 2025).