
IMO-GradingBench: Proof Grading Benchmark

Updated 4 November 2025
  • IMO-GradingBench is an evaluation benchmark that tests autograder performance on proof-based solutions for International Mathematical Olympiad problems using a rigorously annotated dataset.
  • The benchmark employs standardized grading rubrics and metrics, including accuracy and mean absolute error, to assess partial credit and solution quality with precision.
  • It facilitates scalable research in automated grading by providing diverse, human-rated examples that reflect real-world mathematical reasoning challenges.

IMO-GradingBench is an evaluation benchmark for automated grading of proof-based solutions to Olympiad-level mathematics problems, specifically targeting the International Mathematical Olympiad (IMO). Developed as part of the IMO-Bench initiative (Luong et al., 3 Nov 2025), its principal contribution is a large, rigorously annotated dataset that enables direct, quantitative assessment of autograders—AI systems capable of assigning partial credit to mathematical proofs in a manner analogous to expert human graders. This resource aims to address the essential meta-reasoning task of evaluating not just final answers or solution existence, but the quality, rigor, and partial progress reflected in long-form mathematical reasoning.

1. Motivation and Evaluation Scope

IMO-GradingBench was created to measure and promote robustness in mathematical reasoning evaluation while filling methodological gaps in existing benchmarks, which traditionally focus on short-answer correctness or solution generation rather than grading free-form proofs. Its design responds to critical needs:

  • Scalability: Human grading is labor-intensive and slow for large collections of complex proofs;
  • Rigorous meta-reasoning: Automated system evaluation demands the ability to recognize nuance, partial progress, and minor errors—mirroring human grader standards;
  • Educational and research feedback: The benchmark facilitates training and evaluation of autograders for use in mathematical education and research.

It targets Olympiad-grade problems, advancing beyond saturated benchmarks such as MATH and GSM8K by addressing full proofs rather than short answers (Luong et al., 3 Nov 2025).

2. Dataset Construction and Annotation Methodology

IMO-GradingBench comprises 1,000 solutions to IMO-level proof problems, sampled from the Advanced IMO-ProofBench set. Each data point consists of:

  • The original problem statement;
  • A candidate solution, typically generated by cutting-edge LLMs, exhibiting varied correctness and quality;
  • A human-assigned grade (0–7 points), mapped to four discrete categories for robust metric computation.
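As a concrete illustration, a single record might be represented as in the Python sketch below. The class and field names are hypothetical (not the benchmark's released schema); the point-to-category mapping follows the table later in this section.

```python
from dataclasses import dataclass

# Collapse the 0-7 IMO point scale into the four grading categories
# (7 = correct, 4-6 = almost, 1-3 = partial, 0 = incorrect).
CATEGORY_OF_POINTS = {
    7: "correct",
    6: "almost", 5: "almost", 4: "almost",
    3: "partial", 2: "partial", 1: "partial",
    0: "incorrect",
}

@dataclass
class GradingExample:
    """One IMO-GradingBench data point (field names are illustrative)."""
    problem: str        # original IMO-level problem statement
    solution: str       # candidate proof, typically LLM-generated
    human_points: int   # expert grade on the 0-7 IMO scale

    @property
    def category(self) -> str:
        """Discrete category used for metric computation."""
        return CATEGORY_OF_POINTS[self.human_points]
```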

Human graders—experienced mathematicians with IMO scoring expertise—assigned grades according to standard Olympiad protocol:

| Category | IMO Points | Solution Quality |
|---|---|---|
| Correct | 7 | Fully correct, rigorous, and complete |
| Almost Correct | 4–6 | Minor errors or incomplete justification |
| Partial | 1–3 | Some progress but incomplete or flawed |
| Incorrect | 0 | Irrelevant or incorrect |

Problems were sampled so that all four categories are represented roughly equally, maximizing diagnostic value when benchmarking autograders (Luong et al., 3 Nov 2025).

3. Grading Rubric and Guidelines

The benchmark employs a simplified rubric modeled on IMO standards. Graders assign solutions to precisely one of the following categories:

  • Correct: Full solution, with all steps justified and arguments complete;
  • Almost: Solution is correct in principle but contains minor omissions or inaccuracies;
  • Partial: Substantial work done, but significant gaps or errors remain;
  • Incorrect: Solution is essentially wrong, off-topic, or trivial.

This discretization enables standardized accuracy and error calculation. Model prompts mimic the IMO grading scenario: the autograder receives only the problem and the proposed solution, with no reference answer or grading guideline, mirroring blind peer review (Luong et al., 3 Nov 2025).
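A minimal sketch of a grading prompt under these constraints is given below; the wording and function name are illustrative assumptions, not the benchmark's released prompt.

```python
def build_grading_prompt(problem: str, solution: str) -> str:
    """Blind-grading prompt: only the problem and candidate solution are shown
    (no reference solution, no grading guideline), and the model must end its
    reply with exactly one rubric word."""
    return (
        "You are grading a proposed solution to an Olympiad mathematics problem.\n\n"
        f"Problem:\n{problem}\n\n"
        f"Proposed solution:\n{solution}\n\n"
        "Assess the solution's correctness and rigor, then end your response "
        "with a single word: correct, almost, partial, or incorrect."
    )
```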

4. Evaluation Protocols and Metrics

IMO-GradingBench introduces two principal metrics for autograder evaluation:

  1. Accuracy: Fraction of solutions where the model prediction matches the human-assigned category.
  2. Mean Absolute Error (MAE): Average absolute difference between the model's grade and the human grade, after each category is mapped to a canonical score (7 for Correct, 6 for Almost, 1 for Partial, 0 for Incorrect).

The protocol requires each model output to end with a single rubric word ("correct", "almost", "partial", or "incorrect"), which allows automated score extraction and direct comparison with expert grades.
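A sketch of the scoring step under this protocol, assuming the model's reply is free-form text ending in one rubric word; the canonical-score mapping (7/6/1/0) is the one stated above.

```python
CANONICAL_SCORE = {"correct": 7, "almost": 6, "partial": 1, "incorrect": 0}

def extract_category(model_output: str) -> str:
    """Read the final rubric word from the model's reply, as the protocol requires."""
    last_word = model_output.strip().split()[-1].lower().strip(".!\"'")
    if last_word not in CANONICAL_SCORE:
        raise ValueError(f"Reply does not end with a rubric word: {last_word!r}")
    return last_word

def accuracy_and_mae(predicted: list[str], human: list[str]) -> tuple[float, float]:
    """Accuracy = fraction of exact category matches;
    MAE = mean |canonical(predicted) - canonical(human)|."""
    matches = sum(p == h for p, h in zip(predicted, human))
    abs_errors = [abs(CANONICAL_SCORE[p] - CANONICAL_SCORE[h])
                  for p, h in zip(predicted, human)]
    return matches / len(human), sum(abs_errors) / len(human)
```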

5. Use in Autograder Development and Validation

The benchmark serves as a testbed for autograder systems—AI agents or LLMs tasked with grading mathematical proof solutions. It supports development and fine-tuning of models intended for:

  • Educational feedback applications, where rapid, nuanced assessment is required;
  • Research into model "understanding" of mathematical argumentation and error patterns;
  • Quantitative benchmarking of mathematical meta-reasoning, distinct from answer-finding or proof-writing.

For evaluation on IMO-GradingBench, autograders are not given reference solutions or detailed grading guidelines; they must infer correctness and assign a categorical grade based solely on the presented argument (Luong et al., 3 Nov 2025).
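Putting the pieces together, a hypothetical evaluation loop might look as follows. The `autograder` callable stands in for any LLM call and is an assumption, as are the helper names carried over from the earlier sketches.

```python
from typing import Callable, Iterable

def evaluate_autograder(
    autograder: Callable[[str], str],      # hypothetical: prompt -> model reply
    examples: Iterable["GradingExample"],  # records as sketched in Section 2
) -> tuple[float, float]:
    """Blind evaluation: no reference solutions or guidelines are ever supplied."""
    predicted, human = [], []
    for ex in examples:
        reply = autograder(build_grading_prompt(ex.problem, ex.solution))
        predicted.append(extract_category(reply))
        human.append(ex.category)
    return accuracy_and_mae(predicted, human)
```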

6. Results, Analysis, and Limitations

Reported autograder performance highlights the difficulty of human-aligned grading:

| Model | Accuracy | MAE |
|---|---|---|
| Gemini 2.5 Pro | 44.3% | 30.2% |
| SOTA non-Gemini | 50.2% | 18.4% |
| Gemini 2.5 Deep Think | 52.5% | 20.5% |
| o3 (OpenAI) | 54.0% | 20.2% |

Key observations:

  • State-of-the-art models match human grading at best ~54% of the time, underscoring the challenge of nuanced proof evaluation;
  • Most errors occur between "partial" and "incorrect", revealing limits in recognizing borderline or pedagogical distinctions;
  • Models rarely confuse "correct" and "incorrect", suggesting reliable recognition of gross validity but difficulty with subtler errors.

When autograders are supplied with reference solutions and grading guidelines (as in the related IMO-ProofBench evaluation), far higher correlations with expert judgment are reported (up to 0.96 Pearson), indicating that the core difficulty lies in the blind evaluation setting provided by IMO-GradingBench (Luong et al., 3 Nov 2025).

7. Impact and Future Directions

IMO-GradingBench marks a critical advance in automated evaluation protocols for proof-based mathematical reasoning at the Olympiad level:

  • It enables scalable research on autograder reliability, partial credit assignment, and diagnostic coverage;
  • By establishing "north-star" metrics for meta-reasoning, it provides the community with a concrete target for progress beyond answer-centric evaluation;
  • Its open release (https://imobench.github.io/) encourages replication, fine-tuning, and benchmarking for both educational and research AI systems.

These results suggest that future work will focus on improving nuance recognition, pedagogical understanding, and robustness of automated graders, moving toward expert-level performance in mathematical solution assessment.
