IMO-AG-30 Dataset: Geometry Theorem Benchmark
- IMO-AG-30 is a benchmark comprising 30 IMO geometry problems translated into a formal DSL for rigorous automated evaluation.
- The dataset is used to evaluate diverse methods, including synthetic, algebraic, and LLM-assisted approaches, with solve rates reported against human medalist baselines.
- It serves as a standard testbed to identify strengths and limitations of hybrid reasoning methods in complex geometry problem-solving.
The IMO-AG-30 dataset is a curated benchmark of 30 Euclidean geometry problems drawn from the geometry questions posed at the International Mathematical Olympiad (IMO) over a 23-year span. Developed primarily for the evaluation of theorem-proving systems at the olympiad level, it functions as a critical resource for assessing the capabilities of both symbolic and neuro-symbolic solvers in automated geometry reasoning, with problem statements rigorously formalized in logic-based domain-specific languages (DSLs) suitable for advanced proof engines. The dataset has become the de facto standard for measuring progress in automated olympiad-level geometry, highlighting specific strengths and limitations of synthetic, algebraic, and LLM-driven approaches (Zhang et al., 14 Dec 2024, Sinha et al., 9 Apr 2024).
1. Benchmark Composition and Scope
IMO-AG-30 consists of exactly 30 problems, each sourced from the IMO geometry corpus between 2000 and 2022. Every problem was manually translated from its original English statement into the AlphaGeometry domain-specific language, which encodes geometric constructions (points, lines, circles) and formalizes the goal in proof-assistant-readable syntax. There is no per-problem topic tagging, standardized difficulty labelling, or supplementary annotation; the only attached metadata is the DSL script specifying the formalized configuration and the target theorem statement.
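The benchmark ships only the formalized DSL scripts, so any programmatic record structure is left to the evaluator. The following minimal sketch shows one way such an entry might be represented in Python; the field names and the load_imo_ag_30 loader are hypothetical conveniences, not part of any official release.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ImoAg30Problem:
    """One benchmark entry: only the formalized DSL script is attached."""
    problem_id: str    # e.g. "IMO 2008 P6" (identifier format is illustrative)
    dsl_premises: str  # construction steps in the AlphaGeometry-style DSL
    dsl_goal: str      # target statement, e.g. an eqangle/cong/simtri assertion

def load_imo_ag_30(path: str) -> list[ImoAg30Problem]:
    """Hypothetical loader for a local copy of the 30 DSL scripts.

    The on-disk layout is not standardized by the benchmark, so this is a
    placeholder for whatever format a given evaluation harness adopts.
    """
    raise NotImplementedError("depends on the local copy of the DSL scripts")
```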
Officially reported human baselines give a direct measure of problem difficulty: the average IMO contestant solves 15.2 of the 30 problems, a bronze medalist 19.3, a silver medalist 22.9, and a gold medalist 25.9 (Zhang et al., 14 Dec 2024, Sinha et al., 9 Apr 2024). The hardest problems in the set are IMO 2000 P6 and IMO 2008 P6, which require advanced auxiliary constructions and benefit most from learning-based tree guidance.
A summary table consolidating human and system performance appears below:
| Method | Problems Solved / 30 |
|---|---|
| GPT-4 | 0 |
| DD+AR (AlphaGeometry symbolic engine) | 14 |
| Average IMO contestant | 15.2 |
| DD (TongGeometry, no value) | 18 |
| Bronze medalist | 19.3 |
| Silver medalist | 22.9 |
| AlphaGeometry (DD+AR+LLM) | 25 |
| Gold medalist | 25.9 |
| TongGeometry (w/o value) | 28 |
| TongGeometry (full) | 30 |
System entries reflect the number of problems solved under each paper's reported compute budget (for example, a 90-minute per-problem limit for TongGeometry); human entries are average medalist scores (Zhang et al., 14 Dec 2024).
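For reporting, system solve counts are typically compared against the human baselines above. A small helper along these lines makes the mapping explicit; the function name and tier labels are illustrative, while the thresholds are the published baseline averages.

```python
# Published human baselines on IMO-AG-30 (average problems solved out of 30).
HUMAN_BASELINES = [
    ("gold medalist", 25.9),
    ("silver medalist", 22.9),
    ("bronze medalist", 19.3),
    ("average IMO contestant", 15.2),
]

def medal_equivalent(solved: int) -> str:
    """Return the highest human baseline that a solve count meets or exceeds."""
    for label, threshold in HUMAN_BASELINES:
        if solved >= threshold:
            return f"at or above {label} level ({threshold}/30)"
    return "below average contestant level (15.2/30)"

print(medal_equivalent(25))  # AlphaGeometry: "at or above silver medalist level (22.9/30)"
print(medal_equivalent(30))  # TongGeometry: "at or above gold medalist level (25.9/30)"
```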
2. Problem Formalization and Encoding Practices
Each IMO-AG-30 problem is specified in a formal DSL tailored to the reasoning system under evaluation. For AlphaGeometry and TongGeometry, the DSL encodes both geometric construction primitives (e.g., ExtendEqual(A, I), IntersectLineLine(X, e, Y, f)) and inference assertions (e.g., eqangle, cong, simtri). No formal grammar is published with the benchmark itself; the encoding mirrors the internal architecture of the respective proof engines.
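Because no grammar is published, any programmatic representation of these scripts is necessarily a reconstruction. The sketch below treats a problem as a list of construction steps plus a goal predicate, reusing the primitive and predicate names quoted above as opaque strings; the arity table and the example goal are assumptions for illustration only, not the benchmark's actual syntax.

```python
# Illustrative only: predicate names are taken from the text above, but their
# arities and this whole encoding are assumptions, not the published DSL.
ASSUMED_ARITIES = {"eqangle": 8, "cong": 4, "simtri": 6}

problem = {
    "constructions": [
        ("ExtendEqual", ["A", "I"]),                 # construction primitives as quoted above
        ("IntersectLineLine", ["X", "e", "Y", "f"]),
    ],
    "goal": ("cong", ["A", "X", "A", "Y"]),          # hypothetical goal: |AX| = |AY|
}

def goal_arity_ok(goal: tuple) -> bool:
    """Shallow sanity check: the goal predicate carries the expected number of arguments."""
    name, args = goal
    expected = ASSUMED_ARITIES.get(name)
    return expected is None or len(args) == expected

assert goal_arity_ok(problem["goal"])
```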
Alternative symbolic encodings are implemented for Wu’s method, which expresses each configuration as a coordinate assignment, a sequence of polynomial and bilinear constraints (hypotheses), and a statement of the desired conclusion in polynomial form. This structure is reflected in the JGEX .gex input format. Examples include explicit expressions for intersection points, algebraic angle equalities, collinearity determinants, and power-of-a-point polynomial equalities (Sinha et al., 9 Apr 2024).
Illustrative DSL and symbolic encodings are present in the technical appendices of the cited works, but no diagrammatic or SVG data is included at inference time—only the textual DSL or coordinate constraints are used as solver input.
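As a concrete illustration of the algebraic route, the sketch below encodes a textbook statement (the diagonals of a parallelogram bisect each other, not an IMO-AG-30 problem) as coordinate hypotheses and a polynomial conclusion, then tests ideal membership with sympy. Gröbner-basis reduction is used here as a stand-in for Wu's characteristic-set triangulation and pseudo-division; note how the conclusion only follows once the non-degeneracy factor u1 (that is, A ≠ B) is taken into account.

```python
from sympy import symbols, groebner

# Free coordinates u1..u3 place A=(0,0), B=(u1,0), D=(u2,u3); dependent
# coordinates give C=(xc,yc) with ABCD a parallelogram, and P=(x1,x2) the
# intersection of the diagonals AC and BD.
u1, u2, u3, xc, yc, x1, x2 = symbols("u1 u2 u3 xc yc x1 x2")

hypotheses = [
    xc - u1 - u2,                     # C_x = u1 + u2
    yc - u3,                          # C_y = u3
    x1 * yc - x2 * xc,                # P collinear with A and C
    (x1 - u1) * u3 - x2 * (u2 - u1),  # P collinear with B and D
]
G = groebner(hypotheses, x1, x2, xc, yc, u1, u2, u3, order="lex")

conclusion = 2 * x2 - u3              # P is the midpoint of AC (y-coordinate: 2*x2 = yc = u3)

_, rem_plain = G.reduce(conclusion)
_, rem_nondeg = G.reduce(u1 * conclusion)
print(rem_plain)    # nonzero remainder: the bare statement fails on the degenerate locus A = B
print(rem_nondeg)   # 0: the conclusion follows under the non-degeneracy condition u1 != 0
```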
3. Dataset Handling: Preprocessing, Annotation, and Release Protocol
All 30 benchmark problems were manually translated from natural-language (English) IMO statements into the AlphaGeometry DSL. No automated diagram parsing, additional normalization, or machine-vision pipeline is involved, and no auxiliary-construction hints or lemma annotations are associated with the dataset.
No train/val/test split is defined; IMO-AG-30 is used exclusively as a testbed for method evaluation. There is no evidence in the dataset papers of fine-tuning or learning on these problems, reinforcing the benchmark’s role as a pure generalization challenge (Zhang et al., 14 Dec 2024).
In the context of JGEX/Wu's method, each problem is encoded in the .gex format. Four of the 30 problems could not be represented in JGEX due to unsupported constructions, whereas all 30 are expressible in the DSL used by the neuro-symbolic systems.
4. Solution Statistics, Baseline Methods, and Evaluation Protocols
End-to-end solve rates under different symbolic and neuro-symbolic pipelines have become the dataset’s principal metric. The following ensemble and baseline approaches have been systematically compared:
- Wu’s method (JGEX implementation): Solves 15/26 (58%) of the representable problems within a five-minute CPU-only budget; excels on collinearity/concurrency, distance, and area-based statements.
- DD+AR (deductive database + algebraic angle/ratio reasoning): Solves 14/30 (47%) without LLM augmentation; covers most synthetic, human-style cases.
- AlphaGeometry (DD+AR+LLM): Solves 25/30 (83%); nontrivially outperforms all previous symbolic baselines.
- Wu+DD+AR ensemble: Solves 21/30 (70%); demonstrates the complementarity of algebraic and synthetic pipelines (a toy coverage sketch follows this list).
- Wu+AlphaGeometry ensemble: Solves 27/30 (90%); exceeds the average gold-medalist human performance.
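Ensemble figures are obtained by running the constituent pipelines in parallel and counting the union of problems solved by any of them. The toy sketch below shows that bookkeeping with entirely hypothetical per-problem ID sets; the real per-problem breakdowns are given in the cited papers and are not reproduced here.

```python
# Hypothetical solved-problem ID sets; the real per-problem results live in the papers.
solved = {
    "wu":    {"2000-P1", "2002-P2", "2005-P5"},
    "dd_ar": {"2000-P1", "2004-P1", "2019-P2"},
    "ag":    {"2000-P1", "2004-P1", "2019-P2", "2008-P1"},
}

def ensemble_coverage(*members: str) -> set:
    """Union of problems solved by any member pipeline when run in parallel."""
    covered = set()
    for member in members:
        covered |= solved[member]
    return covered

# Complementary pipelines add coverage even when each one overlaps partially.
print(len(ensemble_coverage("wu", "dd_ar")))  # 5 with these placeholder sets
print(len(ensemble_coverage("wu", "ag")))     # 6 with these placeholder sets
```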
A methodological tabulation of key approaches is presented below:
| System | Setting | Solves |
|---|---|---|
| Wu’s (JGEX) | 5 min CPU-only limit, 16 GB RAM; 26/30 representable in .gex | 15 / 26 |
| DD+AR | AlphaGeometry symbolic engine, no LLM | 14 / 30 |
| AlphaGeometry (AG) | LLM-augmented DD+AR | 25 / 30 |
| Wu+DD+AR | Parallel ensemble | 21 / 30 |
| Wu+AlphaGeometry | Parallel ensemble | 27 / 30 |
| TongGeometry (full) | Actor–critic search, 32 CPU cores + 1× RTX 4090, 90 min/problem | 30 / 30 |
Timing analysis shows that Wu’s method either solves problems in seconds or fails due to memory/time blowup, highlighting "all-or-nothing" behavior. Synthetic methods show steadier but narrower coverage. Actor–critic search with value-guided heuristics (as in TongGeometry) is required for exhaustive coverage, particularly on the hardest instances like IMO 2000 P6 and IMO 2008 P6 (Zhang et al., 14 Dec 2024, Sinha et al., 9 Apr 2024).
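Reported results assume strict per-problem wall-clock budgets (five minutes for Wu's method, 90 minutes for TongGeometry). A minimal harness in the spirit of these protocols is sketched below; the solver command, file name, and exit-code convention are placeholders, not the actual tooling of either system.

```python
import subprocess

def run_with_budget(cmd: list, budget_s: int) -> str:
    """Run a solver command under a wall-clock budget and classify the outcome."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=budget_s)
    except subprocess.TimeoutExpired:
        return "timeout"  # e.g. an algebraic blow-up exceeding the time budget
    return "solved" if proc.returncode == 0 else "failed"

# Placeholder invocation: 300 s mirrors the five-minute Wu's-method limit,
# 5400 s would mirror the 90-minute per-problem budget.
# print(run_with_budget(["./wu_solver", "imo-2000-p1.gex"], budget_s=300))
```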
5. Empirical Patterns and Methodological Implications
The IMO-AG-30 benchmark exposes dual strengths and gaps among contemporary geometric reasoning engines:
- Algebraic elimination (Wu’s method): Superior for problems reducible to pure polynomial constraints (collinearity, concurrency, distance/area invariants), often proving results in milliseconds to seconds, with explicit computation of non-degeneracy conditions.
- Synthetic reasoning (DD+AR): Superior for angle-chasing, ratio-based assertions, and cases involving non-algebraic points (angle bisectors, incenters).
- LLM-assisted construction (AlphaGeometry, TongGeometry): Critical for discovering non-obvious auxiliary constructions absent from fixed symbolic pipelines, and for bridging the gap between the breadth of synthetic methods and the mechanistic rigor of algebraic eliminations.
The complementarity of methods is evident: Wu’s approach resolves cases that are algebraically tractable but synthetically opaque, while AlphaGeometry and its derivatives outperform pure symbolic systems through learning-driven auxiliary suggestion and value-guided search.
Actor–critic reinforcement and value heuristics in TongGeometry enable full coverage, setting a new state-of-the-art by solving all 30 problems with feasible resource requirements on consumer-grade hardware (Zhang et al., 14 Dec 2024). A plausible implication is that further progress will favor hybrid approaches, blending algebraic elimination’s precision with the generalization benefits of large-scale model-guided construction and proof planning.
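The value-guided search described above can be summarized, very roughly, as best-first expansion of candidate auxiliary constructions ranked by a learned value estimate. The sketch below is a schematic of that idea only: the state representation, proposal function, and scoring function are placeholders standing in for TongGeometry's actual actor and critic networks.

```python
import heapq
from typing import Callable, Optional, Tuple

State = Tuple[str, ...]  # a proof state, abstracted as the auxiliary constructions added so far

def value_guided_search(
    start: State,
    propose: Callable[[State], list],   # "actor": candidate auxiliary constructions
    value: Callable[[State], float],    # "critic": estimated promise of a state
    is_closed: Callable[[State], bool], # does the symbolic engine now close the goal?
    max_expansions: int = 1000,
) -> Optional[State]:
    """Best-first search over auxiliary constructions, expanding the highest-value state first."""
    frontier = [(-value(start), start)]
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, state = heapq.heappop(frontier)
        if is_closed(state):
            return state
        for aux in propose(state):
            nxt = state + (aux,)
            heapq.heappush(frontier, (-value(nxt), nxt))
    return None
```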
6. Benchmarking Practices and Standardization
IMO-AG-30 serves as a standardized, purely held-out evaluation set; no training or tuning on the benchmarked items is documented. All current evaluations use uniform compute limits per problem (90 minutes for TongGeometry, five minutes for Wu’s method), with clearly specified hardware configurations. This strict protocol ensures that comparative claims about AI and human competitiveness in olympiad geometry are robust and repeatable. It further positions IMO-AG-30 as the premier challenge suite for the geometry theorem-proving research community (Zhang et al., 14 Dec 2024, Sinha et al., 9 Apr 2024).