HAGeo-409: Olympiad Geometry Benchmark

Updated 5 December 2025
  • HAGeo-409 is a comprehensive benchmark comprising 409 Olympiad-level Euclidean geometry problems with quantitative human-assessed difficulty annotations.
  • It employs a rigorous pipeline—from automated language conversion and numeric verification to manual corrections—to ensure problem validity and precision.
  • Comparative evaluations reveal that heuristic methods significantly outperform random baselines and neural approaches, especially on harder geometry problems.

HAGeo-409 is a comprehensive benchmark designed for the rigorous evaluation of automated theorem-proving systems in Olympiad-level Euclidean geometry. It comprises 409 problems spanning a broad range of human-assessed difficulty levels, explicitly addressing deficiencies in earlier benchmarks by ensuring both greater scale and precise, quantitative difficulty annotation. HAGeo-409 supports geometry-specific representations compatible with both synthetic (e.g., DDAR) and neural network–based (e.g., AlphaGeometry) inference, enabling fair and transparent comparison across diverse solution paradigms (Duan et al., 27 Nov 2025).

1. Motivation and Rationale

Existing benchmarks are limited; the most notable, IMO-30, consists of only 30 problems extracted from International Mathematical Olympiads (2000–2022). IMO-30 lacks formal quantitative difficulty annotation, and subject-matter experts have identified its problems as largely “easy,” with an average difficulty of approximately 2.85 on a 1–7 scale. This limited size and narrow range result in high variance during evaluation and constrain the assessment of solver capability, particularly for harder instances. HAGeo-409 was constructed to curate a larger and more representative set of Olympiad-level geometry theorems, each annotated with a human-assigned difficulty score, thereby enabling granular analysis of system performance relative to problem hardness. In addition, HAGeo-409 adopts a geometry-specific, numerically verified encoding, ensuring compatibility with both established synthetic engines and neural architectures (Duan et al., 27 Nov 2025).

2. Problem Acquisition and Verification Pipeline

Problem selection for HAGeo-409 begins with two principal sources: the Art of Problem Solving (AoPS) contest archive, contributing over 2,000 geometry problems across national, regional, and mock Olympiads, and ShuZhiMi, a WeChat mini-program for high-school mathematics, which contributed approximately 50 additional user-rated, Olympiad-style problems.

The conversion and verification workflow operates as follows:

  1. Each natural-language problem statement is translated into a construction-based, GeoGebra-inspired language by prompting GPT-4o with few-shot examples.
  2. Preliminary, automated numeric verification is performed to rule out invalid or ill-posed theorems.
  3. Problems failing conversion or verification (~50%) undergo manual correction and inspection.
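
A minimal sketch of the numeric check in step 2, assuming each problem is compiled into a routine that samples random coordinate instances of its construction. The interface, goal predicate, trial count, and tolerance below are illustrative, not the authors' implementation:

```python
import math
import random

def numeric_verify(instantiate, goal_holds, trials=10, tol=1e-6):
    """Reject ill-posed theorems: the goal must hold on every random
    numeric instance of the construction (hypothetical interface)."""
    for seed in range(trials):
        points = instantiate(random.Random(seed))  # dict: name -> (x, y)
        if not goal_holds(points, tol):
            return False  # one numeric counterexample invalidates the problem
    return True

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Example predicate for a goal like "cong O A O K", i.e. |OA| = |OK|.
def cong_OA_OK(points, tol):
    return abs(dist(points["O"], points["A"]) -
               dist(points["O"], points["K"])) < tol
```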

Problems retained in the benchmark meet the criteria of possessing a geometry-specific encoding, successful numeric verification, and a human-assigned difficulty rating. The final collection covers the gamut of standard Olympiad geometry topics, including triangle centers (incenter, centroid, circumcenter, orthocenter), circle theorems, angle-chasing, concurrency, collinearity, similarity, and power of a point (Duan et al., 27 Nov 2025).

3. Human-Assessed Difficulty Annotation

Each HAGeo-409 problem is annotated with a difficulty score on a 1–7 scale (1 = “very easy,” 7 = “very hard”), based directly on arithmetic mean ratings from ShuZhiMi users. No further normalization or rubrics are applied; thus, scores reflect the unweighted crowd-sourced consensus. For reporting and analysis, problems are grouped into five bins: [1,3), [3,4), [4,5), [5,6), [6,7].
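
A small helper illustrating the grouping rule (half-open bins except the last, which is closed at 7); this is a sketch of the binning described above, not tooling from the paper:

```python
BINS = [(1, 3), (3, 4), (4, 5), (5, 6), (6, 7)]

def difficulty_bin(score):
    """Map a mean ShuZhiMi rating in [1, 7] to its reporting bin."""
    for lo, hi in BINS:
        # every bin is half-open [lo, hi) except the last, [6, 7]
        if lo <= score < hi or (hi == 7 and score == 7):
            return (lo, hi)
    raise ValueError(f"rating {score} lies outside the 1-7 scale")
```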

This detailed annotation framework offers fine-grained insight into solver performance across the difficulty spectrum, which is absent in previous datasets such as IMO-30 (Duan et al., 27 Nov 2025).

4. Dataset Construction and Representation

Problems are encoded in a domain-specific textual language inspired by GeoGebra, comprising primitive geometric constructions:

  • Point definitions: e.g., “A B C = triangle”
  • Lines: “l = line A B”
  • Circles: “ω = circle_center_point O P”
  • Intersections: “X, Y = intersection l ω”
  • Special constructions: midpoints, reflections, perpendicular feet, angle-equal loci, circumcenters, incenters, etc.

Each problem file concludes with a goal statement, such as “Prove: cong O A O K” or “Prove: collinear K O3 O6.” All geometric objects are referred to by name; explicit coordinates are withheld from the solver. Numeric verification is used internally for solution validation.
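
For concreteness, a hypothetical problem file assembled from the primitives above (stating the Euler line theorem) might read as follows; the exact construction names and goal are illustrative, not an actual benchmark entry:

```
A B C = triangle
O = circumcenter A B C
G = centroid A B C
H = orthocenter A B C
Prove: collinear O G H
```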

For broader applicability, a manually corrected conversion into AlphaGeometry's point-only format is also provided, facilitating direct comparison between symbolic and neural approaches (Duan et al., 27 Nov 2025).

5. Statistical Overview

The benchmark's coverage and problem statistics are summarized below:

Difficulty Bin    Number of Problems    Proportion (%)
[1, 3)            161                   39.4
[3, 4)            112                   27.4
[4, 5)            71                    17.4
[5, 6)            43                    10.5
[6, 7]            22                    5.4
  • Average difficulty: 3.47 (compared to IMO-30’s 2.85).
  • Automated conversion: ~50% of AoPS problems were successfully converted via GPT-4o; ~20% passed numeric verification automatically, with the remainder revised manually.
  • Topic distribution: Triangle centers & concurrency (~30%), circle theorems & power of a point (~25%), angle-chasing & similarity (~20%), collinearity & ratio arguments (~15%), advanced topics (~10%) (Duan et al., 27 Nov 2025).

6. Evaluation Methodology

Evaluation is standardized as follows:

  • Computational environment: a 64-core CPU for all DDAR-based and heuristic runs; an 80 GB A100 GPU is used exclusively for AlphaGeometry’s neural point proposer.
  • Per-problem runtime: Each DDAR invocation receives a 60-second time limit.
  • Aggregate experiment duration: Each experiment (e.g., pass@K evaluation) is limited to at most 1.5 hours on 64 cores.
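
The per-trial budget can be enforced with a simple wrapper; the sketch below assumes a hypothetical ddar_solve command-line entry point, since the actual solver invocation is not specified in the source:

```python
import subprocess

def run_ddar_trial(problem_path, timeout_s=60):
    """Run one DDAR attempt under the 60-second per-problem budget.
    'ddar_solve' is a placeholder command, not a documented CLI."""
    try:
        result = subprocess.run(
            ["ddar_solve", problem_path],
            capture_output=True,
            timeout=timeout_s,
        )
        return result.returncode == 0  # success iff a proof was found
    except subprocess.TimeoutExpired:
        return False  # timeouts count as failures for pass@K
```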

Pass@K metric definition:

  • For each problem, up to K different auxiliary-point augmentations may be attempted.
  • S(K) denotes the number of problems solved within K trials.
  • pass@K (%) = 100 × S(K) / 409.

In experimental evaluation, K is typically set to 2048 and 8192 to study solver scaling (Duan et al., 27 Nov 2025).
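
Under these definitions, pass@K reduces to a simple count over solved problems; a minimal sketch, where the input format is an assumption rather than something prescribed by the benchmark:

```python
def pass_at_k(first_success, k, n_problems=409):
    """Compute pass@K (%) over HAGeo-409.

    first_success maps each problem id to the 1-based index of its first
    successful augmentation trial, or None if never solved (assumed format).
    """
    solved = sum(1 for t in first_success.values() if t is not None and t <= k)
    return 100.0 * solved / n_problems

# e.g. pass_at_k(results, 2048) and pass_at_k(results, 8192)
```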

7. Baseline Performance and Comparative Analysis

HAGeo-409 enables systematic head-to-head comparison between neural (AlphaGeometry), random baseline, and heuristic methods (HAGeo). Success rates by difficulty bin and overall pass@K scores are summarized in the table below:

Method           [1,3)    [3,4)    [4,5)    [5,6)    [6,7]    Total (%)
AlphaGeometry    73.3%    39.3%    18.3%    4.7%     0.0%     43.3
Random @2048     78.9%    55.4%    18.3%    4.7%     0.0%     49.9
HAGeo @2048      87.6%    77.7%    40.8%    11.6%    4.5%     64.3
Random @8192     79.5%    61.6%    25.4%    7.0%     0.0%     53.3
HAGeo @8192      92.5%    83.0%    50.7%    16.3%    9.1%     70.2

Key findings include an absolute improvement of roughly 14 percentage points by HAGeo over the random baseline at K = 2048 (64.3% vs 49.9%) and about 17 points at K = 8192 (70.2% vs 53.3%). Relative to AlphaGeometry’s overall 43.3%, HAGeo reaches 64.3% at K = 2048 and 70.2% at K = 8192. Notably, these gains are amplified in the harder bins ([3,4), [4,5), [5,6), [6,7]), confirming the heuristic method’s generalization to challenging Olympiad problems (Duan et al., 27 Nov 2025).

8. Adoption and Research Utility

HAGeo-409 is publicly accessible, with each problem provided as a plain-text GeoGebra-style file and an accompanying table of difficulty scores. Recommended benchmarking follows the established pass@K protocol with a 60-second budget per DDAR trial, reporting success rates by difficulty bin and overall, and comparing results to the baseline performance table above. An alternate encoding in AlphaGeometry’s point-only format is provided for users focused on neural architectures. HAGeo-409 thus defines a reproducible and challenging benchmark for advancing automated geometric theorem proving (Duan et al., 27 Nov 2025).
