LeanGeo-Bench: Automated Geometry Benchmark

Updated 23 August 2025
  • LeanGeo-Bench is a formal benchmark for assessing automated geometric reasoning and formal proof production using Lean 4 and specialized tactics.
  • It integrates diverse problems drawn from Olympiad, high school, and synthetic sources into a unified, hierarchical framework for geometry.
  • Benchmark results reveal that LLMs perform well on simpler tasks but struggle with intricate, competition-level problems requiring deep deductive reasoning.

LeanGeo-Bench is a formal benchmark designed to evaluate automated theorem provers and LLMs on competition-level geometry, situated within the LeanGeo project and built on the Lean 4 theorem prover. It pairs principled formalization with a challenging suite of problems drawn from international Olympiads, high school competitions, and synthetic sources, enabling rigorous assessment of deductive geometric reasoning, formal proof production, and cross-domain mathematical integration.

1. Formal Framework of LeanGeo

LeanGeo provides a unified formal system for expressing and verifying plane geometry problems within Lean 4, with particular emphasis on bridging intuitive geometric reasoning and proof formalization. The library consists of hierarchical geometric definitions, including constructs such as Midpoint, Circumcenter, and RadicalAxis, and a repository of 260 formally proved theorems that range from foundational triangle properties to advanced statements originating from International Mathematical Olympiad (IMO) problems. Proofs within LeanGeo adopt a declarative style, built from simple axioms toward complex results through specialized tactics—such as euclid_intros, euclid_apply, and euclid_finish—that mirror human mathematical practice. To overcome obstacles inherent to geometric formalization, LeanGeo utilizes explicit configuration encoding for geometric cases and leverages external SMT solvers for automating routine deduction steps.
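
As a flavor of this declarative style, the following minimal sketch states and closes a trivial goal. The euclid_* tactic names come from the description above, but the statement shape, the Midpoint argument order, and the dist spelling are illustrative assumptions rather than verbatim LeanGeo API:

-- Minimal sketch of LeanGeo's declarative style. The Midpoint argument
-- order and the dist spelling are assumptions; only the euclid_* tactics
-- are taken from the library description above.
theorem midpoint_equidistant : ∀ (A B M : Point),
    Midpoint M A B →
    dist A M = dist M B := by
  euclid_intros   -- bring A, B, M and the Midpoint hypothesis into scope
  euclid_finish   -- close the goal, delegating routine deduction to SMT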

2. Integration with Lean 4 and Mathlib

LeanGeo is implemented natively in Lean 4, utilizing the comprehensive logic and tactic framework of the theorem prover along with seamless access to the Mathlib mathematical library. Basic geometric axioms are formalized as a foundation (drawing from projects such as LeanEuclid), upon which higher-level geometry theorems are derived. This layered construction supports proofs that span multiple mathematical fields; for example, formal proofs for IMO problems can invoke trigonometric identities from Mathlib to encode angle relationships. Automation is further assisted by LeanSMT, where local hypotheses are dispatched to the CVC5 solver and SMT-generated steps are cached, streamlining repetitive segments of the proof and enabling tractable formalization of more intricate competition problems.
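
As a concrete instance of such a cross-domain step, the snippet below discharges an angle identity with a genuine Mathlib lemma (Real.sin_two_mul is actual Mathlib API; any LeanGeo-specific geometric setup is omitted):

import Mathlib

-- A Mathlib trigonometric identity of the kind a LeanGeo proof can invoke
-- to encode an angle relationship; this example contains no LeanGeo code.
example (θ : ℝ) : Real.sin (2 * θ) = 2 * Real.sin θ * Real.cos θ :=
  Real.sin_two_mul θ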

3. Composition and Characteristics of LeanGeo-Bench

LeanGeo-Bench comprises 122 formalized geometry problems curated from several distinguished sources:

| Source | Number of problems | Notes |
|---|---|---|
| UniGeo corpus | 10 | Foundational, LeanEuclid base |
| LeanGeo theorem library | 10 | Manually formalized |
| Gemini_synthetic | 20 | Synthetic, generation pipeline |
| NuminaMath HSC | 20 | High school competitions |
| Olympiad textbook | 19 | Autoformalized, advanced |
| IMO 2000–present | 43 | Autoformalized, human-reviewed |

Problems span topics such as triangle geometry, circle configurations, quadrilaterals, triangle centers (incenter, circumcenter), and, in select cases, geometric inequalities. LeanGeo-Bench includes both synthetic and authentic competition tasks, many of which require nontrivial derivation of angle and segment relationships. A distinctive feature is the provision of human-readable formal proofs written in Lean’s tactic language, enhancing accessibility for both automated tools and human practitioners. For instance, the proof of an isosceles triangle result is given as:

theorem isoTriangle_imp_eq_angles : ∀ (A B C : Point),
    IsoTriangle A B C →
    (∠ A:B:C = ∠ A:C:B) := by
  -- Introduce the three points and the isosceles hypothesis.
  euclid_intros
  -- Construct the midpoint D of segment BC and the line through B and C.
  euclid_apply exists_midpoint B C as D
  euclid_apply line_from_points B C as BC
  euclid_apply coll_angles_eq
  -- Triangles DBA and DCA are congruent by SSS, equating the base angles.
  euclid_apply congruentTriangles_SSS D B A D C A
  euclid_apply coll_angles_eq
  -- Close the remaining goal, delegating routine deduction to the SMT backend.
  euclid_finish

4. LLM Evaluation Methodology and Results

LeanGeo-Bench serves as a standardized testbed for assessing the formal proof-generation capabilities of LLMs. Models are evaluated on pass@k scores (the proportion of problems solved within k samples), with performance further broken down by problem category; a sketch of the pass@k computation follows the list below. Baseline results from recent studies show the following trends:

  • o4-mini achieves approximately 19.67% pass@1 and 22.13% pass@4.
  • Gemini 2.5 Pro attains 17.21% pass@1 and 27.05% pass@4.
  • Performance on synthetic and high-school-level problems is higher than on Olympiad-grade problems; notably, none of the tested LLMs solves any of the Olympiad-level items.
  • Overall, current LLM methods exhibit <30% success rates and struggle with high-complexity formal geometric argumentation.
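
For concreteness, the standard unbiased pass@k estimator computes, per problem, 1 − C(n−c, k)/C(n, k) for n sampled proofs of which c verify, then averages over problems. A minimal Lean sketch follows; passAtK is a hypothetical helper, not part of the benchmark tooling:

import Mathlib

-- Hypothetical helper (not part of LeanGeo): the standard unbiased pass@k
-- estimator for one problem, 1 - C(n - c, k) / C(n, k), where n proofs were
-- sampled and c of them verified.
def passAtK (n c k : Nat) : Float :=
  1 - ((n - c).choose k).toFloat / (n.choose k).toFloat

#eval passAtK 4 1 1  -- 0.25: one verified proof among four samples at k = 1
#eval passAtK 4 1 4  -- 1.0: the problem is solved within four samples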

These results validate both the approachable and the demanding ends of LeanGeo-Bench’s difficulty spectrum and illuminate current limitations in LLM-driven formal geometric reasoning.

5. Soundness, Automation, and Benchmark Extension

Efforts to strengthen LeanGeo’s trustworthiness focus on internalizing external SMT solver certificates into the Lean kernel, enforcing complete end-to-end verification. The current reliance on CVC5 output automates trivial proof steps, but full formal replay within Lean is targeted for future development. To bolster performance on complex or large-scale geometric tasks, planned enhancements include embedding domain-specific decision procedures, such as the Area Method, for geometric deduction, and improving synthetic data generation, since only 14% of generated theorem–proof pairs currently pass formal verification at the initial stage.
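
For orientation, the Area Method decides many such statements by rewriting them as identities over signed areas. The sketch below shows the underlying signed-area quantity over bare coordinate pairs; it is a textbook illustration, not LeanGeo code:

-- Illustrative only: the signed-area quantity underlying the Area Method,
-- computed over plain coordinate pairs. Float is used for brevity; the
-- method itself works over exact arithmetic.
def signedArea (A B C : Float × Float) : Float :=
  ((B.1 - A.1) * (C.2 - A.2) - (C.1 - A.1) * (B.2 - A.2)) / 2

-- Three points are collinear exactly when their signed area vanishes
-- (exact in rational arithmetic; approximate with Float).
def collinear (A B C : Float × Float) : Bool :=
  signedArea A B C == 0

#eval signedArea (0, 0) (1, 0) (0, 1)  -- 0.5 for a unit right triangle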

A proposed reinforcement learning approach aims to teach LLMs effective theorem selection by presenting randomly sampled subsets of the library and gradually refining how external knowledge is integrated. This is intended to counter the performance loss caused by excessively long prompts and to improve cold-start capability.

6. Open Source Resources and Usage Guidelines

The complete LeanGeo theorem library and LeanGeo-Bench benchmark are open-sourced at https://github.com/project-numina/LeanGeo/tree/master. To use the framework, researchers can clone the repository, initialize the LeanGeo library in Lean 4, examine the supplied proof scripts (including the IMO examples), use the benchmark problems for model training or evaluation, and consult the documentation for extending the system or integrating improved SMT features. The repository’s public availability encourages contributions and collaborative research in formal geometry problem solving.
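
As one plausible setup for downstream projects, the library could be declared as a Lake dependency. The require-from-git form below is standard Lake syntax, but the package name leangeo is an assumption; check the repository’s own lakefile for the canonical name:

-- Hypothetical lakefile.lean fragment for a project depending on LeanGeo.
-- The package name `leangeo` is an assumption; the require-from-git form
-- is standard Lake syntax.
import Lake
open Lake DSL

package «my_geo_project»

require leangeo from git
  "https://github.com/project-numina/LeanGeo" @ "master"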

7. Significance in Automated Geometric Reasoning

LeanGeo-Bench provides a rigorous, standardized methodology for evaluating automated geometric reasoning systems at the level of international competitions. By combining a formally reviewed, multi-source benchmark with a human-readable tactic framework and explicit handling of geometric configurations, it supports progress in robust geometric proof automation, benchmarking symbolic systems and LLMs alike. Its application reveals current state-of-the-art performance bounds and highlights key areas (formal soundness, theorem selection, synthetic data generation) for targeted research in AI-assisted formal mathematics.

These qualities suggest that LeanGeo-Bench will remain pivotal in driving the integration of formal methods with advanced model-based reasoning, setting the bar for competition-grade geometric problem solving.