Gaokao-Formal Benchmark: Autoformalization & Proof
- The Gaokao-Formal benchmark is a curated collection of math problems from Gaokao exams, each paired with a human-verified Lean 4 formalization and an English translation.
- It spans diverse domains such as functions, sequences, inequalities, and geometry, enabling rigorous evaluation of both autoformalization and end-to-end proof synthesis.
- The benchmark employs compiler-based and semantic metrics (Lean Check and LeanScorer) to standardize evaluation, setting a new challenge for formal AI research in theorem proving.
The Gaokao-Formal benchmark is a curated suite of proof-oriented mathematical problems drawn from the Chinese National College Entrance Examination (Gaokao), systematically constructed to evaluate and advance the state of natural-language–to–formal theorem proving (Xuejun et al., 8 Jun 2025). It uniquely addresses the “informal-to-formal” gap by providing parallel natural language, English translation, and Lean 4 formal statements for each item, enabling rigorous testing of both autoformalization and end-to-end formal reasoning pipelines. Encompassing a diverse mix of mathematical domains—functions, sequences, inequalities, analytic and synthetic geometry, and combinatorics—Gaokao-Formal represents an unprecedented real-world challenge set for formal AI research.
1. Benchmark Construction and Design
Data Collection and Curation
The foundation of Gaokao-Formal is the systematic extraction of proof-style mathematics problems from every year’s national Gaokao (2008–2024), with inclusion criteria targeting all items that require a formal-style proof. Geometry and combinatorics problems are explicitly retained, avoiding domain bias inherent in narrower symbolic logic sets.
Each problem is:
- Translated from the original Chinese into English and checked by human annotators.
- Paired with a human-verified formalization in Lean 4, with careful attention to encoding all hypotheses and goals.
This dual-lane annotation results in a parallel corpus suitable for both translation-to-formal modeling and downstream proof synthesis.
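The parallel structure can be pictured as a simple record; the Lean sketch below is illustrative only, and the field names are assumptions for exposition rather than the benchmark's released schema.

```lean
/-- Illustrative sketch of one parallel benchmark record; field names are
    assumptions for exposition, not the benchmark's released schema. -/
structure GaokaoFormalItem where
  year          : Nat     -- Gaokao year (2008–2024)
  domain        : String  -- e.g., "Sequences & Series"
  statementZh   : String  -- original Chinese problem statement
  statementEn   : String  -- human-checked English translation
  leanStatement : String  -- human-verified Lean 4 theorem ending in `by sorry`
```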
Problem Coverage and Structure
Table 1 presents the domain distribution, reflecting Gaokao’s diversity and emphasis on functional and sequence reasoning:
| Domain | # Problems |
|---|---|
| Functions | 167 |
| Sequences & Series | 150 |
| Inequality | 28 |
| Trigonometry | 22 |
| Analytic Geometry | 71 |
| Probability & Combinatorics | 4 |
| Comprehensive | 46 |
| Total | 488 |
Problems typically run 60–100 English words and often contain multi-part subquestions. The “comprehensive” category aggregates multi-domain questions or those introducing new definitions beyond routine syllabi.
Input–Output Format
Each instance provides:
- Natural-language statement (in Chinese and English).
- Corresponding Lean 4 theorem declaration (hypotheses encoded as variables and assumptions; “by sorry” as the placeholder proof).
- Example (translated and formalized):
Natural Language:
“Let $m$ be a positive integer, and let $a_1, a_2, \ldots, a_{4m+2}$ be an arithmetic sequence with nonzero common difference. If two terms $a_i$ and $a_j$ can be removed so that the remaining $4m$ terms can be partitioned into $m$ blocks of size $4$, each of which is arithmetic, the sequence is called $(i, j)$-separable. Prove that for $m \ge 3$, the original sequence is $(2, 13)$-separable.”
Lean 4 Formalization (excerpt):
```lean
theorem gaokaoformal_g4 (m : ℕ) (hm : 1 ≤ m) (a : ℕ → ℝ)
    (ha : ∃ d ≠ 0, ∀ n, a (n + 1) = a n + d)
    (sep : ℕ × ℕ → Prop) (h_sep : ...) :
    m ≥ 3 → sep (2, 13) := by sorry
```
2. Dataset Characteristics and Domain Analysis
The dataset’s multi-domain nature reflects the Gaokao’s mathematical scope. The largest categories are Functions (167) and Sequences/Series (150), aligning with the exam’s curriculum. Analytic geometry is substantially represented (71 items), and the comprehensive category is used for multi-topic or novel-definition questions.
No explicit per-question “hardness” labels are assigned; however, combinatorics and geometry items empirically exhibit greater difficulty, often lacking direct lemma support in mathlib.
The relatively small incidence of combinatorics (4 items) and probability reflects topical weighting in Gaokao rather than curation bias. Subquestion structure (2–4 parts) mirrors the format encountered by human candidates.
3. Evaluation Protocols and Metrics
Formalization and Proof Generation
For each item, evaluation proceeds in two primary phases:
- Autoformalization: Natural-language to Lean 4 formal statement. Correctness is assessed both syntactically and semantically.
- End-to-end Theorem Proving: Generating a Lean 4 tactic script that fills the “by sorry” placeholder with a valid, machine-checkable proof of the formalized statement (a toy illustration follows this list).
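As a toy illustration of the two phases (a hypothetical example, not a benchmark item; theorem names are invented), the prover’s job is to replace the placeholder produced by autoformalization with a checkable tactic proof:

```lean
import Mathlib

-- Phase 1 (autoformalization) output: a Lean 4 statement whose proof is left
-- as a `sorry` placeholder.
theorem toy_statement (x : ℝ) (hx : 0 < x) : 0 < x ^ 2 := by
  sorry

-- Phase 2 (end-to-end proving) output: the same statement with the placeholder
-- replaced by a machine-checkable proof.
theorem toy_proved (x : ℝ) (hx : 0 < x) : 0 < x ^ 2 := by
  exact pow_pos hx 2
```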
Quantitative Metrics
- Lean Check (LC): Fraction of model outputs that parse and type-check.
- LeanScorer Semantic Check (LSC): A fuzzy-integral-based score in $[0, 1]$ with a pass threshold, used to filter out outputs that are structurally valid but not semantically faithful to the original problem.
- pass@k: For $n$ sampled outputs of which $c$ are correct (by LC+LSC), the probability that at least one of $k$ draws is correct, estimated as $\text{pass@}k = 1 - \binom{n-c}{k} / \binom{n}{k}$. Reported with $k = 6$ for autoformalization and $k = 32$ for end-to-end proving (a computational sketch follows this list).
- End-to-End Accuracy: Fraction of problems fully proved within a fixed attempt budget.
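Assuming the standard unbiased pass@k estimator given above, a minimal computational sketch in Lean 4 (the function name and structure are illustrative, not part of the benchmark tooling):

```lean
-- Standard pass@k estimator: with n samples of which c are correct,
-- pass@k = 1 - C(n - c, k) / C(n, k), computed as a telescoping product so
-- that no large binomial coefficients are materialized.
def passAtK (n c k : Nat) : Float :=
  if n - c < k then
    1.0  -- every size-k draw must contain at least one correct sample
  else
    1.0 - (List.range k).foldl
      (fun acc i => acc * (Float.ofNat (n - c - i) / Float.ofNat (n - i))) 1.0

-- Example: 32 samples, 4 of them correct.
#eval passAtK 32 4 32  -- 1.0, since k = n and at least one sample is correct
#eval passAtK 32 4 1   -- 0.125, the per-sample success rate
```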
4. Experimental Results and Error Analysis
Autoformalizer Performance
The Mathesis-Autoformalizer achieves substantial gains in both syntactic and semantic success rates, especially when augmented with Hierarchical Preference Optimization (HPO):
| Model | LC@6 | LC+LSC@6 |
|---|---|---|
| Kimina-7B | 91% | 49% |
| Mathesis-Autoformalizer | 98% | 67% |
| Mathesis-Autoformalizer + HPO | 98% | 71% |
This represents a 22-point absolute improvement in semantic pass rate and a 45% reduction in formalization errors relative to prior best.
End-to-End Proof Synthesis
Combining autoformalization with proof search yields the following pass@32 rates on Gaokao-Formal:
| Prover / Autoformalizer | pass@32 |
|---|---|
| DeepSeek-V2-7B + Kimina | 11.2% |
| DeepSeek-V2-7B + Mathesis-HPO | 16.8% |
| Mathesis-Prover-7B + Mathesis-HPO | 18.0% |
The full Mathesis pipeline improves accuracy from 11.2% to 18.0%, a 60% relative increase, demonstrating the impact of advanced autoformalization.
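As a quick arithmetic check of the relative figure: $(18.0 - 11.2)/11.2 \approx 0.61$, consistent with the reported gain of roughly 60%.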
Failure Modes
Characteristic errors include (two of these are sketched in Lean after this list):
- Omitted quantifier scoping (e.g., missing range specifiers in sequence problems).
- Goal restatement or assumption leakage (circular formalization).
- Incorrect index management in sums/products.
- “Trivial” formalizations (e.g., a statement that collapses to “True := by sorry”) that satisfy the type-checker but are mathematically vacuous.
- Geometry proofs failing due to incorrect or ambiguous correspondence between elements.
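Two of these modes can be made concrete with hypothetical mini-examples (not benchmark items; names and numbers are invented for illustration):

```lean
import Mathlib

-- (1) Omitted quantifier scoping: the informal statement constrains n ≥ 1, but
-- the generated formalization drops the range specifier, so the theorem no
-- longer says what the problem asks.
theorem scoped_faithfully (a : ℕ → ℝ) (h : ∀ n ≥ 1, a (n + 1) = a n + 2) :
    ∀ n ≥ 1, a (n + 1) > a n := by
  intro n hn
  have := h n hn
  linarith

theorem scope_dropped (a : ℕ → ℝ) (h : ∀ n, a (n + 1) = a n + 2) :
    ∀ n, a (n + 1) > a n := by
  intro n
  have := h n
  linarith

-- (2) A vacuous formalization: a goal that collapses to `True` type-checks
-- regardless of the mathematics, which is why Lean Check alone is insufficient
-- and LeanScorer's semantic check is layered on top.
theorem vacuous_goal : True := True.intro
```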
Geometry and “comprehensive” problems are especially challenging; a plausible implication is that further domain-specific language modeling is required.
5. Comparative Context and Limitations
Gaokao-Formal’s contribution is unique among benchmarks. Unlike AGIEval, GAOKAO-Bench, or GAOKAO-Eval, which focus on natural language question answering or multiple-choice/short-answer alignment, Gaokao-Formal tests full-spectrum mathematical formalization and proof generation. Its natural–formal alignment and semantic-check metrics explicitly evaluate both linguistic and logical aspects of AI mathematicians.
However, the set size (488 problems) is relatively small by deep-learning standards. Coverage is skewed toward certain topics (notably few combinatorics/probability items), and no explicit difficulty gradation is provided. Only Lean 4 is used; the benchmark does not capture cross-system robustness.
6. Implications and Future Directions
Gaokao-Formal establishes an end-to-end challenge for natural-language–to–formal-theorem-proving pipelines, substantiating the critical role of precise formalization as a bottleneck in automated reasoning. The marked gains from Mathesis’s reinforcement-learning–driven autoformalizer suggest that further learning-centered improvements in informal-to-formal methods could substantially elevate downstream proof synthesis.
Key extensions include:
- Expanding the benchmark to increase problem count and diversity (in particular, combinatorics, probability, and diagrammatic geometry).
- Annotating sub-question difficulty to support curriculum or adaptive learning research.
- Exploring multi-lingual and multimodal extensions, potentially leveraging diagrammatic and interactive inputs.
- Broadening proof assistant coverage (e.g., Isabelle/HOL, Coq) to allow comparative assessment.
Gaokao-Formal complements the landscape of real-world, high-stakes mathematical benchmarks, providing a critical testbed for evaluating and advancing the capabilities of LLMs and automated theorem provers in mathematical reasoning under formal semantics.