
Gaokao-Formal Benchmark: Autoformalization & Proof

Updated 15 November 2025
  • Gaokao-Formal benchmark is a curated collection of math problems from Gaokao exams paired with human-verified Lean 4 formalizations and English translations.
  • It spans diverse domains such as functions, sequences, inequalities, and geometry, enabling rigorous evaluation of both autoformalization and end-to-end proof synthesis.
  • The benchmark employs semantic metrics like Lean Check and LeanScorer to standardize evaluation, setting a new challenge for formal AI research in theorem proving.

The Gaokao-Formal benchmark is a curated suite of proof-oriented mathematical problems drawn from the Chinese National College Entrance Examination (Gaokao), systematically constructed to evaluate and advance the state of natural-language–to–formal theorem proving (Xuejun et al., 8 Jun 2025). It uniquely addresses the “informal-to-formal” gap by providing parallel natural language, English translation, and Lean 4 formal statements for each item, enabling rigorous testing of both autoformalization and end-to-end formal reasoning pipelines. Encompassing a diverse mix of mathematical domains—functions, sequences, inequalities, analytic and synthetic geometry, and combinatorics—Gaokao-Formal represents an unprecedented real-world challenge set for formal AI research.

1. Benchmark Construction and Design

Data Collection and Curation

The foundation of Gaokao-Formal is the systematic extraction of proof-style mathematics problems from every year’s national Gaokao (2008–2024), with inclusion criteria targeting all items that require a formal-style proof. Geometry and combinatorics problems are explicitly retained, avoiding domain bias inherent in narrower symbolic logic sets.

Each problem is:

  • Translated from native Chinese to English, checked by human annotators.
  • Paired with a human-verified formalization in Lean 4, with careful attention to encoding all hypotheses and goals.

This dual-lane annotation results in a parallel corpus suitable for both translation-to-formal modeling and downstream proof synthesis.
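
As an illustration of how one such parallel record might be organized, the sketch below uses hypothetical field names; the benchmark's actual release format may differ.

```python
# Hypothetical layout of a single Gaokao-Formal record; field names are illustrative only.
record = {
    "id": "gaokaoformal_g4",           # problem identifier (taken from the example further below)
    "domain": "Sequences & Series",    # one of the benchmark's domain labels
    "statement_zh": "...",             # original Chinese problem statement
    "statement_en": "...",             # human-checked English translation
    # Human-verified Lean 4 theorem statement; the proof is left as `sorry`
    # so that autoformalizers can be scored against it and provers must fill it in.
    "formal_statement": "theorem gaokaoformal_g4 ... := by sorry",
}
```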

Problem Coverage and Structure

Table 1 presents the domain distribution, reflecting Gaokao’s diversity and emphasis on functional and sequence reasoning:

| Domain | # Problems |
|---|---|
| Functions | 167 |
| Sequences & Series | 150 |
| Inequality | 28 |
| Trigonometry | 22 |
| Analytic Geometry | 71 |
| Probability & Combinatorics | 4 |
| Comprehensive | 46 |
| Total | 488 |

Problems typically run 60–100 English words and often contain multi-part subquestions. The “comprehensive” category aggregates multi-domain questions or those introducing new definitions beyond the routine syllabus.

Input–Output Format

Each instance provides:

  • Natural-language statement (in Chinese and English).
  • Corresponding Lean 4 theorem declaration (hypotheses encoded as variables and assumptions; “by sorry” as the placeholder proof).
  • Example (translated and formalized):

    Natural Language:

    “Let $m$ be a positive integer, and let $a_1,\dots,a_{4m+2}$ be an arithmetic sequence with nonzero common difference. Call the sequence $(i,j)$-separable if removing the two terms $a_i, a_j$ allows the remaining $4m$ terms to be partitioned into $m$ blocks of size $4$, each of which is arithmetic. Prove that for $m \ge 3$ the original sequence is $(2,13)$-separable.”

    Lean 4 Formalization (excerpt):

theorem gaokaoformal_g4 (m : ℕ) (hm : 1 ≤ m) (a : ℕ → ℝ)
  (ha : ∃ d ≠ 0, ∀ n, a (n + 1) = a n + d)
  (sep : ℕ × ℕ → Prop) (h_sep : ...) :
  m ≥ 3 → sep (2, 13) := by sorry

2. Dataset Characteristics and Domain Analysis

The dataset’s multi-domain nature reflects the Gaokao’s mathematical scope. The largest categories are Functions (167) and Sequences & Series (150), aligning with the exam’s curriculum. Analytic geometry is substantially represented (71 problems), and the comprehensive category is used for multi-topic or novel-definition questions.

No explicit per-question “hardness” labels are assigned; however, combinatorics and geometry items empirically exhibit greater difficulty, often lacking direct lemma support in mathlib.

The small number of probability and combinatorics items (4) reflects their topical weighting in the Gaokao rather than curation bias. Subquestion structure (2–4 parts) mirrors the format encountered by human candidates.

3. Evaluation Protocols and Metrics

Formalization and Proof Generation

For each item, evaluation proceeds in two primary phases:

  • Autoformalization: Natural-language to Lean 4 formal statement. Correctness is assessed both syntactically and semantically.
  • End-to-end Theorem Proving: Generating a Lean 4 tactic script that fills the “by sorry” placeholder with a valid, machine-checkable proof of the formalized statement.
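
As a rough sketch of the syntactic side of this evaluation, the snippet below type-checks a candidate Lean 4 source inside a mathlib-backed Lake project and, for the proving phase, rejects outputs that still contain `sorry`. The project directory, scratch file name, and timeout are illustrative assumptions; the paper's actual harness may differ.

```python
import subprocess
from pathlib import Path

def lean_check(source: str, workdir: str, allow_sorry: bool) -> bool:
    """Return True if `source` elaborates as Lean 4 code inside a Lake project.

    For autoformalization the statement legitimately ends in `by sorry`
    (allow_sorry=True); for end-to-end proving, any remaining `sorry` is a failure.
    Assumes `workdir` is a Lake project whose dependencies (e.g. mathlib) are built.
    """
    if not allow_sorry and "sorry" in source:
        return False
    path = Path(workdir) / "Candidate.lean"          # hypothetical scratch file
    path.write_text(source, encoding="utf-8")
    try:
        result = subprocess.run(
            ["lake", "env", "lean", path.name],      # type-check with the project's environment
            cwd=workdir, capture_output=True, text=True, timeout=300,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0                    # non-zero exit signals a parse/type error
```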

Quantitative Metrics

  • Lean Check (LC): fraction of model outputs that parse and type-check in Lean 4.
  • LeanScorer Semantic Check (LSC): a fuzzy-integral-based score in $[0, 1]$ with an $\alpha = 0.6$ pass mark, filtering outputs that are structurally valid but not semantically faithful.
  • pass@k: for $n$ sampled outputs of which $c$ are correct (pass both LC and LSC), the probability that at least one of $k$ draws is correct (a worked computation follows this list):

$$\text{pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$

Reported for $k = 1, 6$ (autoformalization) and $k = 32$ (end-to-end proving).

  • End-to-End Accuracy: Fraction of problems fully proved within a fixed attempt budget.
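
The pass@k estimator can be computed directly from $n$, $c$, and $k$; the sketch below is a standard implementation of this formula (with $c$ counted as the number of samples passing both LC and the $\alpha = 0.6$ LeanScorer threshold), not code released with the benchmark.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k draws
    from n samples (of which c are correct) is correct."""
    if n - c < k:   # fewer than k incorrect samples, so any k-subset contains a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 32 sampled proofs for one problem, 3 verified correct.
print(round(pass_at_k(32, 3, 1), 3))    # 0.094
print(round(pass_at_k(32, 3, 32), 3))   # 1.0
```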

4. Experimental Results and Error Analysis

Autoformalizer Performance

The Mathesis-Autoformalizer achieves substantial gains in both syntactic and semantic success rates, especially when augmented with Hierarchical Preference Optimization (HPO):

| Model | LC@6 | LC+LSC@6 |
|---|---|---|
| Kimina-7B | 91% | 49% |
| Mathesis-Autoformalizer | 98% | 67% |
| Mathesis-Autoformalizer + HPO | 98% | 71% |

This represents a 22-point absolute improvement in semantic pass rate and a 45% reduction in formalization errors relative to prior best.

End-to-End Proof Synthesis

Parallel application of autoformalization and proof-search yields the following pass@32 rates on Gaokao-Formal:

| Prover / Autoformalizer | pass@32 |
|---|---|
| DeepSeek-Prover-V2-7B + Kimina | 11.2% |
| DeepSeek-Prover-V2-7B + Mathesis-HPO | 16.8% |
| Mathesis-Prover-7B + Mathesis-HPO | 18.0% |

The full Mathesis pipeline improves accuracy from 11.2% to 18.0%, a 60% relative increase, demonstrating the impact of advanced autoformalization.

Failure Modes

Characteristic errors include:

  • Omitted quantifier scoping (e.g., missing range specifiers in sequence problems).
  • Goal restatement or assumption leakage (circular formalization).
  • Incorrect index management in sums/products.
  • Vacuous formalizations whose goal degenerates (e.g., to True), so that a trivial tactic satisfies the type-checker without proving anything about the original problem (see the toy examples after this list).
  • Geometry proofs failing due to incorrect or ambiguous correspondence between objects in the problem text and their formal encodings.
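
Two of these failure modes are easy to illustrate with toy Lean 4 statements (not benchmark items); both snippets below elaborate cleanly, yet neither encodes any content from a source problem.

```lean
import Mathlib.Data.Real.Basic

-- Assumption leakage: the hypothesis restates the goal, so the "proof" is circular.
theorem leaked_goal (a : ℕ → ℝ) (h : ∀ n, a n ≤ a (n + 1)) :
    ∀ n, a n ≤ a (n + 1) := by
  exact h

-- Vacuous formalization: the goal degenerates to `True`, so `trivial` closes it
-- without capturing the intended mathematical statement.
theorem vacuous_goal (_m : ℕ) (_hm : 1 ≤ _m) : True := by
  trivial
```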

Geometry and “comprehensive” problems are especially challenging; a plausible implication is that further domain-specific language modeling is required.

5. Comparative Context and Limitations

Gaokao-Formal’s contribution is unique among benchmarks. Unlike AGIEval, GAOKAO-Bench, or GAOKAO-Eval, which focus on natural language question answering or multiple-choice/short-answer alignment, Gaokao-Formal tests full-spectrum mathematical formalization and proof generation. Its natural–formal alignment and semantic-check metrics explicitly evaluate both linguistic and logical aspects of AI mathematicians.

However, the set size—488 problems—is relatively small for deep learning scale. Coverage is skewed toward certain topics (notably few combinatorics/probability items), and no explicit difficulty gradation is provided. Only Lean 4 is used; the benchmark does not capture cross-system robustness.

6. Implications and Future Directions

Gaokao-Formal establishes an end-to-end challenge for natural-language–to–formal-theorem-proving pipelines, substantiating the critical role of precise formalization as a bottleneck in automated reasoning. The marked gains from Mathesis’s reinforcement-learning–driven autoformalizer suggest that further learning-centered improvements in informal-to-formal methods could substantially elevate downstream proof synthesis.

Key extensions include:

  • Expanding the benchmark to increase problem count and diversity (in particular, combinatorics, probability, and diagrammatic geometry).
  • Annotating sub-question difficulty to support curriculum or adaptive learning research.
  • Exploring multi-lingual and multimodal extensions, potentially leveraging diagrammatic and interactive inputs.
  • Broadening proof assistant coverage (e.g., Isabelle/HOL, Coq) to allow comparative assessment.

Gaokao-Formal complements the landscape of real-world, high-stakes mathematical benchmarks, providing a critical testbed for evaluating and advancing the capabilities of LLMs and automated theorem provers in mathematical reasoning under formal semantics.

References

1. Xuejun et al. (8 Jun 2025). Mathesis: Towards Formal Theorem Proving from Natural Languages.
