FormalMATH Benchmark

Updated 15 November 2025
  • FormalMATH Benchmark is a comprehensive Lean 4 evaluation suite that rigorously tests automated and AI-assisted theorem proving across high-school and undergraduate mathematics.
  • It integrates natural-language autoformalization, multi-model semantic verification, and human oversight to ensure formal correctness and scalability.
  • The benchmark reveals current AI provers' strengths and limitations, emphasizing domain imbalances and the impact of chain-of-thought strategies on formal reasoning.

FormalMATH Benchmark is a large-scale, Lean 4-based evaluation suite for formal mathematical reasoning, designed to assess and drive advances in automated and AI-assisted theorem proving. It integrates an extensive problem corpus—spanning high-school Olympiad style challenges through undergraduate-level mathematics across six major domains—with a rigorous, partially automated construction and verification pipeline. By combining natural-language autoformalization, multi-model semantic vetting, and systematic human oversight, FormalMATH provides both a robust assessment of current model capabilities and a foundation for future research in formal mathematics and automated reasoning (Yu et al., 5 May 2025).

1. Corpus Scope and Structure

The FormalMATH benchmark consists of 5 560 formally verified mathematical statements, all encoded in Lean 4 and spanning a broad range of sources:

  • Source Difficulty and Origin: 91.5 % of the problems are extracted from high-school Olympiad contests; the remaining 8.5 % are taken from undergraduate-level mathematics. Natural-language sources include standard contest repositories as well as undergraduate textbooks.
  • Domain Coverage: The problems are stratified across several core mathematical domains as follows:

| Domain                | Count | Percent |
|-----------------------|-------|---------|
| Algebra               | 1250  | 22.5%   |
| Number Theory         | 1080  | 19.4%   |
| Discrete Mathematics  | 1000  | 18.0%   |
| Calculus              | 840   | 15.1%   |
| Applied Mathematics   | 780   | 14.0%   |
| Geometry              | 610   | 11.0%   |

  • Examples by Domain: Statements range from complex algebraic inequalities, high-order calculus derivatives, and modular number-theoretic characterizations to discrete graph-theoretic and asymptotic analytic assertions, each given both in natural language and in its corresponding Lean 4 formalization (a schematic example follows below).
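
For concreteness, a statement in the algebra domain might look roughly like the following Lean 4 sketch. This is an illustrative, hypothetical example in the spirit of the corpus, not an item taken from the benchmark:

```lean
import Mathlib

-- Natural-language form (hypothetical): "For all positive reals a and b,
-- a/b + b/a is at least 2."
-- The benchmark ships only the formal statement; provers must supply the proof.
theorem algebra_example (a b : ℝ) (ha : 0 < a) (hb : 0 < b) :
    2 ≤ a / b + b / a := by
  sorry
```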

2. Autoformalization and Verification Pipeline

FormalMATH employs a multi-stage, human-in-the-loop pipeline, leveraging both LLMs and expert humans to maximize semantic fidelity while minimizing annotation overhead.

  • Autoformalization Models: Generalist and code-oriented LLMs (e.g., GPT-4, Qwen2.5-7B-Coder, DeepSeek-prover-base) generate candidate Lean 4 formalizations via best-of-$N$ sampling; only syntactically correct outputs advance to the next stage.
  • Semantic Verification: Each valid Lean statement is passed to multiple LLM-based semantic validators, which back-translate the formalization to natural language using chain-of-thought (CoT) prompting and check for alignment with the original prompt. Only unanimously aligned statements are retained.
  • Negation-Based Disproof Filtering: Remaining statements are tested for vacuity or error via negation: the candidate theorem $T$ is logically negated to $\neg T$ by pushing negations through its quantifiers, and an LLM prover attempts to prove $\neg T$ in Lean 4. If it succeeds, the original $T$ is discarded. This step removed roughly 1.6 % of candidates (see the Lean sketch at the end of this section).
  • Human Verification: 12 expert annotators review the post-filtered statements for semantic correctness. The pipeline preserves 72.09 % of statements at this final step; manual verification cost $6.89 per statement and took 22 days in total.

This pipeline robustly aligns formalizations with their natural-language intent and scales annotation via automation-assisted curation.
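
As a concrete illustration of the negation-based filter, consider a hypothetical mis-formalized candidate and its negation with quantifiers pushed inward (a minimal sketch, not drawn from the benchmark):

```lean
import Mathlib

-- Hypothetical faulty candidate T: "every natural number is even".
-- theorem candidate_T : ∀ n : ℕ, ∃ k : ℕ, n = 2 * k

-- Negation ¬T after pushing negations through the quantifiers:
-- ∃ n, ∀ k, n ≠ 2 * k. Because an automated prover can close this goal
-- (witness n = 1), the original candidate T would be discarded.
theorem negation_of_candidate_T : ∃ n : ℕ, ∀ k : ℕ, n ≠ 2 * k := by
  refine ⟨1, ?_⟩
  intro k h
  omega
```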

3. Benchmark Evaluation Protocol

Evaluation of theorem provers on FormalMATH uses Lean 4's compiler as the definitive checker for formal proof correctness, with provers compared using the Pass@$K$ metric (a computational sketch follows the list below):

$$\mathrm{Pass@}K = \frac{1}{|\mathcal P|}\;\left|\left\{\, p \in \mathcal P \mid \exists\ \text{proof among top } K \text{ model outputs for } p \,\right\}\right|$$

where $\mathcal P$ denotes the set of benchmark problems.

  • Prover Families: Best-First Search (BFS)-based provers (DeepSeek-Prover-V1.5-RL, InternLM-V2.5-Prover, BFS-Prover) and Single-Pass Generation (SPG) provers (Kimina-Prover-7B, STP, DeepSeek-Prover-V1.5-SFT, Goedel-Prover).
  • Typical Evaluation Budgets: e.g., $K = 32$ for SPG methods; a budget of $1 \times 32 \times 100$ denotes 1 BFS run, 32 tactics per expansion, and 100 expansions.
  • Domain Breakdown: Provers are also assessed across the benchmark's domain partitions.
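
A minimal sketch of how this metric can be computed from per-problem Lean compiler verdicts is shown below. The function name and data layout are illustrative assumptions, not the benchmark's official evaluation harness:

```python
from typing import Dict, List

def pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of problems with at least one Lean-verified proof
    among the top-k sampled outputs for that problem.

    `results` maps a problem id to the compiler verdicts (True = proof
    accepted by the Lean 4 checker) of its sampled attempts, in the
    order they were generated.
    """
    solved = sum(1 for verdicts in results.values() if any(verdicts[:k]))
    return solved / len(results)

# Hypothetical usage with three problems and K = 2:
verdicts = {
    "algebra_001": [False, True, False],
    "calculus_017": [False, False, False],
    "numtheory_042": [True, False, True],
}
print(pass_at_k(verdicts, k=2))  # 2 of 3 problems solved -> 0.666...
```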

4. Quantitative Results and Domain Analysis

Current LLM-based provers show limited success on the FormalMATH benchmark, with pronounced domain bias and success plateauing even under large sampling budgets:

| Method           | Budget                   | Pass@$K$ (%) |
|------------------|--------------------------|--------------|
| Kimina-Prover-7B | $K = 32$                 | 16.46        |
| BFS-Prover       | $1 \times 32 \times 100$ | 11.13        |

Domain-specific results for a representative method (Goedel-Prover, $K = 32$) display a clear disparity:

| Domain                      | Pass@32 (%) |
|-----------------------------|-------------|
| High-school Algebra         | 17.5        |
| Undergraduate Algebra       | 50.0        |
| Calculus (Differentiation)  | 1.9         |
| Calculus (Integration)      | 0.0         |
| Discrete Mathematics        | 0.0         |
| Number Theory               | 12.3        |
| Applied Mathematics         | 29.4        |

Statistical analyses (two-proportion $z$-tests) confirm a significant gap between easier (algebra, applied) and more challenging (calculus, discrete) domains ($p < 0.01$). This suggests that current provers systematically exploit domain-specific automation but do not generalize robustly to domains with less tactic support.

5. Methodological Insights: Chain-of-Thought and Natural-Language Guidance

A key finding is the non-monotonic effect of human-written natural-language guidance within the chain-of-thought (CoT) paradigm:

  • CoT Approaches: Model prompting protocols include vanilla (no CoT), interleaved Lean tactics with English comments (CoT), and NL-augmented CoT with human informal solutions (sketched schematically after this list).
  • Performance: Under massive sampling ($K = 3200$), pure CoT achieves the highest Pass@$K$ (50.6 %), vanilla reaches 47.0 %, and NL-augmented CoT is slightly lower (49.2 %). The amount of NL guidance correlates negatively with proof success ($\sim -0.85$), indicating that excessive natural-language solution detail introduces confusion (higher model perplexity) in formal reasoning. A plausible implication is that formal proof search and NL guidance have a nontrivial interaction, where over-specification in NL may distract models from tractable formal strategies.
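
The three prompting protocols can be sketched schematically as follows. These are paraphrased templates under the assumption of a simple text-prompt interface; the template names and exact wording are illustrative, not the prompts used in the paper:

```python
# Schematic prompt templates for the three protocols (illustrative paraphrases).

VANILLA = (
    "Complete the following Lean 4 theorem with a valid proof.\n"
    "{formal_statement}\n"
)

COT = (
    "Complete the following Lean 4 theorem. Reason step by step, interleaving "
    "Lean tactics with brief English comments explaining each step.\n"
    "{formal_statement}\n"
)

NL_AUGMENTED_COT = (
    "Complete the following Lean 4 theorem. A human-written informal solution "
    "is provided as guidance.\n"
    "Informal solution: {informal_solution}\n"
    "{formal_statement}\n"
)

# Example instantiation:
prompt = COT.format(
    formal_statement="theorem ex (a b : ℝ) (h : a ≤ b) : a + 1 ≤ b + 1 := by sorry"
)
print(prompt)
```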

6. Systematic Limitations and Research Directions

Analysis of failure cases and pipeline performance identifies several challenges:

  • Low End-to-End Success: Success rates do not exceed 16.5 % at practical budgets, reflecting substantial room for improvement.
  • Severe Domain Imbalance: Algebra and applied-mathematics subdomains are overrepresented among solved problems; calculus and discrete mathematics remain largely intractable for LLM-based provers.
  • Reliance on Automation Tactics: Most successes exploit built-in Lean tactics such as aesop, simp, and linarith, indicating a lack of deep mathematical reasoning ability (illustrated after this list).
  • Proof Brittleness: Single failures in tactic selection or expression typing immediately abort the proof.
  • Test-time Scaling Saturation: Returns diminish beyond moderate sampling due to the above brittleness.
  • Research Proposals: The paper suggests (1) aligning NL CoT with Lean's tactic semantics, (2) richer reward shaping (using intrinsic and curriculum-based signals), (3) hybridizing single-pass generation and guided proof search, (4) balancing training data across domains, and (5) developing automated lemma retrieval for under-served fields, such as calculus.
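
To illustrate the automation-reliance point above: many reported successes reduce to proofs of the following shape, closed by a single general-purpose tactic rather than a structured argument (toy examples, not benchmark items):

```lean
import Mathlib

-- Toy goals dispatched entirely by Lean's built-in automation, illustrating
-- the proof pattern that dominates current provers' successes.
example (a b : ℝ) (h : a ≤ b) : a + 1 ≤ b + 1 := by linarith
example (n : ℕ) : n + 0 = n := by simp
example (p q : Prop) (hp : p) (hpq : p → q) : q := by aesop
```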

7. Context, Impact, and Relationship to Other Benchmarks

FormalMATH situates itself among a new generation of rigorous, large-scale formal mathematics benchmarks. Earlier benchmarks such as miniF2F (Zheng et al., 2021) emphasized cross-system comparability on Olympiad-level problems; FormalMATH, which also incorporates challenging sub-benchmarks such as the OEIS-derived FormalMATH Inductive Subset (Gauthier et al., 2023), goes further: its broader domain coverage and comprehensive construction pipeline provide a foundation for the next stage of automated mathematical reasoning.

Contemporary efforts such as FATE (Jiang et al., 4 Nov 2025) extend formal benchmarking to research-level algebra and commutative algebra, pushing beyond the undergraduate and Olympiad scope by introducing novel definitions and requiring frontier-level abstraction, albeit with smaller problem counts. The FormalMATH construction pipeline, with its automated semantic verification and negation-based filtering, addresses the annotation scalability and semantic robustness required at such scale.

A plausible implication is that continued integration of automated synthesis, multi-model verification, and expert oversight—as exemplified by FormalMATH—will remain necessary to advance both the quality and coverage of formal problem corpora, especially as AI provers are expected to tackle more complex, research-level mathematical reasoning tasks.
