FormalMATH Benchmark
- FormalMATH Benchmark is a comprehensive Lean 4 evaluation suite that rigorously tests automated and AI-assisted theorem proving across high-school and undergraduate mathematics.
- It integrates natural-language autoformalization, multi-model semantic verification, and human oversight to ensure formal correctness and scalability.
- The benchmark reveals current AI provers' strengths and limitations, emphasizing domain imbalances and the impact of chain-of-thought strategies on formal reasoning.
FormalMATH Benchmark is a large-scale, Lean 4-based evaluation suite for formal mathematical reasoning, designed to assess and drive advances in automated and AI-assisted theorem proving. It integrates an extensive problem corpus—spanning high-school Olympiad-style challenges through undergraduate-level mathematics across six major domains—with a rigorous, partially automated construction and verification pipeline. By combining natural-language autoformalization, multi-model semantic vetting, and systematic human oversight, FormalMATH provides both a robust assessment of current model capabilities and a foundation for future research in formal mathematics and automated reasoning (Yu et al., 5 May 2025).
1. Corpus Scope and Structure
The FormalMATH benchmark consists of 5,560 formally verified mathematical statements, all encoded in Lean 4 and spanning a broad range of sources:
- Source Difficulty and Origin: 91.5% of the problems are extracted from high-school Olympiad contests; the remaining 8.5% are taken from undergraduate-level mathematics. Natural-language sources include standard contest repositories as well as undergraduate textbooks.
- Domain Coverage: The problems are stratified across several core mathematical domains as follows:
| Domain | Count | Percent |
|-----------------------|-------|---------|
| Algebra | 1250 | 22.5% |
| Number Theory | 1080 | 19.4% |
| Discrete Mathematics | 1000 | 18.0% |
| Calculus | 840 | 15.1% |
| Applied Mathematics | 780 | 14.0% |
| Geometry | 610 | 11.0% |
- Examples by Domain: Statements range from complex algebraic inequalities, high-order calculus derivatives, and modular number-theoretic characterizations to discrete graph-theoretic and asymptotic analytic assertions, each given both in natural language and in its corresponding Lean 4 formalization (an illustrative sketch follows below).
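For concreteness, the following is a minimal sketch in the style of such statement pairs. It is a hypothetical example written for illustration (assuming a standard Mathlib import), not a statement taken from FormalMATH; in the benchmark itself only the statement, not the proof, is provided to the prover.

```lean
import Mathlib

-- Natural-language form (hypothetical example in the benchmark's style):
-- "For every natural number n, the number n² + n is even."
theorem even_sq_add_self (n : ℕ) : Even (n ^ 2 + n) := by
  -- Rewrite n² + n as n · (n + 1), the product of consecutive naturals.
  have h : n ^ 2 + n = n * (n + 1) := by ring
  rw [h]
  -- A product of consecutive naturals is always even.
  exact Nat.even_mul_succ_self n
```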
2. Autoformalization and Verification Pipeline
FormalMATH employs a multi-stage, human-in-the-loop pipeline, leveraging both LLMs and expert humans to maximize semantic fidelity while minimizing annotation overhead.
- Autoformalization Models: Generalist and code-oriented LLMs (e.g., GPT-4, Qwen2.5-7B-Coder, DeepSeek-prover-base) generate candidate Lean 4 formalizations via best-of-$N$ sampling; only outputs that compile as syntactically valid Lean 4 advance to the next stage.
- Semantic Verification: Each valid Lean statement is passed to multiple LLM-based semantic validators, which back-translate the formalization to natural language using chain-of-thought (CoT) prompting and check for alignment with the original prompt. Only unanimously aligned statements are retained.
- Negation-Based Disproof Filtering: Remaining statements are tested for vacuity or error via negation: the candidate theorem is logically negated by pushing negations through its quantifiers, and an LLM-based prover attempts to prove the negated statement in Lean 4. If it succeeds, the original is discarded. This step removed roughly 1.6% of candidates (a short Lean 4 sketch of this step appears just before Section 6).
- Human Verification: 12 expert annotators review the post-filtered statements for semantic correctness. The pipeline preserves 72.09% of statements at this final step, with manual verification costing \$6.89 per statement and taking 22 days in total.

This pipeline robustly aligns formalizations with their natural-language intent and scales annotation via automation-assisted curation.

3. Benchmark Evaluation Protocol

Evaluation of theorem provers on FormalMATH uses Lean 4's compiler as the definitive checker of formal proof correctness, and provers are compared using the Pass@$K$ metric:

$$\mathrm{Pass@}K \;=\; \frac{1}{|\mathcal{P}|}\,\Bigl|\bigl\{\, p \in \mathcal{P} \;\bigm|\; \exists\ \text{a valid proof among the top } K \text{ model outputs for } p \,\bigr\}\Bigr|,$$

where $\mathcal{P}$ denotes the set of benchmark problems.

- Sampling Budgets: Whole-proof generation methods are evaluated at $K = 32$, with larger budgets (up to $K = 3200$) used to study test-time scaling; for tree-search methods, a budget written as $1 \times 32 \times 100$ denotes 1 best-first-search (BFS) run, 32 tactics per expansion, and 100 expansions.
- Domain Breakdown: Provers are also assessed across the benchmark's domain partitions.

4. Quantitative Results and Domain Analysis

Current LLM-based provers show limited success on FormalMATH, with pronounced domain bias and success plateauing even under large sampling budgets. Reported results compare each method by sampling budget (Pass@32, the $1 \times 32 \times 100$ tree-search budget, and Pass@3200) and by per-domain success rate; the best end-to-end results remain modest, and increasing the sampling budget yields diminishing returns.

5. Chain-of-Thought and Natural-Language Guidance

Ablations on chain-of-thought prompting and natural-language solution guidance find a strong negative correlation (approximately $-0.85$) between the detail of natural-language guidance and proof success, indicating that excessive natural-language solution detail introduces confusion (higher model perplexity) in formal reasoning. A plausible implication is that formal proof search and NL guidance have a nontrivial interaction, where over-specification in NL may distract models from tractable formal strategies.
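As a concrete illustration of the negation-based disproof filtering step described in Section 2, the following minimal Lean 4 sketch (assuming a Mathlib import) uses a deliberately flawed toy statement, not one drawn from the benchmark: its negation, pushed through the quantifier, is trivially provable, so the candidate would be discarded.

```lean
import Mathlib

-- Toy example of a flawed candidate formalization (not from the benchmark):
--   ∀ n : ℕ, n + 1 < n
-- The filter negates the statement, pushes the negation through the
-- quantifier (here via `push_neg`), and asks a prover to close the
-- negated goal; since that succeeds, the original candidate is discarded.
example : ¬ ∀ n : ℕ, n + 1 < n := by
  push_neg            -- goal becomes: ∃ n, n ≤ n + 1
  exact ⟨0, by omega⟩
```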
6. Systematic Limitations and Research Directions
Analysis of failure cases and pipeline performance identifies several challenges:
- Low End-to-End Success: Success rates do not exceed 16.5% at practical budgets, reflecting substantial room for improvement.
- Severe Domain Imbalance: Algebra and applied-mathematics subdomains are overrepresented among solved problems; calculus and discrete mathematics remain largely intractable for LLM-based provers.
- Reliance on Automation Tactics: Most successes exploit built-in Lean tactics such as `aesop`, `simp`, and `linarith`, indicating a lack of deep mathematical reasoning ability (see the Lean sketch after this list).
- Proof Brittleness: A single failure in tactic selection or expression typing immediately aborts the proof.
- Test-Time Scaling Saturation: Returns diminish beyond moderate sampling budgets, largely due to the brittleness described above.
- Research Proposals: The paper suggests (1) aligning NL CoT with Lean's tactic semantics, (2) richer reward shaping (using intrinsic and curriculum-based signals), (3) hybridizing single-pass generation and guided proof search, (4) balancing training data across domains, and (5) developing automated lemma retrieval for under-served fields, such as calculus.
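To illustrate the reliance on built-in automation noted above, the following minimal Lean 4 sketch (toy goals, not benchmark problems, assuming a standard Mathlib import) shows the kind of routine goal that `linarith`, `simp`, and `aesop` close in a single step; successes of this form do not require multi-step mathematical reasoning.

```lean
import Mathlib

-- Linear arithmetic over the reals: discharged by `linarith`.
example (x y : ℝ) (hx : 0 ≤ x) (hxy : x ≤ y) : 0 ≤ 2 * y := by linarith

-- Routine simplification: discharged by `simp`.
example (n : ℕ) : n + 0 = n := by simp

-- Propositional goal: discharged by `aesop`.
example (p q : Prop) (hp : p) (hq : q) : p ∧ q := by aesop
```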
7. Context, Impact, and Relationship to Other Benchmarks
FormalMATH situates itself among a new generation of rigorous, large-scale formal mathematics benchmarks. While earlier benchmarks (e.g., miniF2F (Zheng et al., 2021)) emphasized cross-system comparability on Olympiad-level problems, FormalMATH incorporates challenging sub-benchmarks such as the OEIS-derived FormalMATH Inductive Subset (Gauthier et al., 2023), and its broader domain coverage and comprehensive pipeline provide a foundation for the next stage of automated mathematical reasoning.
Contemporary efforts such as FATE (Jiang et al., 4 Nov 2025) extend to research-level algebra and commutative algebra, pushing beyond the undergraduate and Olympiad scope by introducing novel definitions and requiring frontier-level abstraction, but with smaller problem counts. The FormalMATH construction pipeline, with its use of automated semantic verification and negation-based filtering, addresses the annotation scalability and semantic robustness required for such scale.
A plausible implication is that continued integration of automated synthesis, multi-model verification, and expert oversight—as exemplified by FormalMATH—will remain necessary to advance both the quality and coverage of formal problem corpora, especially as AI provers are expected to tackle more complex, research-level mathematical reasoning tasks.