
ExamFormal-Bench: Formal Reasoning Benchmark

Updated 24 November 2025
  • ExamFormal-Bench is a benchmark suite that rigorously evaluates LLMs and theorem provers on translating informal math problems into formal Lean 4 theorems.
  • It comprises 402 exam-level problems from real competitions across multiple domains, ensuring diversity and balanced coverage.
  • Evaluation uses pass@k metrics for proof synthesis and autoformalization, promoting transparency and reproducibility in formal reasoning research.

ExamFormal-Bench is a publicly released benchmark suite for evaluating the formal reasoning and autoformalization capabilities of LLMs and formal theorem provers, with a particular focus on translation from informal (natural-language) mathematical statements to fully formal Lean 4 theorems, and on proof synthesis for competition-level mathematics. As of 2025, the term "ExamFormal-Bench" denotes both a set of formally aligned math problem datasets—most notably the 402-problem suite curated for Spark-Prover-X1—and a programmatic agenda for raising standards of completeness, transparency, and reproducibility in formal reasoning benchmarks (Zhou et al., 17 Nov 2025, Gulati et al., 1 Jun 2024, Yousefzadeh et al., 7 Jul 2025, Zheng et al., 2021).

1. Origins and Motivations

The design and deployment of ExamFormal-Bench responds to the critical need for rigorous, system-agnostic benchmarks in neural theorem proving and autoformalization. Earlier datasets such as miniF2F established the paradigm of cross-system alignment (e.g., Lean, Metamath, HOL Light, Isabelle) for 488 Olympiad and undergraduate-level problems, enabling apples-to-apples comparison across different proof assistants (Zheng et al., 2021). However, prior benchmarks suffered from limitations: incomplete pairing of informal and formal statements, lack of ground-truth proofs, small scale, and insufficient error-tracking. ExamFormal-Bench builds directly on these foundations by providing a single, human-vetted, multimodal test suite to evaluate real-world exam performance of LLM-based formal reasoning agents (Zhou et al., 17 Nov 2025, Yousefzadeh et al., 7 Jul 2025).

2. Dataset Structure and Construction

ExamFormal-Bench (in its canonical form, as released by Spark-Prover-X1) comprises 402 independently formalized Lean 4 theorem statements. Each problem is derived from real-world mathematical competitions and university qualifying exams, spanning three academic levels: middle school, high school, and undergraduate (Zhou et al., 17 Nov 2025). Problems were collected from official transcripts, processed via manual OCR and normalization, auto-formalized by an ensemble of LLM formalizers, and subjected to rigorous human review. Near-duplicate items (cosine similarity > 90%) were merged, resulting in a topic-balanced and non-redundant benchmark.
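A minimal sketch of the near-duplicate merging step is shown below. The released pipeline's actual text representation and tooling are not specified: the TF-IDF vectorization, the function name merge_near_duplicates, and the greedy keep-first strategy are illustrative assumptions, with only the >90% cosine-similarity threshold taken from the description above.

# Illustrative sketch only: the benchmark's real deduplication tooling is not
# published. TF-IDF vectors stand in for whatever representation was actually
# used; the 0.9 cosine-similarity threshold follows the description in the text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def merge_near_duplicates(statements, threshold=0.9):
    """Greedily keep each problem statement unless it is a near-duplicate
    (cosine similarity above `threshold`) of an already-kept statement."""
    vectors = TfidfVectorizer().fit_transform(statements)  # sparse doc-term matrix
    sims = cosine_similarity(vectors)                       # dense pairwise similarities
    kept = []
    for i in range(len(statements)):
        if all(sims[i, j] <= threshold for j in kept):
            kept.append(i)
    return kept  # indices of retained, non-redundant problems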

Problems are distributed evenly across six major domains:

  • Analysis
  • Geometry
  • Algebra
  • Probability & Statistics
  • Computational Mathematics
  • Discrete Mathematics (including combinatorics)

Each problem includes a Lean theorem statement (with any required imports and a proof skeleton), adhering to Mathlib4 compatibility and verified for successful Lean 4 compilation (Zhou et al., 17 Nov 2025).

3. Evaluation Protocols and Metrics

Benchmarking on ExamFormal-Bench uses the pass@k metric for proof synthesis: if a prover emits $k$ independent proof candidates for a theorem, $\mathrm{pass@}k$ is the probability that at least one of them compiles to a complete Lean 4 proof:

$$\mathrm{pass@}k = 1 - \prod_{i=1}^{k}\left(1 - \frac{n_\mathrm{correct}}{n_\mathrm{total}}\right)$$

The default evaluation regime fixes $k = 32$ (pass@32), and all benchmarking is done in "whole-proof" mode under this computational budget. No train/val/test splits are introduced; the entire set is held out for evaluation (Zhou et al., 17 Nov 2025).
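A short sketch of this metric follows, implementing the formula exactly as stated above; the function name pass_at_k and the numbers in the usage comment are illustrative, not reported results.

def pass_at_k(n_correct: int, n_total: int, k: int = 32) -> float:
    """pass@k per the formula above: treat n_correct / n_total as the
    per-candidate success rate and ask for at least one success in k tries."""
    p = n_correct / n_total
    return 1.0 - (1.0 - p) ** k

# Illustrative numbers only: if 4 of 32 sampled proofs compile,
# pass@32 = 1 - (1 - 0.125) ** 32 ≈ 0.986.
print(round(pass_at_k(4, 32, k=32), 3))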

Representative baseline results (pass@32):

| Prover | Pass@32 (%) |
| --- | --- |
| Spark-Prover-X1-7B | 51.2 |
| DeepSeek-Prover-V2-7B | 49.0 |
| Gödel-Prover-V2-8B | 48.8 |
| Kimina-Prover-Distill-8B | 45.3 |

For statement autoformalization, pass@8 is standard, and the strongest autoformalizers (Kimina-Formalizer-7B, Gödel-Formalizer-V2-8B, Spark-Formalizer-X1-7B) exceed 97% (Zhou et al., 17 Nov 2025).

4. Problem Taxonomy and Examples

Uniquely, ExamFormal-Bench tags problems with lightweight ontologies (e.g., “inequalities,” “induction,” “counting,” “graph-theoretic,” “integral,” “matrix”). This enables stratified analysis and uniform sampling (Zhou et al., 17 Nov 2025).

Example 1: Algebra (Inequality)

import Mathlib

-- Benchmark items are shipped as a Lean 4 statement with its imports and a
-- proof skeleton; the proof itself (here, an AM–HM argument, equivalently
-- expanding and applying AM–GM to each pair of terms) is what the evaluated
-- prover must supply.
theorem examformal_algebra_ineq
    {a b c : ℝ} (ha : 0 < a) (hb : 0 < b) (hc : 0 < c) :
    (a + b + c) * (1 / a + 1 / b + 1 / c) ≥ 9 := by
  sorry
Example 2: Discrete Mathematics (Combinatorics)

import Mathlib

-- `Finset.card_range` rewrites the cardinality of `Finset.range n` to `n`,
-- after which both sides coincide.
theorem examformal_combi_count {n k : ℕ} (hk : k ≤ n) :
    (Finset.range n).card.choose k = n.choose k := by
  simp [Finset.card_range]

5. Methodological Principles and Completeness

ExamFormal-Bench exemplifies benchmark completeness as defined in (Yousefzadeh et al., 7 Jul 2025): a complete formal reasoning benchmark contains four quadrants—(i) informal problem statements, (ii) formal problem statements, (iii) informal proofs, and (iv) formal proofs. While the initial dataset comprises only (i) and (ii), best practices outlined for future versions call for paired proofs in both natural language and Lean, with all statements and proofs verified for correctness and tracked for any error rate:

$$\text{ErrorRate} = \frac{E}{N}$$

where $E$ is the number of mistakes and $N$ is the total number of theorems. Openness is enforced by releasing the dataset and code under OSI-approved licenses, providing all evaluation scripts and Docker-based reproducibility, and instituting CI pipelines for continuous verification (Yousefzadeh et al., 7 Jul 2025).
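As an illustrative error-rate calculation (hypothetical numbers, not reported figures): if post-release review flagged $E = 4$ faulty statements among the $N = 402$ theorems, the tracked error rate would be

$$\text{ErrorRate} = \frac{4}{402} \approx 1.0\%.$$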

6. Comparative Context and Impact

ExamFormal-Bench situates itself among a growing ecosystem of formal reasoning benchmarks:

  • miniF2F: 488 problems, cross-system alignment, no paired informal proofs (Zheng et al., 2021).
  • PutnamBench, ProofNet: undergraduate/Putnam-level problems, typically only formal statements released.
  • FATE series: targets graduate and research-level algebra, with FATE-H/X explicitly pushing beyond contest math (Jiang et al., 4 Nov 2025).
  • FormalMATH: large-scale (5560 problems), human-in-the-loop formalization pipeline, highlights autoformalization and scaling challenges (Yu et al., 5 May 2025).
  • VeriEquivBench/FormalSpecCpp: focus on C++ and Dafny program verification rather than theorem proving, but reflect similar completeness/openness trends (Zeng et al., 7 Oct 2025, Chakraborty et al., 21 Feb 2025).

ExamFormal-Bench is distinctive in its exclusive focus on real exam problems, its fine-grained topic balancing, its combined LLM-and-human curation, and its role as a realistic test for models with practical, exam-focused ambitions.

7. Future Directions and Best Practices

Key improvement paths identified in the literature (Yousefzadeh et al., 7 Jul 2025, Gulati et al., 1 Jun 2024) include:

  • Augmenting the benchmark with full informal and formal proofs for every problem to achieve strict completeness.
  • Assigning detailed difficulty metadata and richer topic ontologies.
  • Continuous community-driven addition of new problems and erratum-tracked patching.
  • Locking verification to a stable Lean version and Docker image for environmental reproducibility.
  • Publishing all model submissions, proofs, and metadata for direct ablation and secondary analysis.
  • Addressing version drift and overfitting by maintaining a hidden subset reserved for final evaluation.

Adoption of these principles positions ExamFormal-Bench as a reference “exam bench” for formal mathematics reasoning, automated grading, and development of next-generation formal theorem provers.

