miniF2F Benchmark for Automated Theorem Proving
- miniF2F is a cross-system benchmark comprising 488 manually curated Olympiad-level math problems formalized in multiple proof assistants.
- It standardizes evaluation across Lean, Metamath, Isabelle/HOL, and HOL Light by ensuring identical formal statement formats and rigorous verification.
- The benchmark drives advances in automated reasoning through methodological innovations like expert iteration, curriculum learning, and adaptive evaluation.
miniF2F is a cross-system benchmark of 488 formalized Olympiad-level mathematics problems, primarily intended to advance and measure automated theorem proving (ATP) and formal mathematical reasoning in interactive proof assistants. Collected from sources including the AMC, AIME, and IMO competitions as well as undergraduate-level assignments, the problems span algebra, number theory, inequalities, combinatorics, geometry, and basic calculus, and are manually formalized in Lean, Metamath, Isabelle/HOL, and (partially) HOL Light. miniF2F enables rigorous, apples-to-apples comparisons of ATP/ITP systems by standardizing formal statement formats and verification criteria, and is actively maintained with the goals of extensibility, portability, and fine-grained evaluation of both neural and symbolic provers.
1. Origins, Motivations, and Design
miniF2F, introduced in (Zheng et al., 2021), was motivated by the need for a unified, end-to-end-verifiable benchmark applicable across multiple proof-assistant ecosystems. The benchmark comprises 488 manually curated, Olympiad-style problems, explicitly divided into 244 test and 244 validation statements. Each entry is formalized in Lean, Metamath, and Isabelle/HOL, and optionally HOL Light, with all formalizations mechanically checked by the corresponding kernel. The design emphasizes:
- Cross-system comparability: Identical mathematical content in each formalism supports benchmarking across ATP and ITP platforms.
- Non-library, contest math flavor: Problems avoid reliance on large proof libraries, thus requiring models to “reason from scratch” and discouraging overfitting to system-specific primitives.
- Held-out “gold” test set: Test and validation splits prohibit ground-truth leakage, crucial for community-driven leaderboard development.
- Community extensibility: New problems, improved translations, and additional proof artifacts are invited and versioned.
2. Dataset Composition, Domains, and Formal Encodings
miniF2F draws from multiple sources per split (Zheng et al., 2021):
| Source | Test | Validation |
|---|---|---|
| IMO | 20 | 20 |
| AIME | 15 | 15 |
| AMC | 45 | 45 |
| MATH (Algebra levels) | 70 | 70 |
| MATH (NT levels) | 60 | 60 |
| Custom (extra/induct.) | 34 | 34 |
Problem domains include algebraic identities and inequalities, elementary real analysis, combinatorial and number-theoretic lemmas, and geometry (generally encoded via coordinate methods owing to the limited geometry support in current Mathlib and comparable libraries (Viennot et al., 11 Feb 2025)).
Formal encodings are provided for major proof environments:
- Lean: Theorem-prover style with explicit type declarations and tactic proofs, often closable by a single tactic call (e.g., `ring`, `simp`, or `nlinarith`); see the sketch after this list.
- Metamath: Low-level kernel with substitution rules, producing long proof chains.
- Isabelle/HOL: Isar proof structures invoking automation (`smt`, `auto`, etc.).
- Rocq: A recent high-fidelity translation (Viennot et al., 11 Feb 2025) that validates the benchmark's portability.
3. Evaluation Protocols and Metrics
Evaluation centers on the “proof pass rate” or Pass@k metric: for each theorem, a prover generates k proof attempts under a fixed compute and time budget; Pass@k is the fraction of problems for which at least one attempt verifies in the proof assistant (Zheng et al., 2021, Xin et al., 15 Aug 2024, Zhang et al., 2 Feb 2025, Shang et al., 27 Jul 2025). For Lean, this means Lean’s kernel accepts the proof; no manual adjudication or "ad hoc" library search is used in scoring (Shang et al., 27 Jul 2025). In its standard unbiased-estimator form,

$$\mathrm{Pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right],$$

where n is the total number of proof attempts per theorem and c the number of successful completions. For Metamath and Isabelle/HOL, analogous kernel-based verification with fixed expansion and timeout budgets is employed (Zheng et al., 2021, Zhao et al., 20 Aug 2024). Average proof length (steps/tactics) and resource usage are also reported to measure efficiency (Zheng et al., 2021, Wischermann et al., 18 Jul 2025).
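The estimator is straightforward to compute. The following minimal Python sketch (illustrative only, with invented per-theorem attempt counts) averages, over theorems, the probability that at least one of k sampled attempts is kernel-verified.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k for one theorem: probability that a random size-k subset
    of the n recorded attempts contains at least one of the c verified proofs."""
    if n - c < k:                      # every size-k subset contains a success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark-level Pass@k = mean over theorems (illustrative counts, not real data).
attempts = [(64, 3), (64, 0), (64, 12), (64, 1)]   # (n attempts, c verified) per theorem
score = sum(pass_at_k(n, c, 32) for n, c in attempts) / len(attempts)
print(f"Pass@32 = {score:.3f}")
```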
4. Advances and Limitations in miniF2F Modeling
Modern research leverages miniF2F as a principal testbed for advances in formal automated reasoning. Notable trends and methodologies include:
- Expert Iteration and Curriculum Learning (Polu et al., 2022): Alternating proof search and policy updates, combined with curricula of synthetic and curated statements, enables models to “climb” difficulty hierarchies without explicit ground-truth proofs, yielding 29.6% Pass@1 and 36.6% Pass@64 on test (see the sketch after this list).
- Tool-integrated, RL-based LLMs (Shang et al., 27 Jul 2025): StepFun-Prover’s fine-tuning with real-time Lean feedback attains 70.0% Pass@1 (32B), indicating the effectiveness of learning to “verify and reflect.”
- Subgoal-based and expert learning strategies (Zhao et al., 20 Aug 2024): Decomposing large proofs into finer-grained subgoals and alternating expert and policy learning improve both data efficiency and multi-step reasoning (56.1% pass rate on the Isabelle/HOL test split (Zhao et al., 20 Aug 2024)).
- Massive synthetic data generation (Lai et al., 17 May 2025): Tree-based exploration and adaptive beam search lead to 60.74% Pass@1, underscoring the importance of proof-state diversity at training time.
- Data quality and evaluation fidelity: Manual audits reveal major discrepancies in early releases; over 50% of formal statements in v1 were misaligned with their informal counterparts (Ospanov et al., 5 Nov 2025). The miniF2F-v2 update corrects all such errors, raising true end-to-end “Olympiad accuracy” to 70%, while highlighting the necessity of rigorous human verification and alignment.
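As a concrete illustration of the expert-iteration loop in the first item above, the following self-contained Python sketch shows only the control flow of alternating proof search and policy updates. The helpers `sample_proof`, `lean_verifies`, and `fine_tune` are invented placeholders (with a random "verifier" standing in for a real Lean kernel call); this is not any published implementation.

```python
import random

def sample_proof(statement: str, policy: dict) -> str:
    """Placeholder: sample one candidate proof for a statement from the policy."""
    return f"candidate proof for {statement} (round {policy['rounds']})"

def lean_verifies(statement: str, proof: str) -> bool:
    """Placeholder for kernel checking; a real loop would invoke the proof assistant."""
    return random.random() < 0.2

def fine_tune(policy: dict, solved: list) -> dict:
    """Placeholder: update the policy on newly verified (statement, proof) pairs."""
    return {**policy, "rounds": policy["rounds"] + 1}

statements = [f"thm_{i}" for i in range(100)]   # curriculum of formal statements
policy = {"rounds": 0}

for iteration in range(3):                      # expert-iteration rounds
    solved = []
    for s in statements:
        for _ in range(8):                      # up to k attempts per statement
            proof = sample_proof(s, policy)
            if lean_verifies(s, proof):         # keep only kernel-verified proofs
                solved.append((s, proof))
                break
    policy = fine_tune(policy, solved)          # policy update on verified proofs only
    print(f"round {iteration}: verified {len(solved)}/{len(statements)} statements")
```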
Critical limitations persist, particularly on IMO-level problems, due to mismatched formalisms, missing hypotheses, and lack of explicit intermediate result scaffolding (Ospanov et al., 5 Nov 2025, Yousefzadeh et al., 28 Nov 2024). Even top models solve only a handful of full IMO problems (Yousefzadeh et al., 28 Nov 2024), and many LLMs fail at multi-step or “unknown synthesis” tasks—a deficiency addressed by the MiniF2F-Solving extension (Liu et al., 7 May 2025).
5. Extensions, Variations, and Analytical Frameworks
miniF2F’s core structure has spawned significant benchmark and evaluation innovations:
- Formal Problem-Solving Variant (MiniF2F-Solving) (Liu et al., 7 May 2025): Problems reformulated to require explicit solution synthesis (`∃ w, P(w)`) and answer extraction, with correctness measured by Restricted Propositional Equivalence (RPE): equality with the reference answer must be provable in Lean using only restricted tactics (e.g., `rfl`, `norm_num`, `ring_nf`, `rw_search`, `aesop`); see the sketch after this list.
- Automatic cross-assistant translation (Viennot et al., 11 Feb 2025): Demonstrated the feasibility of near-complete machine translation of miniF2F into Rocq via LLMs, with up to 24 turns of error-correcting prompt feedback; 478/488 theorems were successfully ported and mechanically verified.
- Psychometric grading and adaptive evaluation (Zhang et al., 2 Feb 2025): Using item response theory (IRT)-based measures of “difficulty” and “discrimination”, miniF2F-Graded annotates each problem and supports adaptive, cost-efficient model assessment: only ≈24% of problems are necessary for reliable skill ranking, and item difficulty better aligns with LLM problem-solving behaviors than hand-crafted splits.
- IMO-proof step curriculum (Yousefzadeh et al., 28 Nov 2024): Initiative to formally decompose all 20 IMO problems in the miniF2F test split (plus 3 recent IMO problems) into 1,329 lemmas (≳40k lines of Lean), enabling fine-grained, failure-point diagnosis during ATP development.
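To make the MiniF2F-Solving formulation in the first item above concrete, here is a hypothetical Lean 4 sketch (the statement, names, and values are invented, assuming a Mathlib setup): the prover must exhibit a witness for an existential goal, and the extracted answer is then checked against the reference value using only a restricted tactic such as `norm_num`.

```lean
import Mathlib

/-- Hypothetical "solving" form: exhibit a witness w with P(w). -/
theorem solved_form : ∃ w : ℤ, w ^ 2 = 49 ∧ 0 < w :=
  ⟨7, by norm_num⟩

/-- Restricted-equivalence check on the extracted answer: the proposed value
    must match the reference answer via a restricted tactic call. -/
example : (7 : ℤ) = 7 := by norm_num
```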
| Dataset/Extension | Key Feature | Reference |
|---|---|---|
| miniF2F-v2 | Full alignment of formal/informal statements | (Ospanov et al., 5 Nov 2025) |
| MiniF2F-Graded | IRT-based difficulty/discrimination annotation | (Zhang et al., 2 Feb 2025) |
| MiniF2F-Solving | Explicit answer synthesis, RPE verification | (Liu et al., 7 May 2025) |
| Rocq translation | LLM-based, multi-turn cross-assistant mapping | (Viennot et al., 11 Feb 2025) |
| IMO-Steps | Lemma decomposition of all IMOs in miniF2F | (Yousefzadeh et al., 28 Nov 2024) |
6. Empirical Performance Landscape and Current State of the Art
miniF2F remains the standard benchmark for reporting proof success rates in Lean, Isabelle/HOL, and Metamath:
- Baseline (Lean+GPT-f+PACT/Olympiad, v1):
- Pass@1 = 24.6%; Pass@8 = 29.2% (Zheng et al., 2021)
- Expert Iteration/Value-guided (Lean, GPT-f):
- Pass@1 = 29.6%; Pass@64 = 36.6% (Polu et al., 2022)
- Massive Synthetic Data/RL/Tool Feedback (Lean 4):
- DeepSeek-Prover-V1.5-RL: Pass@32 = 63.5% (Xin et al., 15 Aug 2024)
- Goedel-Prover-DPO: Pass@32 = 63.5% (Lin et al., 11 Feb 2025)
- StepFun-Prover-Preview-32B: Pass@1 = 70.0% (Shang et al., 27 Jul 2025)
- Prover Agent: Pass@400 = 84.0% (SLM), Pass@2000 = 86.1% (Baba et al., 24 Jun 2025)
- Efficient Guided Approaches:
- ProofCompass: Pass@128 = 55.3% (with 25x fewer calls vs. baseline) (Wischermann et al., 18 Jul 2025)
- Isabelle/HOL (SubgoalXL):
- Test-split pass rate = 56.1%; 3/20 IMO problems solved (contest-level progress comparable to Lean-based systems, lagging only on ultra-long proofs) (Zhao et al., 20 Aug 2024)
Empirical consensus indicates that fine-tuned LLMs attain 60–70% pass rates on v1/v2 under moderate sampling (k=32–128), but only approach or surpass 70% with step-integrated CoT, tool feedback, and large-scale synthetic training (Shang et al., 27 Jul 2025). End-to-end informal→formal→proof performance (“Olympiad accuracy”) remains considerably lower unless formal/informal misalignments are rectified (Ospanov et al., 5 Nov 2025).
7. Impact, Community Practices, and Future Directions
miniF2F is central to both benchmarking and methodological development in mathematical ATP/ITP. Its multi-system design underpins reproducibility efforts and drives advances in formal statement translation, proof search, curriculum learning, synthetic data generation, and adaptive evaluation strategies. Key implications and future directions include:
- Necessity of benchmark rigor: Human verification and formal/informal statement alignment are required for meaningful advances in autoformalization and ATP (Ospanov et al., 5 Nov 2025).
- Greater diagnostic resolution: Fine-grained lemma-based and adaptive testing methodologies enable precise identification of LLM/prover strengths and failure modes (Yousefzadeh et al., 28 Nov 2024, Zhang et al., 2 Feb 2025).
- Portability and ecosystem bridging: Automated translation pipelines (e.g., to Rocq) demonstrate the potential of semi-automatic cross-system expansion (Viennot et al., 11 Feb 2025).
- Move beyond theorem proving to problem-solving: Recasting ATP as answer synthesis (miniF2F-Solving) brings the field closer to fully-automated mathematics problem solving (Liu et al., 7 May 2025).
- IMO-Grand Challenge framing: The end goal of a system capable of solving all IMO formalizations—start-to-finish, with independently verified translations and proofs—anchors ongoing research (Yousefzadeh et al., 28 Nov 2024).
miniF2F continues to evolve as a focal point for ATP innovation, with ongoing expansions of v2, translations, graded/task variants, and step-annotated proof datasets. The benchmark’s evolution exemplifies the increasingly meticulous methodological standards and diagnostic sophistication adopted across the automated mathematics community.