MiniF2F Benchmark
- MiniF2F is a benchmark with 488 formalized Olympiad-level math problems across multiple proof systems, used to evaluate automated theorem provers and language models.
- MiniF2F serves as a standard testbed driving advances in neural theorem proving and formal mathematics AI, enabling evaluation of methods that achieve high pass rates on complex problems.
- While strong in algebra and number theory, MiniF2F has limitations in areas like geometry and system comparability, with ongoing work targeting broader coverage and advanced AI reasoning.
The MiniF2F benchmark is a rigorously designed, cross-system evaluation suite for automated systems, including LLMs and neural-guided provers, built around formalized, Olympiad-level mathematical problem statements. MiniF2F has become a foundational resource for measuring substantive advances in neural theorem proving, particularly in bridging the gap between informal mathematical reasoning and the stringent demands of formal proof environments.
1. Definition, Scope, and Motivation
MiniF2F is a curated benchmark of 488 high-school and undergraduate mathematics problems formalized across multiple proof assistant systems, targeting challenging domains such as algebra, number theory, and inequalities. Its construction is explicitly intended to enable unified, end-to-end verifiable, and cross-system evaluation of theorem proving methods, both neural and symbolic (MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics, 2021). Prior domain-specific datasets (e.g., LeanStep, HOList, CoqGym) lacked coverage of mathematically meaningful, Olympiad-level problems in a system-agnostic format; MiniF2F addresses this by aligning carefully formalized problems across Lean, Metamath, and, with ongoing extensions, Isabelle and HOL Light.
Benchmark problems are sourced from:
- International Mathematical Olympiad (IMO)
- American Invitational Mathematics Examination (AIME)
- American Mathematics Competitions (AMC)
- High-school mathematics textbooks and undergraduate courses
- Select problems stratified by difficulty from the MATH dataset
The inclusion of genuine Olympiad problems, with formal solutions required, distinguishes MiniF2F from earlier benchmarks that emphasized internal library lemmas or textbook-derived exercises.
2. Structure, Formalization, and System Coverage
Each MiniF2F problem is expressed as a formal proposition in multiple systems. For instance, a typical Lean formalization for an AMC item is:
```lean
theorem amc12_2000_p11
  (a b : ℝ)
  (h₀ : a ≠ 0 ∧ b ≠ 0)
  (h₁ : a * b = a - b) :
  a / b + b / a - a * b = 2 :=
begin
  field_simp [h₀.1, h₀.2],
  simp only [h₁, mul_comm, mul_sub],
  ring,
end
```
Problems are manually curated to ensure near-identical semantics across systems, despite variations in foundational logic or expressivity. Full support is provided for Lean and Metamath; Isabelle and HOL Light are partially covered, with ongoing efforts toward Coq and further systems (MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics, 2021). Problems that require witness construction (e.g., "find all X such that…") are generally rephrased as verification tasks for given candidate solutions due to system alignment constraints.
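For example (a schematic illustration rather than an actual benchmark item), an informal prompt such as "determine the positive real number x with x² = 4" would be recast in Lean 3 syntax as verification of the candidate answer x = 2:

```lean
-- Schematic only: the answer (x = 2) is baked into the statement, so the
-- prover verifies it instead of constructing the witness itself.
example : (2 : ℝ)^2 = 4 ∧ (0 : ℝ) < 2 :=
begin
  norm_num,
end
```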
Subject focus in v1 is on algebra, elementary number theory, and inequalities. Geometry and combinatorics are underrepresented due to current difficulties in cross-system formal expression but are targeted for future expansions.
3. Evaluation Methodology and Baseline Results
To allow direct performance comparison, MiniF2F adopts a pass rate metric: the proportion of problems for which the system produces a formally verified proof. In neural theorem proving contexts this is typically reported as Pass@N, the fraction of problems for which at least one of N independent proof attempts is accepted by the proof assistant's checker.
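For instance, a minimal way to compute this metric from per-problem attempt logs could look as follows; the function name and data layout are illustrative and not part of the benchmark's official tooling:

```python
from typing import Dict, List

def pass_at_n(results: Dict[str, List[bool]], n: int) -> float:
    """Pass@N: fraction of problems for which at least one of the first n
    proof attempts was accepted by the proof assistant's checker."""
    solved = sum(1 for attempts in results.values() if any(attempts[:n]))
    return solved / len(results)

# Example: pass_at_n({"amc12_2000_p11": [False, True], "imo_1964_p2": [False]}, 8) == 0.5
```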
Key systems and methods for miniF2F v1 include:
- Lean (Lean 3 for the original baselines; since ported to Lean 4): Using "tidy" (a heuristic best-first tactic search baseline) and GPT-f/PACT (a GPT-style language model fine-tuned for Lean tactic generation via Proof Artifact Co-Training).
- Metamath: Using GPT-f with only low-level inference rules available.
- Isabelle/HOL Light: Infrastructure is ready; baseline results are pending or partial.
Summary of baseline results:
| System | Model | Avg. Proof Length (proof steps) | Pass@1 | Pass@8 |
|---|---|---|---|---|
| Metamath | GPT-f | 20.3 | 1.3% | 1.6% |
| Lean | tidy | 1.8 | 18.0% | — |
| Lean | GPT-f/PACT | 2.5 | 24.6% | 29.2% |
Proof complexity is system-dependent: Lean's high-level tactics (e.g., linarith, ring, nlinarith) allow short proofs for whole classes of problems, while Metamath's lower-level calculus leads to longer, more challenging proof search tasks for neural models.
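As a concrete (schematic, non-benchmark) illustration of this gap in Lean 3 syntax, a single mathlib tactic call can close a goal that would require a long chain of elementary inference steps in Metamath:

```lean
-- nlinarith closes the goal in one step once given the hint 0 ≤ (x - 1)^2;
-- in Metamath the same fact would be assembled from many low-level steps.
example (x : ℝ) : 0 ≤ x^2 - 2*x + 1 :=
begin
  nlinarith [sq_nonneg (x - 1)],
end
```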
4. Advances, Methodologies, and Recent Results
Subsequent research has built on the MiniF2F foundation, yielding notable advances:
- Curriculum learning and expert iteration: Iteratively alternating proof search, collection of self-generated verified proofs, and fine-tuning dramatically increases solve rates on MiniF2F, especially on harder problems (a minimal sketch of this loop follows the list below). Models trained this way have achieved up to 41.2% (Pass@8, validation) and 34.5% (test) on MiniF2F with the full synthetic and manually curated curriculum (Formal Mathematics Statement Curriculum Learning, 2022).
- Dynamic sampling and data augmentation: Neural provers employing dynamic sampling (e.g., DS-Prover) to vary tactic search width and systematic decomposition of tactics into finer units achieve Pass@1 scores up to 29.8%, with a further union of diverse methods reaching 31.4% on the test set (Enhancing Neural Theorem Proving through Data Augmentation and Dynamic Sampling Method, 2023).
- Large-scale synthetic data and tree search: Data synthesis frameworks that generate vast numbers of formal problems and solutions, coupled with critic-guided tree search, drive state-of-the-art scores above 68% on the Lean test split (HUNYUANPROVER: A Scalable Data Synthesis Framework and Guided Tree Search for Automated Theorem Proving, 30 Dec 2024). Hierarchical "distance" critics and MCTS/BFS-based search have proven especially effective for "system 2" (slow, deliberate) reasoning; a generic skeleton of such a search appears after the results table below.
- Agent-based architectures with auxiliary lemma generation: Integrating informal reasoning LLMs, formal autoformalizers, and Lean feedback, Prover Agent reaches 86.1% success on MiniF2F with 8B-parameter models and fewer sampled attempts, by generating and verifying auxiliary lemmas as stepping stones to final proofs (Prover Agent: An Agent-based Framework for Formal Mathematical Proofs, 24 Jun 2025).
- Subgoal-based expert learning: SubgoalXL leverages subgoal decomposition and iterative, probabilistic expert learning to reach 56.1% pass rates in Isabelle on MiniF2F, suggesting substantial data efficiency and improved multi-step reasoning (SubgoalXL: Subgoal-based Expert Learning for Theorem Proving, 20 Aug 2024).
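The expert-iteration recipe mentioned in the first bullet can be summarized as a short loop. The sketch below is a simplified illustration only; the prove, verify, and train callables stand in for the actual search, proof-checking, and fine-tuning components rather than any published implementation:

```python
def expert_iteration(model, statements, prove, verify, train, rounds=4):
    """Simplified expert-iteration loop: search for proofs with the current
    model, keep only proofs accepted by the proof assistant, and fine-tune
    on that self-generated data before the next round."""
    for _ in range(rounds):
        verified = []
        for stmt in statements:
            proof = prove(model, stmt)             # model-guided proof search
            if proof is not None and verify(stmt, proof):
                verified.append((stmt, proof))     # keep only checker-accepted proofs
        model = train(model, verified)             # fine-tune on the new proofs
    return model
```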
Table: Selected advanced results on MiniF2F (test split; Lean unless specified)

| Model | Approach | Pass rate (%) |
|---|---|---|
| GPT-f/PACT | Lean, expert iteration | 29–34 (Pass@8) |
| DS-Prover | Dynamic tactic sampling | 29.8 (Pass@1) |
| HunyuanProver | BFS + distance critic | 68.4 |
| Prover Agent | Agent-based, small LMs, lemma generation | 86.1 |
| SubgoalXL | Isabelle, subgoal-based learning | 56.1 |
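Critic-guided tree search, as used by systems like HunyuanProver, can be pictured as best-first search over proof states with a learned "distance to completion" estimate as the priority. The skeleton below is a generic sketch under that assumption; the expand, critic, and is_proved callables are placeholders, not any system's real API:

```python
import heapq
from typing import Callable, Iterable, List, Optional, Tuple

def critic_guided_search(
    root: str,
    expand: Callable[[str], Iterable[Tuple[str, str]]],  # state -> (tactic, next_state) pairs
    critic: Callable[[str], float],                       # estimated distance to a finished proof
    is_proved: Callable[[str], bool],
    budget: int = 1000,
) -> Optional[List[str]]:
    """Best-first search over proof states, always expanding the state the
    critic judges closest to completion; returns a tactic sequence or None."""
    frontier: List[Tuple[float, str, List[str]]] = [(critic(root), root, [])]
    while frontier and budget > 0:
        budget -= 1
        _, state, path = heapq.heappop(frontier)
        if is_proved(state):
            return path
        for tactic, next_state in expand(state):
            heapq.heappush(frontier, (critic(next_state), next_state, path + [tactic]))
    return None
```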
5. Roles in System Interoperability and Dataset Extension
MiniF2F's system-agnostic design allows:
- Translation for cross-assistant research: LLMs have been successfully used to translate almost all MiniF2F statements into Rocq, yielding an open-source resource for formal system interoperability and benchmarking (MiniF2F in Rocq: Automatic Translation Between Proof Assistants -- A Case Study, 11 Feb 2025).
- Problem-solving extensions: FPS (Formal Problem-Solving) adapts MiniF2F to directly measure AI "solving" of unknowns (not just verification), with formal correctness defined via the RPE (Restricted Propositional Equivalence) metric (Beyond Theorem Proving: Formulation, Framework and Benchmark for Formal Problem-Solving, 7 May 2025).
- Psychometric grading: Recent methodology grades MiniF2F problems by LLM-observed difficulty and discrimination (the standard item-response-theory quantities recalled after this list), enabling adaptive evaluation that reduces evaluation cost while better reflecting capability differences between models (Psychometric-Based Evaluation for Theorem Proving with Large Language Models, 2 Feb 2025).
- Rich lemma-level datasets: Manual formalization of previously unsolved MiniF2F IMO problems and their nontrivial lemma decompositions now provides a rigorous testbed for stepwise evaluation and diagnosis of AI models at Olympiad level (A Lean Dataset for International Math Olympiad: Small Steps towards Writing Math Proofs for Hard Problems, 28 Nov 2024).
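The difficulty and discrimination parameters referenced in the psychometric-grading item above are the standard quantities of item response theory. Under a two-parameter logistic model (stated here as general background, not as that paper's exact formulation), the probability that a prover of ability $\theta$ solves problem $i$ with difficulty $b_i$ and discrimination $a_i$ is

$$P_i(\theta) = \frac{1}{1 + \exp\bigl(-a_i(\theta - b_i)\bigr)}.$$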
6. Challenges, Limitations, and Future Directions
Key technical and empirical limitations include:
- Subject coverage imbalance: Geometry and combinatorics are less represented due to formalization challenges; expansion is an explicit target.
- System feature imbalances: High-level Lean tactics lower proof lengths, while systems without such tactics pose stiffer challenges for LLMs—impacting comparability.
- Difficulty segmentation: The hardest MiniF2F problems, especially those at IMO level, remain largely beyond current LLM and ATP system capabilities, as evidenced by very low pass rates without auxiliary lemma or subgoal generation.
- Alignment and expressive evaluation: Witness construction, multi-choice problems, and separation between answer discovery and verification motivate ongoing methodology and benchmark updates.
- Reliance on synthetic data: Recent state-of-the-art systems train on millions of automatically generated instances, but how well such synthetic data transfers to genuinely novel Olympiad problems remains an open question.
Anticipated future directions are:
- Full coverage for Coq, Isabelle, and other systems.
- Broader subject/domain representation as libraries mature.
- Advanced curriculum and automated data/lemma synthesis.
- Deeper integration of informal-formal reasoning bridges.
- Community-driven growth in formal proof datasets and evaluation practices.
7. Significance and Impact
MiniF2F anchors the recent surge in neural formal mathematics, providing an objective and challenging testbed for algorithmic progress. It has catalyzed diverse advances, including hybrid neuro-symbolic architectures, scalable synthetic data, subgoal-based learning, cross-system translation, and problem-solving benchmarks. By specifying rigorous, cross-system, and verifiable goals, it directly aligns AI research with the IMO Grand Challenge—producing formal, checkable proofs of world-level Olympiad mathematics—and continues to serve as a barometer for progress toward AI agents that can rival or surpass top human mathematical ability.