
MiniF2F Benchmark

Updated 1 July 2025
  • MiniF2F is a benchmark with 488 formalized Olympiad-level math problems across multiple proof systems, used to evaluate automated theorem provers and language models.
  • MiniF2F serves as a standard testbed driving advances in neural theorem proving and formal mathematics AI, enabling evaluation of methods that achieve high pass rates on complex problems.
  • While strong in algebra and number theory, MiniF2F has limitations in areas like geometry and system comparability, with ongoing work targeting broader coverage and advanced AI reasoning.

The MiniF2F benchmark is a rigorously designed, cross-system evaluation suite that tests automated systems, including LLMs and neural-guided provers, on formalized, Olympiad-level mathematical problem statements. MiniF2F has become a foundational resource for measuring substantive advances in neural theorem proving, particularly for bridging the gap between informal mathematical reasoning and the stringent demands of formal proof environments.

1. Definition, Scope, and Motivation

MiniF2F is a curated benchmark of 488 high-school and undergraduate mathematics problems formalized across multiple proof assistant systems, targeting challenging domains such as algebra, number theory, and inequalities. Its construction is explicitly intended to enable unified, end-to-end verifiable, and cross-system evaluation of theorem proving methods, both neural and symbolic (Zheng et al., 2021). Prior domain-specific datasets (e.g., LeanStep, HOList, CoqGym) lacked coverage of mathematically meaningful, Olympiad-level problems in a system-agnostic format; MiniF2F addresses this by aligning carefully formalized problems across Lean, Metamath, and, with ongoing extensions, Isabelle and HOL Light.

Benchmark problems are sourced from:

  • International Mathematical Olympiad (IMO)
  • American Invitational Mathematics Examination (AIME)
  • American Mathematics Competitions (AMC)
  • High-school mathematics textbooks and undergraduate courses
  • Select problems stratified by difficulty from the MATH dataset

The inclusion of genuine Olympiad problems, with formal solutions required, distinguishes MiniF2F from earlier benchmarks that emphasized internal library lemmas or textbook-derived exercises.

2. Structure, Formalization, and System Coverage

Each MiniF2F problem is expressed as a formal proposition in multiple systems. For instance, a typical Lean formalization for an AMC item is:

-- AMC 12 2000 Problem 11, as formalized in miniF2F (Lean 3 / mathlib syntax).
-- The imports below assume a standard mathlib environment.
import data.real.basic
import tactic

theorem amc12_2000_p11
  (a b : ℝ)
  (h₀ : a ≠ 0 ∧ b ≠ 0)
  (h₁ : a * b = a - b) :
  a / b + b / a - a * b = 2 :=
begin
  field_simp [h₀.1, h₀.2],
  simp only [h₁, mul_comm, mul_sub],
  ring,
end

Problems are manually curated to ensure near-identical semantics across systems, despite variations in foundational logic or expressivity. Full support is provided for Lean and Metamath; Isabelle and HOL Light are partially covered, with ongoing efforts toward Coq and further systems (Zheng et al., 2021). Problems that require witness construction (e.g., "find all X such that…") are generally rephrased as verification tasks for given candidate solutions due to system alignment constraints.
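
As an illustration of this rephrasing, a question of the form "find x such that 2x + 3 = 7" becomes a goal in which the candidate answer is fixed in the statement and only needs to be verified. The following is a minimal Lean 3 sketch under a mathlib environment; it is an invented example, not an actual miniF2F statement:

import data.real.basic
import tactic

-- Hypothetical example: the unknown is replaced by the candidate answer x = 2,
-- so the prover only has to verify it rather than discover it.
theorem solve_linear_example
  (x : ℝ)
  (h₀ : 2 * x + 3 = 7) :
  x = 2 :=
begin
  linarith,
end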

Subject focus in v1 is on algebra, elementary number theory, and inequalities. Geometry and combinatorics are underrepresented due to current difficulties in cross-system formal expression but are targeted for future expansions.

3. Evaluation Methodology and Baseline Results

To allow direct performance comparison, MiniF2F adopts a pass rate metric: the proportion of problems for which the system produces a formally verified proof. In neural theorem proving contexts, this is typically articulated as Pass@N:

\[
\text{Pass@}N \;=\; \frac{\#\,\text{problems with at least one successful proof in } N \text{ attempts}}{\text{total number of problems}}
\]
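
For concreteness, a minimal Python sketch of this computation; the nested-list input format is an assumption made for illustration, not part of the benchmark tooling:

def pass_at_n(results):
    """results: one list of booleans per problem, one entry per attempt
    (up to N), True if that attempt yielded a formally verified proof."""
    solved = sum(any(attempts) for attempts in results)
    return solved / len(results)

# Three problems, up to two attempts each: the first two are solved,
# the third is not, so Pass@2 = 2/3.
print(pass_at_n([[False, True], [True, False], [False, False]]))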

Key systems and methods for miniF2F v1 include:

  • Lean (Lean 3/4): Using "tidy" (a heuristic best-first tactic search baseline; a generic sketch of this search pattern follows this list) and GPT-f/PACT (a GPT-3-style model fine-tuned for Lean tactic generation).
  • Metamath: Using GPT-f with only low-level inference rules available.
  • Isabelle/HOL Light: Infrastructure is ready; baseline results are pending or partial.
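
The "tidy" baseline follows the general shape of best-first tactic search. The Python sketch below illustrates only that pattern: the tactic pool, proof states, and the apply_tactic, is_solved, and score callables are hypothetical placeholders, not miniF2F or Lean APIs.

import heapq
import itertools

# Illustrative tactic pool; a real system would use the prover's own tactics.
TACTIC_POOL = ["intros", "simp", "linarith", "ring", "norm_num"]

def best_first_search(initial_state, apply_tactic, is_solved, score,
                      max_expansions=1000):
    """Expand the most promising proof state first (lower score = better);
    return a tactic sequence that closes the goal, or None if the budget
    is exhausted."""
    counter = itertools.count()  # tie-breaker so states are never compared
    frontier = [(score(initial_state), next(counter), initial_state, [])]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, state, path = heapq.heappop(frontier)
        if is_solved(state):
            return path  # verified proof found
        for tactic in TACTIC_POOL:
            new_state = apply_tactic(state, tactic)
            if new_state is None:  # tactic failed on this state
                continue
            heapq.heappush(frontier, (score(new_state), next(counter),
                                      new_state, path + [tactic]))
    return None  # search budget exhausted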

Summary of baseline results:

System     Model        Avg. proof length   Pass@1   Pass@8
Metamath   GPT-f        20.3                1.3%     1.6%
Lean       tidy         1.8                 18.0%    –
Lean       GPT-f/PACT   2.5                 24.6%    29.2%

Proof complexity is system-dependent; Lean's high-level tactics (e.g. linarith, ring, nlinarith) allow short proofs for classes of problems, while Metamath's lower-level approach leads to longer, more challenging proof search tasks for neural models.

4. Advances, Methodologies, and Recent Results

Subsequent research has built on the MiniF2F foundation, yielding notable advances:

  • Curriculum learning and expert iteration: Iteratively combining proof search with self-generated data and fine-tuning dramatically increases closure rates on MiniF2F, especially on harder problems (a minimal sketch of the loop follows this list). Models trained this way have achieved up to 41.2% (Pass@8, validation) and 34.5% (test) on MiniF2F with the full synthetic and manually curated curriculum (Polu et al., 2022).
  • Dynamic sampling and data augmentation: Neural provers employing dynamic sampling (e.g., DS-Prover) to vary tactic search width and systematic decomposition of tactics into finer units achieve Pass@1 scores up to 29.8%, with a further union of diverse methods reaching 31.4% on the test set (Vishwakarma et al., 2023).
  • Large-scale synthetic data and tree search: Data synthesis frameworks generating vast numbers of formal problems and solutions, coupled with critic-guided tree search algorithms, drive state-of-the-art scores above 68% (pass rate, test split) in Lean (Li et al., 30 Dec 2024). Hierarchical "distance" critics and MCTS/BFS-based search have proven especially effective for "system 2" (slow, deliberate) reasoning.
  • Agent-based architectures with auxiliary lemma generation: Integrating informal reasoning LLMs, formal autoformalizers, and Lean feedback, Prover Agent reaches 86.1% success on MiniF2F with 8B-parameter models and fewer sampled attempts, by generating and verifying auxiliary lemmas as stepping stones to final proofs (Baba et al., 24 Jun 2025).
  • Subgoal-based expert learning: SubgoalXL leverages subgoal decomposition and iterative, probabilistic expert learning to reach 56.1% pass rates in Isabelle on MiniF2F, suggesting substantial data efficiency and improved multi-step reasoning (Zhao et al., 20 Aug 2024).
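
A minimal Python sketch of the expert-iteration loop referenced above; sample_proofs, verify, and finetune are hypothetical stand-ins for the model's proof sampler, the proof assistant's checker, and a fine-tuning step, not APIs from the cited systems.

def expert_iteration(model, statements, sample_proofs, verify, finetune,
                     rounds=4, attempts=8):
    """Alternate between searching for proofs and fine-tuning on the
    verified proofs found so far."""
    solved = {}  # statement -> first verified proof
    for _ in range(rounds):
        new_data = []
        for stmt in statements:
            if stmt in solved:
                continue
            for proof in sample_proofs(model, stmt, attempts):
                if verify(stmt, proof):  # keep only formally checked proofs
                    solved[stmt] = proof
                    new_data.append((stmt, proof))
                    break
        if not new_data:  # no new problems closed; stop early
            break
        model = finetune(model, new_data)  # train on self-generated proofs
    return model, solved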

Table: Selected advanced results (pass rate on the test split; Lean unless otherwise specified)

System          Approach                           Pass rate (%)
GPT-f/PACT      Lean, expert iteration             29–34 (Pass@8)
DS-Prover       Dynamic tactic sampling            29.8 (Pass@1)
HunyuanProver   BFS + distance critic              68.4
Prover Agent    Agent-based, SLMs, lemma gen.      86.1
SubgoalXL       Isabelle, subgoal-based learning   56.1

5. Roles in System Interoperability and Dataset Extension

MiniF2F's system-agnostic design allows:

  • Translation for cross-assistant research: LLMs have been successfully used to translate almost all MiniF2F statements into Rocq, yielding an open-source resource for formal system interoperability and benchmarking (Viennot et al., 11 Feb 2025).
  • Problem-solving extensions: FPS (Formal Problem-Solving) adapts MiniF2F to directly measure AI "solving" of unknowns (not just verification), with formal correctness defined via the RPE (Restricted Propositional Equivalence) metric (Liu et al., 7 May 2025).
  • Psychometric grading: Recent methodology grades MiniF2F problems by LLM-observed difficulty and discrimination, enabling adaptive evaluation and reducing evaluation costs, while better reflecting model capability differences (Zhang et al., 2 Feb 2025).
  • Rich lemma-level datasets: Manual formalization of previously unsolved MiniF2F IMO problems and their nontrivial lemma decompositions now provides a rigorous testbed for stepwise evaluation and diagnosis of AI models at Olympiad level (Yousefzadeh et al., 28 Nov 2024).

6. Challenges, Limitations, and Future Directions

Key technical and empirical limitations include:

  • Subject coverage imbalance: Geometry and combinatorics are less represented due to formalization challenges; expansion is an explicit target.
  • System feature imbalances: High-level Lean tactics lower proof lengths, while systems without such tactics pose stiffer challenges for LLMs—impacting comparability.
  • Difficulty segmentation: The hardest MiniF2F problems, especially those at IMO level, remain largely beyond current LLM and ATP capabilities, as evidenced by very low pass rates in the absence of auxiliary lemma or subgoal generation.
  • Alignment and expressive evaluation: Witness construction, multi-choice problems, and separation between answer discovery and verification motivate ongoing methodology and benchmark updates.
  • Reliance on synthetic data: Recent state-of-the-art systems train on millions of automatically generated instances, but how well such synthetic data transfers to genuinely novel Olympiad problems remains an open question.

Anticipated future directions are:

  • Full coverage for Coq, Isabelle, and other systems.
  • Broader subject/domain representation as libraries mature.
  • Advanced curriculum and automated data/lemma synthesis.
  • Deeper integration of informal-formal reasoning bridges.
  • Community-driven growth in formal proof datasets and evaluation practices.

7. Significance and Impact

MiniF2F anchors the recent surge in neural formal mathematics, providing an objective and challenging testbed for algorithmic progress. It has catalyzed diverse advances, including hybrid neuro-symbolic architectures, scalable synthetic data, subgoal-based learning, cross-system translation, and problem-solving benchmarks. By specifying rigorous, cross-system, and verifiable goals, it directly aligns AI research with the IMO Grand Challenge—producing formal, checkable proofs of world-level Olympiad mathematics—and continues to serve as a barometer for progress toward AI agents that can rival or surpass top human mathematical ability.