MiniF2F Benchmark

Updated 1 July 2025
  • MiniF2F is a benchmark with 488 formalized Olympiad-level math problems across multiple proof systems, used to evaluate automated theorem provers and language models.
  • MiniF2F serves as a standard testbed driving advances in neural theorem proving and formal mathematics AI, enabling evaluation of methods that achieve high pass rates on complex problems.
  • While strong in algebra and number theory, MiniF2F has limitations in areas like geometry and system comparability, with ongoing work targeting broader coverage and advanced AI reasoning.

The MiniF2F benchmark is a rigorously designed, cross-system evaluation suite for automated systems—including LLMs and neural-guided provers—on formalized, Olympiad-level mathematical problem statements. MiniF2F has become a foundational resource for measuring substantive advances in neural theorem proving, particularly in the context of bridging the gap between informal mathematical reasoning and the stringent demands of formal proof environments.

1. Definition, Scope, and Motivation

MiniF2F is a curated benchmark of 488 high-school and undergraduate mathematics problems formalized across multiple proof assistant systems, targeting challenging domains such as algebra, number theory, and inequalities. Its construction is explicitly intended to enable unified, end-to-end verifiable, and cross-system evaluation of theorem proving methods, both neural and symbolic (MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics, 2021). Prior domain-specific datasets (e.g., LeanStep, HOList, CoqGym) lacked coverage of mathematically meaningful, Olympiad-level problems in a system-agnostic format; MiniF2F addresses this by aligning carefully formalized problems across Lean, Metamath, and, with ongoing extensions, Isabelle and HOL Light.

Benchmark problems are sourced from:

  • International Mathematical Olympiad (IMO)
  • American Invitational Mathematics Examination (AIME)
  • American Mathematics Competitions (AMC)
  • High-school mathematics textbooks and undergraduate courses
  • Select problems stratified by difficulty from the MATH dataset

The inclusion of genuine Olympiad problems, with formal solutions required, distinguishes MiniF2F from earlier benchmarks that emphasized internal library lemmas or textbook-derived exercises.

2. Structure, Formalization, and System Coverage

Each MiniF2F problem is expressed as a formal proposition in multiple systems. For instance, a typical Lean formalization for an AMC item is:

theorem amc12_2000_p11
  (a b : ℝ)
  (h₀ : a ≠ 0 ∧ b ≠ 0)
  (h₁ : a * b = a - b) :
  a / b + b / a - a * b = 2 :=
begin
  field_simp [h₀.1, h₀.2],
  simp only [h₁, mul_comm, mul_sub],
  ring,
end

Problems are manually curated to ensure near-identical semantics across systems, despite variations in foundational logic or expressivity. Full support is provided for Lean and Metamath; Isabelle and HOL Light are partially covered, with ongoing efforts toward Coq and further systems (MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics, 2021). Problems that require witness construction (e.g., "find all X such that…") are generally rephrased as verification tasks for given candidate solutions due to system alignment constraints.
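To illustrate this rephrasing, here is a minimal sketch in MiniF2F's Lean 3 style; the statement and theorem name are hypothetical, not an actual benchmark entry. An informal task such as "find all real x with 2x + 3 = 11" is stated as verifying that the candidate answer x = 4 follows from the hypothesis:

theorem hypothetical_find_x_as_verification
  (x : ℝ)
  (h₀ : 2 * x + 3 = 11) :
  x = 4 :=
begin
  -- the candidate answer is baked into the goal; the proof merely verifies it
  linarith,
end

Answer discovery is thus kept outside the formal statement, and only verification of the supplied candidate is scored.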

Subject focus in v1 is on algebra, elementary number theory, and inequalities. Geometry and combinatorics are underrepresented due to current difficulties in cross-system formal expression but are targeted for future expansions.

3. Evaluation Methodology and Baseline Results

To allow direct performance comparison, MiniF2F adopts a pass rate metric: the proportion of problems for which the system produces a formally verified proof. In neural theorem proving contexts, this is typically articulated as Pass@N:

$$\text{Pass@}N = \frac{\#\,\text{problems with at least one successful proof in } N \text{ attempts}}{\text{total number of problems}}$$
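As a purely illustrative computation with hypothetical figures (not reported results): a prover allowed N = 8 attempts per problem that closes 61 of the 244 test-split problems in at least one attempt scores

$$\text{Pass@}8 = \frac{61}{244} = 25\%.$$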

Key systems and methods for MiniF2F v1 include:

  • Lean (Lean 3/4): Using "tidy" (a heuristic best-first tactic search baseline) and GPT-f/PACT (a GPT-3-based approach finetuned for Lean tactic generation); a generic sketch of this best-first search loop appears after this list.
  • Metamath: Using GPT-f with only low-level inference rules available.
  • Isabelle/HOL Light: Infrastructure is ready; baseline results are pending or partial.
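Both Lean baselines share the same outer loop: keep a priority queue of open proof states, repeatedly expand the most promising one with candidate tactics (drawn from a fixed list for tidy, sampled from the language model for GPT-f/PACT), and stop once the proof assistant reports no remaining goals. The sketch below is a hedged, generic reconstruction of that loop, not the actual MiniF2F or GPT-f code; propose_tactics and apply_tactic are hypothetical callbacks standing in for the tactic source and the proof assistant.

import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    cost: float                            # cumulative negative log-probability (lower is more promising)
    goal: object = field(compare=False)    # proof state returned by the proof assistant
    depth: int = field(compare=False, default=0)

def best_first_search(initial_goal, propose_tactics, apply_tactic,
                      max_expansions=512, max_depth=128):
    """Expand the most promising open proof state until the goal is closed
    or the search budget runs out. Returns True iff a proof was found."""
    frontier = [Node(0.0, initial_goal)]
    for _ in range(max_expansions):
        if not frontier:
            return False
        node = heapq.heappop(frontier)
        # Ask the tactic source (fixed list or language model) for candidates.
        for tactic, neg_logprob in propose_tactics(node.goal):
            outcome = apply_tactic(node.goal, tactic)  # run the tactic in the prover
            if outcome is None:                        # tactic failed; discard
                continue
            if outcome == "proved":                    # no goals remain: success
                return True
            if node.depth + 1 < max_depth:
                heapq.heappush(frontier,
                               Node(node.cost + neg_logprob, outcome, node.depth + 1))
    return False

Pass@N evaluation then amounts to running such a search up to N times per problem with fresh model samples and checking whether any run returns a verified proof.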

Summary of baseline results:

| System   | Model      | Avg. Proof Length (steps) | Pass@1 | Pass@8 |
|----------|------------|---------------------------|--------|--------|
| Metamath | GPT-f      | 20.3                      | 1.3%   | 1.6%   |
| Lean     | tidy       | 1.8                       | 18.0%  | n/a    |
| Lean     | GPT-f/PACT | 2.5                       | 24.6%  | 29.2%  |

Proof complexity is system-dependent; Lean's high-level tactics (e.g. linarith, ring, nlinarith) allow short proofs for classes of problems, while Metamath's lower-level approach leads to longer, more challenging proof search tasks for neural models.
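For intuition, the following illustration is hypothetical (a Lean 3 statement in the benchmark's style, not an actual MiniF2F problem); it shows how a single nlinarith call with one hint closes an inequality of the kind MiniF2F contains, which is why the Lean proof lengths reported above are so short:

theorem hypothetical_two_mul_le_sum_of_squares
  (a b : ℝ) :
  2 * a * b ≤ a^2 + b^2 :=
begin
  -- nlinarith only needs the hint that (a - b)^2 is nonnegative
  nlinarith [sq_nonneg (a - b)],
end

In Metamath, the same fact would have to be assembled from many low-level inference steps, which is the asymmetry the average proof lengths in the table reflect.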

4. Advances, Methodologies, and Recent Results

Subsequent research has built on the MiniF2F foundation, yielding notable advances:

Table: Select advanced results on the MiniF2F test split (Lean unless otherwise specified)

| System        | Approach                                  | Pass Rate (%)  |
|---------------|-------------------------------------------|----------------|
| GPT-f/PACT    | Lean, expert iteration                    | 29–34 (Pass@8) |
| DS-Prover     | Dynamic tactic sampling                   | 29.8 (Pass@1)  |
| HunyuanProver | BFS + distance critic                     | 68.4           |
| Prover Agent  | Agent-based, small LMs, lemma generation  | 86.1           |
| SubgoalXL     | Isabelle, subgoal-based learning          | 56.1           |

5. Roles in System Interoperability and Dataset Extension

MiniF2F's system-agnostic design allows the same aligned problem statements to be ported to additional proof assistants, supports direct comparison of provers built on different systems, and provides a template for community-driven extension of formal proof datasets.

6. Challenges, Limitations, and Future Directions

Key technical and empirical limitations include:

  • Subject coverage imbalance: Geometry and combinatorics are less represented due to formalization challenges; expansion is an explicit target.
  • System feature imbalances: High-level Lean tactics lower proof lengths, while systems without such tactics pose stiffer challenges for LLMs—impacting comparability.
  • Difficulty segmentation: The hardest MiniF2F problems, especially those at IMO level, remain largely beyond current LLM and ATP capabilities, as evidenced by very low pass rates without auxiliary lemma or subgoal generation.
  • Alignment and expressive evaluation: Witness construction, multi-choice problems, and separation between answer discovery and verification motivate ongoing methodology and benchmark updates.
  • Reliance on synthetic data: Recent SOTA systems train on millions of automatically generated instances, but the transferability of such synthetic data to genuinely novel Olympiad problems is an open investigation.

Anticipated future directions are:

  • Full coverage for Coq, Isabelle, and other systems.
  • Broader subject/domain representation as libraries mature.
  • Advanced curriculum and automated data/lemma synthesis.
  • Deeper integration of informal-formal reasoning bridges.
  • Community-driven growth in formal proof datasets and evaluation practices.

7. Significance and Impact

MiniF2F anchors the recent surge in neural formal mathematics, providing an objective and challenging testbed for algorithmic progress. It has catalyzed diverse advances, including hybrid neuro-symbolic architectures, scalable synthetic data, subgoal-based learning, cross-system translation, and problem-solving benchmarks. By specifying rigorous, cross-system, and verifiable goals, it directly aligns AI research with the IMO Grand Challenge—producing formal, checkable proofs of world-level Olympiad mathematics—and continues to serve as a barometer for progress toward AI agents that can rival or surpass top human mathematical ability.