miniF2F & FIMO Math Benchmarks

Updated 18 January 2026
  • miniF2F and FIMO are formal mathematics benchmarks that test automated theorem proving using rigorously formalized problems from high-school contests to advanced IMO challenges.
  • miniF2F offers 488 cross-system problems (spanning AMC, AIME, IMO, and undergraduate topics), while FIMO focuses exclusively on challenging algebra and number theory problems drawn from the IMO Shortlist.
  • Both benchmarks employ pass@k evaluation metrics and integrate LLM-guided, auto-active, and interactive proof strategies, driving advancements in ATP methodologies and autoformalization.

miniF2F and FIMO are formal mathematics benchmarks that define the current frontier for automated theorem proving at advanced competition and olympiad levels. miniF2F is designed as a broad, cross-system suite covering high-school and early undergraduate mathematics, emphasizing coverage across proof assistants and problem types, while FIMO concentrates exclusively on challenging algebra and number theory problems sourced from the International Mathematical Olympiad (IMO) Shortlist, pushing toward true IMO-level difficulty. Both benchmarks have catalyzed the development of new methodologies in autoformalization, automated theorem proving (ATP), and LLM-guided mathematical reasoning.

1. Benchmark Structure and Problem Sourcing

miniF2F comprises 488 formal problem statements (244 test / 244 validation), sourced from the AMC, AIME, IMO, and a wide selection of undergraduate mathematics topics. Of these, only 40 are authentic IMO/IMO Shortlist problems; the rest capture a gradient of difficulty spanning routine high-school exercises to olympiad-style challenges. Problems are meticulously formalized in multiple proof assistants, including Lean, Isabelle/HOL, Metamath, and HOL Light, to enable cross-system benchmarking and direct comparison of ATP approaches across different formal frameworks (Zheng et al., 2021).

FIMO (Formal IMO-level) contains 149 (in some sources, 148) formal Lean statements, exclusively targeting algebra and number theory problems from the IMO Shortlists (2006–2021). Each FIMO entry is delivered as a triple: the LaTeX-formatted informal statement, the official IMO Shortlist proof in LaTeX, and a formal Lean encoding with a placeholder or completed proof. The selection process achieved a 60.8% formalization success rate over all available shortlist items, after human-in-the-loop verification of LLM-formalized problems (Liu et al., 2023).
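The triple structure of a FIMO entry described above can be sketched as a small record. This is an illustrative assumption about the shape of the data, not FIMO's actual schema; the field names and the example identifier are hypothetical.

```python
# Hypothetical sketch of a single FIMO entry as a Python record.
# Field names and the example identifier are illustrative, not the
# dataset's actual schema.
from dataclasses import dataclass

@dataclass
class FIMOEntry:
    problem_id: str          # e.g. a Shortlist-style identifier (hypothetical)
    informal_statement: str  # LaTeX-formatted problem statement
    informal_proof: str      # official IMO Shortlist proof in LaTeX
    formal_statement: str    # Lean theorem declaration
    formal_proof: str        # Lean proof, or "sorry" as a placeholder

entry = FIMOEntry(
    problem_id="A1-2006",
    informal_statement=r"Let $a_1, a_2, \ldots$ be a sequence such that ...",
    informal_proof=r"We first observe that ...",
    formal_statement="theorem a1_2006 (a : \u2115 \u2192 \u211d) ... : ...",
    formal_proof="sorry",
)

# Entries without a verified proof keep the `sorry` placeholder.
print(entry.formal_proof == "sorry")
```

The placeholder convention mirrors how unsolved statements are typically shipped: the Lean file compiles, but the proof obligation remains open.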

| Benchmark | # Problems | Math Level | System Coverage | Informal Proofs |
|-----------|------------|------------|-----------------|-----------------|
| miniF2F | 488 | AMC/AIME/IMO/Undergrad | Lean, Metamath, others | Source text only |
| FIMO | 149 | IMO Shortlist (Algebra/NT) | Lean | Yes (LaTeX) |

2. Formal Encoding and Representation

miniF2F statements are fully formalized and reviewed, with each problem represented as a formal theorem or lemma in the target proof system. Special care is taken to resolve informal mathematical ambiguity: multi-choice questions are rephrased as assertions fixing the correct answer, "witness" problems provide an explicit solution, and all necessary side-conditions are formalized (e.g., nonnegativity, triangle inequalities). The Lean and Metamath encodings (along with partial Isabelle/HOL, HOL Light ports) serve as the reference implementations. Manual formalization and review require approximately 22 minutes per problem (Zheng et al., 2021).

FIMO employs explicit Lean theorem declarations, bundling all hypotheses as conjunctions and specifying all domain, positivity, and type constraints. The LaTeX informal statement and solution, provided for every problem, facilitate hybrid reasoning settings and support evaluation of both closed-problem ATP and informal-to-formal proof translation. This structural pairing is crucial for new "Draft, Sketch, and Prove" (DSP) workflows that link human and machine representations (Liu et al., 2023).
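The hypothesis-bundling style described above can be illustrated with a schematic Lean declaration. This is a hypothetical, simplified example of the pattern, not a statement taken from the benchmark:

```lean
-- Illustrative only: a FIMO-style statement bundles domain and
-- positivity constraints into a single conjunctive hypothesis.
theorem shortlist_style_example (a b : ℝ)
    (h : 0 < a ∧ 0 < b ∧ a + b = 1) :
    a * b ≤ 1 / 4 := by
  sorry
```

The single hypothesis `h` carries all side-conditions, so a prover must destructure the conjunction before using any individual constraint.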

3. Evaluation Protocols and Metrics

Both benchmarks primarily use the pass@k metric for ATP performance: the fraction of problems solved (i.e., proofs accepted by the target proof assistant's kernel, with no sorry placeholder) within k attempts per problem.

Formally, for a benchmark with $|P|$ problems and $k$ independent proof attempts per problem,

$$\mathrm{pass@}k = \frac{1}{|P|} \sum_{p \in P} \mathbf{1}\bigl(\exists\, j \le k : \texttt{proof}_{p,j}\ \text{accepts}\bigr)$$

"Pass@$N$" refers to $N$ independent decodings per theorem; "cumulative accuracy" is $\mathrm{pass@}k_{\max}$ for the largest attempted $k$ (often in the thousands for RL-based methods) (Xin et al., 2024).
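The pass@k definition above can be computed directly from a matrix of per-attempt verifier verdicts. The following is a minimal sketch under the assumption that each attempt's accept/reject verdict has already been recorded; the function name and data layout are illustrative.

```python
# Minimal sketch: compute pass@k from per-attempt verifier verdicts.
# verdicts[p][j] is True iff attempt j on problem p was accepted by
# the proof checker (e.g. a kernel check with no `sorry` remaining).

def pass_at_k(verdicts: list[list[bool]], k: int) -> float:
    """Fraction of problems with at least one accepted proof in k tries."""
    solved = sum(1 for attempts in verdicts if any(attempts[:k]))
    return solved / len(verdicts)

# Three problems, four attempts each.
verdicts = [
    [False, True, False, False],   # solved on attempt 2
    [False, False, False, False],  # never solved
    [True, True, False, True],     # solved on attempt 1
]
print(pass_at_k(verdicts, 1))  # 1/3: one problem solved within one attempt
print(pass_at_k(verdicts, 4))  # 2/3: two problems solved within four attempts
```

Cumulative accuracy in the sense above is simply `pass_at_k(verdicts, k_max)` for the largest `k` attempted.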

Additional end-to-end metrics, as used in recent miniF2F-v2 analysis, require semantic alignment between the informal source statement, the formalization, and the accepted proof:

$$\mathrm{Acc}_{\mathrm{E2E}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\bigl( F_i \text{ compiles, is proved, and matches } s_i \bigr)$$

where $F_i$ is the formalization produced for informal statement $s_i$.

This corrects for mismatches and over-simplifications found in earlier miniF2F versions (Ospanov et al., 5 Nov 2025).

4. Automation, LLM-Guided Proving, and System Integration

miniF2F-Dafny translates the benchmark into Dafny, moving from purely interactive proofs to "auto-active" verification. Dafny's SMT automation (Z3 backend) handles approximately 40–45% of miniF2F test/validation problems with "empty" (no-step) proofs. The remaining problems are addressed by using LLMs to generate high-level proof hints; the best model (Claude Sonnet 4.5) achieved $\mathrm{pass@4} = 55.7\%$ using up to 4 error-corrected attempts per problem. The system leverages a division of labor: LLMs propose strategic assertions and lemma applications, while Dafny discharges low-level equational and quantifier reasoning, eliminating the need for LLMs to supply detailed stepwise algebra (Baksys et al., 11 Dec 2025).
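The error-corrected retry loop underlying this kind of pass@4 evaluation can be sketched as follows. `generate_proof` and `verify` are hypothetical stand-ins for the LLM call and the Dafny/Lean checker, not a real API; the toy implementations exist only to make the loop runnable.

```python
# Hedged sketch of an LLM-guided, error-corrected proving loop: each
# failed verification feeds the checker's error message back to the
# model for the next attempt. `generate_proof` and `verify` are
# hypothetical stand-ins, not a real prover or model API.

def attempt_with_feedback(statement, generate_proof, verify, max_attempts=4):
    feedback = None
    for attempt in range(1, max_attempts + 1):
        proof = generate_proof(statement, feedback)
        ok, error = verify(statement, proof)
        if ok:
            return proof, attempt  # accepted proof and attempt count
        feedback = error           # error-corrected retry
    return None, max_attempts

# Toy checker that accepts only the literal proof "qed", and a toy
# model that produces it once it has seen any checker feedback.
def toy_verify(stmt, proof):
    return (proof == "qed", "error: expected qed")

def toy_generate(stmt, feedback):
    return "qed" if feedback else "draft"

proof, n = attempt_with_feedback("theorem t : ...", toy_generate, toy_verify)
print(proof, n)  # succeeds on the second attempt
```

In the division-of-labor setting described above, `generate_proof` would emit only high-level assertions and lemma hints, leaving low-level reasoning to the verifier's automation.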

A direct comparison to FIMO demonstrates the impact of automation: fewer than 10% of FIMO's interactive Lean tasks admit auto-active (script-free) solutions, and even LLM-guided approaches (GPT-f, etc.) reach only 20–30% $\mathrm{pass@4}$. FIMO's higher difficulty reflects both the intrinsic mathematical depth of its problems and the current limits of ATP and system automation (Baksys et al., 11 Dec 2025, Liu et al., 2023).

5. Baseline Results and System-Level Comparisons

| Benchmark | Auto-active (%) | LLM-guided (pass@4) | System | Proof Style |
|-----------|-----------------|---------------------|--------|-------------|
| miniF2F-Dafny | 40–45 | ≈55.7% | Dafny | auto-active |
| FIMO | <10 | ~30% | Lean | interactive |

On the original miniF2F test set (Lean), state-of-the-art models such as DeepSeek-Prover (7B, trained on 8M synthetic theorem–proof pairs) reach 52.0% cumulative accuracy with large sample budgets (65536 attempts), compared to GPT-4's 23.0% at 64 attempts. Hypertree Proof Search (MCTS with 64 beams × 5000 rollouts) achieves 41.0%. On FIMO, DeepSeek-Prover successfully proved 5/148 problems (3.4%) at large sample budgets, while GPT-4 solved none (Xin et al., 2024). End-to-end pipeline accuracy on the revised miniF2F-v2 dataset stands at ≈44.7%–70% depending on adherence to problem fidelity and proof alignment practices (Ospanov et al., 5 Nov 2025).

6. Curation Practices, Limitations, and Future Development

Audits of miniF2F’s original releases revealed that over half of pipeline failures arose from informal–formal mismatches: incomplete or oversimplified informal statements, unprovable formalizations, or extraneous assumptions. The construction of miniF2F-v2 entailed systematic manual correction, restoring missing hypotheses and alignment, and introducing two-tier scoring ("simplified" versus "competition" variants). This curation approach, emphasizing rigorous manual vetting and human-verified translation, establishes necessary standards for reliable benchmark advancement (Ospanov et al., 5 Nov 2025).

FIMO leverages paired LaTeX statements/proofs and formalizations to support hybrid evaluation protocols. This enables research into DSP workflows, premise-selection augmentation, and multi-stage LLM-guided search. Current limitations for both benchmarks include under-representation of geometry/combinatorics in FIMO, sensitivity to proof assistant/library evolution, and a persistent gap between autoformalization and theorem-prover accuracy versus end-to-end gold-standard performance.

Future directions include expansion of domain coverage (notably combinatorics/geometry in FIMO), improved autoformalization cycles, tree-search/LLM hybrid ATP, and graduated curricula for progressive capability assessment (Liu et al., 2023). Continuous benchmark maintenance and explicit handling of multiple-choice/problem selection remain best practices at the field's frontier (Ospanov et al., 5 Nov 2025).

7. Significance and Impact

miniF2F and FIMO define and measure core objectives in formal mathematical automation. miniF2F’s breadth and multi-system coverage facilitate generalization studies and neural ATP comparisons at moderate-to-high difficulty. FIMO isolates the "IMO Grand Challenge" at scale, establishing that even state-of-the-art LLMs (GPT-4, DeepSeek-Prover) are far from robust, human-level performance. The design principles, curation efforts, and paired informal–formal structure underpin emerging research in both whole-proof automation and informal-to-formal pipeline development, directly shaping next-generation methodologies in machine mathematics (Baksys et al., 11 Dec 2025, Liu et al., 2023, Xin et al., 2024, Zheng et al., 2021, Ospanov et al., 5 Nov 2025).
