miniF2F-test: Formal Reasoning Benchmark

Updated 16 May 2026

miniF2F-test is a rigorously curated benchmark that evaluates formal reasoning systems by integrating informal-to-formal translation with automated theorem proving.
The full miniF2F dataset comprises 488 problems from diverse Olympiad and academic sources, split evenly between validation and test and covering topics such as algebra, combinatorics, and analysis.
Its evaluation pipeline exposes challenges in autoformalization, semantic alignment, and proof search, driving advances in neural and symbolic reasoning agents.

The miniF2F-test benchmark is a rigorously curated dataset designed for the evaluation of neural theorem proving and formal reasoning systems, with a particular focus on Olympiad-level mathematics. It provides a cross-system testbed for assessing the capabilities of both automated and learning-based formal reasoning agents in tasks that combine informal-to-formal translation and actual theorem proving.

1. Origin, Scope, and Dataset Structure

miniF2F (“Mini Formal-to-Formal”; “miniF2F-test” denotes its primary evaluation split) was introduced to address the lack of unified, challenging benchmarks for high-school Olympiad mathematics in automated formal reasoning. The dataset consists of 488 problem instances, split evenly into 244 test and 244 validation problems. Its problems are sourced from:

International Mathematical Olympiad (IMO), American Mathematics Competitions (AMC), and American Invitational Mathematics Examination (AIME)
Textbook-style high school and early undergraduate mathematics content
Problems cover algebra, number theory, combinatorics, inequalities, recurrences, and elementary analysis (Zheng et al., 2021, Ospanov et al., 5 Nov 2025)

Each problem instance contains:

An informal English statement (often with a proof sketch)
A formalized statement in Lean (Lean 3/4) and Metamath and, for subsets of the problems, in Isabelle/HOL and HOL Light
A mechanized proof in the given formal system (except for some competition-style problems, where the formal statement may reflect solution ambiguity)

The dataset adheres to strict naming conventions and per-prover formatting, facilitating cross-system evaluation and translation. Problems are split “stratified by topic and difficulty,” ensuring robustness against topic imbalance.
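
To make the shape of a problem instance concrete, the following Lean 4 sketch pairs an informal statement with a formal statement and a mechanized proof. It is an illustration only: the problem, the theorem name `illustrative_am_gm_two`, and the use of Mathlib are assumptions and are not taken from the dataset.

```lean
import Mathlib

/-- Informal statement (hypothetical, not a dataset entry):
"Show that for all real numbers a and b, a^2 + b^2 ≥ 2ab." -/
theorem illustrative_am_gm_two (a b : ℝ) : a ^ 2 + b ^ 2 ≥ 2 * a * b := by
  -- The key fact is (a - b)^2 ≥ 0; nlinarith finishes from that hint.
  nlinarith [sq_nonneg (a - b)]
```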

2. Evaluation Pipeline: End-to-End Formal Reasoning

miniF2F-test is designed to evaluate the end-to-end capacity of an AI system to act as a math Olympiad solver. The canonical pipeline consists of three main stages:

Natural Language Comprehension: Read and parse the informal statement.
Autoformalization: Produce a Lean (or other system) statement faithful to the original mathematics. This step is typically performed by LLMs or specialized translators (e.g., Herald).
Theorem Proving: Attempt a formal proof using a neural or symbolic prover.

Credit is awarded only if the formal proof “corresponds to the original informal statement presented to the model.” This protocol exposes compounding failure modes: syntactic errors in autoformalization, semantic mismatches between informal and formal statements, and genuine limitations in theorem proving (Ospanov et al., 5 Nov 2025).

A typical evaluation loop includes:

Translating the informal statement to Lean syntax via an LLM-based formalizer.
Accepting only the first translation that passes Lean REPL parsing.
Passing the successfully parsed statement to a proof search module (e.g., Kimina-Prover, DeepSeek-Prover).
Verifying the produced proof and semantically aligning the proved statement with the original informal goal.
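
The loop above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the benchmark's reference harness: the callables `parses`, `prove`, and `aligned` are hypothetical stand-ins for a Lean REPL check, a proof search module that returns only verified proofs, and a semantic alignment check (human or automated).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Outcome:
    stage: str       # stage where evaluation stopped, or "solved"
    credited: bool   # True only if every stage succeeded


def evaluate_problem(
    informal: str,
    candidates: Iterable[str],
    parses: Callable[[str], bool],            # hypothetical: Lean REPL parse check
    prove: Callable[[str], Optional[str]],    # hypothetical: verified proof or None
    aligned: Callable[[str, str], bool],      # hypothetical: semantic alignment check
) -> Outcome:
    # Accept only the first candidate translation that parses.
    statement = next((c for c in candidates if parses(c)), None)
    if statement is None:
        return Outcome("autoformalization", False)

    # Hand the parsed statement to proof search.
    proof = prove(statement)
    if proof is None:
        return Outcome("proving", False)

    # Credit requires that the proved statement matches the informal goal.
    if not aligned(informal, statement):
        return Outcome("alignment", False)
    return Outcome("solved", True)


# Toy usage with stub components (all stubs are illustrative).
toy = evaluate_problem(
    informal="Show that 2 + 2 = 4.",
    candidates=["theorem t : 2 + 2 = ", "theorem t : 2 + 2 = 4"],
    parses=lambda s: s.endswith("4"),
    prove=lambda s: "by norm_num",
    aligned=lambda inf, s: True,
)
print(toy)
```

Returning the stage at which a problem failed, rather than a bare boolean, makes it possible to attribute end-to-end losses to translation, alignment, or proof search.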

3. Benchmark Evolution: v1, Failure Modes, and v2 Corrections

v1: Initial Release and Mismatches

The original miniF2F-test (v1) surfaced systematic issues that limited its utility as a reliable benchmark for state-of-the-art (SoTA) LLM-to-formal pipelines:

Over 50% of problems exhibited significant “discrepancies between the formal and informal statements” due to missing hypotheses, dropped or added assumptions, and mismatched quantification or result types.
Notable cases included multiple-choice questions expressed as single-goal formals or formal statements containing explicit solutions—resulting in artificially inflated apparent prover success rates.
Human checking revealed that while LLM-based equivalence checks in autoformalization could achieve up to 97% on v1, human-aligned success was around 66–69%. End-to-end accuracy dropped below 36%, despite individual autoformalization and proving submodules reaching 97% and 69%, respectively (Ospanov et al., 5 Nov 2025).

Failure Mode Taxonomy

Autoformalization Failures: Syntactically invalid Lean output (3–30% of cases)
Translation Failures: Syntactically valid but semantically deviating Lean (over half the problems)
Prover Failures: Proof search genuinely failing on faithful statements

Prominent semantic mismatches include missing quantifiers or domains, incorrect handling of limits, swapped inequalities, loss of geometric context, and recursion misrepresentation (e.g., depth-limited instead of limit-based constructs).
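
A small Lean 4 illustration of a translation failure caused by a dropped hypothesis is given below. The problem is hypothetical and chosen for brevity; it assumes a recent Lean 4 toolchain in which the `omega` tactic is available. Both statements parse, but only the first is faithful to the informal text.

```lean
-- Informal (hypothetical): "For every positive natural number n, n - 1 < n."

-- Faithful formalization: the positivity hypothesis is kept.
theorem faithful (n : Nat) (h : 0 < n) : n - 1 < n := by
  omega

-- Deviating formalization: dropping `h` still parses, but the resulting
-- universal statement is false at n = 0, because subtraction on Nat
-- truncates (0 - 1 = 0).
example : ¬ ∀ n : Nat, n - 1 < n := by
  intro h
  have h0 := h 0
  omega
```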

v2: Systematic Corrections

miniF2F-v2 eliminated all erroneous/unprovable statements, removed artificial simplifications, and ensured rigorous alignment across all 488 problems. Two subvariants were constructed:

v2s (“simplified-aligned”): Informal statements include solutions or choices so that the formal and informal tasks are precisely matched.
v2c (“competition style”): Informal statements maintain the original competitive format (multiple choices, find-and-prove tasks), with formal statements adapted accordingly (e.g., existential forms, option enumeration).

This correction enables accurate diagnosis of translation and search challenge difficulty and removes spurious easy cases that do not reflect actual Olympiad problem-solving rigor (Ospanov et al., 5 Nov 2025).
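
The difference between the two subvariants can be illustrated with a toy Lean 4 pair. This is a sketch under assumptions: the problem and theorem names are hypothetical, and the existential encoding shown is only one plausible reading of the "existential forms" mentioned above, not the exact encoding used in miniF2F-v2.

```lean
-- v2s-style: the answer appears in the statement, so the task is to verify it.
theorem v2s_style (x : Nat) (h : 2 * x + 3 = 7) : x = 2 := by
  omega

-- v2c-style (find-and-prove): the answer is not given, so the prover must
-- supply a witness and show that it is the only solution.
theorem v2c_style :
    ∃ x : Nat, 2 * x + 3 = 7 ∧ ∀ y : Nat, 2 * y + 3 = 7 → y = x :=
  ⟨2, by decide, fun y hy => by omega⟩
```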

4. Quantitative Baselines and Performance Statistics

The table below summarizes salient results on miniF2F-test and its v2 successors (Herald @128 + Kimina Prover system):

Version	Autoformalization (LLM/human)	Prover Proof Rate	End-to-End Accuracy
v1	97% / 66%	70%	34.8%
v2s	68.9% (valid)	72.1% (valid)	44.7% (test)
v2c	60.2% (valid)	–	40.6% (test)

Key insights:

End-to-end accuracy increased substantially after rigorous v2 corrections, with up to 10 percentage points gained by removing “unprovable and oversimplified problems.”
Even after v2 improvements, current pipelines reach only ≈45% end-to-end accuracy on v2s and ≈41% on the (harder) v2c evaluation, demonstrating persistent challenges in faithful translation and theorem search.
The gap between autoformalization and provability rates accentuates the fundamental challenge: alignment, not just completion.
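
The percentage-point gains quoted above follow directly from the end-to-end column of the table. The short Python check below recomputes them; the dictionary simply restates the table's values, and the variable names are illustrative.

```python
# End-to-end accuracy (%) as reported in the table above.
end_to_end = {"v1": 34.8, "v2s": 44.7, "v2c": 40.6}

# Gain over v1 in percentage points, rounded to one decimal place.
gains = {
    version: round(acc - end_to_end["v1"], 1)
    for version, acc in end_to_end.items()
    if version != "v1"
}
print(gains)
```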

The original test split also formed the practical basis for subsequent cross-system baselines; e.g., Lean tidy (deterministic heuristic) solves ≈18%, Lean GPT-f/PACT (700M parameter) achieves ≈24.6% Pass@1 and 29.2% Pass@8 on the test set (Zheng et al., 2021).

5. Cross-System Translation, Extensions, and Specialized Tracks

miniF2F-test serves not only as an LLM-to-Lean benchmark but also as a template for cross-system and advanced evaluations:

Multi-Prover Support: Formalizations are provided for Lean (Lean 3/4), Metamath, Isabelle/HOL (partial), and HOL Light (partial), supporting cross-system benchmarking and translation tasks (Zheng et al., 2021).
Dafny Translation: The benchmark has been ported to the auto-active verifier Dafny. Out-of-the-box SMT automation proves 40–45% of test/validation problems (“empty-proof baseline”); LLM-guided hinting (e.g., using Claude Sonnet 4.5) increases coverage to 55.7% (pass@4, iterative error correction) (Baksys et al., 11 Dec 2025).
Rocq Case Study: 478/488 theorems were automatically translated into Rocq via staged LLM prompting with syntactic/semantic feedback, demonstrating that systematic prompt refinement and error correction can port nearly the entire dataset into a new dependent-type system (Viennot et al., 11 Feb 2025).
Problem-Solving Track: The miniF2F-Solving split (375 “solve for an unknown” problems) evaluates pipeline completeness at the solution synthesis level, using Restricted Propositional Equivalence (RPE) for correctness. State-of-the-art achieves 22–27% “solved” rate (FPS), but up to 54% when the solution is hand-supplied (“proven”), highlighting the challenge gap between solution synthesis and mere proof-filling (Liu et al., 7 May 2025).

6. Metrics, Protocols, and Use Cases

The evaluation framework of miniF2F-test is distinguished by:

Pass@k Protocols: Estimation of problem-solving rate over k independent proof search attempts; Pass@1 and Pass@8 are standard (e.g., GPT-f/PACT models).
Human Verification: Automatic equivalence checks are supplemented by human alignment audits, identifying hidden drift or unprovable formalizations.
Pipeline Completeness: Only problems that are syntactically valid, semantically faithful, and fully proved receive credit; partial progress earns none.
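
One common way to estimate Pass@k from n sampled attempts per problem is the unbiased estimator popularized in the code-generation literature, 1 - C(n-c, k)/C(n, k), where c is the number of successful attempts. Whether a given miniF2F evaluation uses this estimator or simply runs exactly k attempts is an assumption that should be checked against the paper in question; the function names below are illustrative.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts drawn without
    replacement from n attempts (c of them successful) succeeds."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each item is (n_attempts, n_successes)."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


# Toy usage: three problems with 8 attempts each and 0, 1, and 8 successes.
toy_results = [(8, 0), (8, 1), (8, 8)]
print(benchmark_pass_at_k(toy_results, k=1))
print(benchmark_pass_at_k(toy_results, k=8))
```

When exactly k attempts are run per problem (n = k), the estimator reduces to the fraction of problems with at least one successful attempt.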

Use cases include:

Cross-system benchmarking of neural theorem provers and pipeline agents (e.g., GPT-f, DeepSeek, Kimina)
Training and evaluating autoformalization models for informal-to-formal translation
Analysis of proof search heuristics versus end-to-end learned approaches
Assessing transfer learning between formalization systems and mathematical domains
Teaching and curriculum material for formal proof education

7. Impact, Limitations, and Future Directions

miniF2F-test and its v2 corrections have catalyzed research into robust, semantically aligned formal reasoning evaluations. The elimination of artificial simplifications in v2 better isolates model limitations and prevents misleading benchmarking. Broad multi-system translation efforts—most notably Dafny and Rocq—demonstrate its utility as a lingua franca for theorem provers and agent pipelines.

Persistent challenges are:

Alignment between informal problem understanding and formal statement synthesis
Robustness to domain shifts, such as competition-style ambiguity or multi-solution settings (v2c)
Proof complexity for problems exceeding routine algebraic manipulations
Clearer separation of translation, solution synthesis, and proof-checking roles

Planned directions include expansion to advanced topics (geometry, combinatorics), the introduction of “challenge” subsets, richer proof-complexity metrics, and more systematic reuse of learned lemmas across problem instances (Zheng et al., 2021, Ospanov et al., 5 Nov 2025, Liu et al., 7 May 2025). The ecosystem formed by miniF2F supports continuous community-driven evolution, ensuring its central role in the progress of human