MiniF2F-Hard: A Formal Math Benchmark Challenge

Updated 4 July 2026

MiniF2F-Hard is a label for the hardest, Olympiad-level portion of the miniF2F benchmark, defined variously through psychometric, performance-based, and reannotation criteria.
These definitions include a Lean 4 Hard Mode reannotation, which hides the answer from the theorem statement, and a psychometric grading of theorems by model success rates.
The hard regime also figures in cross-system evaluations, where it tests semantic fidelity and formal proof strategies across proof assistants.

MiniF2F-Hard denotes the harder portion of the miniF2F benchmark for formal Olympiad-level mathematics, but the term is not uniform across the literature. In the original miniF2F release, the benchmark consists of 488 formal problem statements, split into 244 validation and 244 test problems, drawn from IMO, AIME, AMC, MATH, and custom high-school or undergraduate material, and no official split called “MiniF2F-Hard” is defined (Zheng et al., 2021). Later work uses the label in several distinct but related senses: as the IMO-level slice of miniF2F, as a psychometrically defined high-difficulty subset, as a performance-defined Olympiad tail, and as a Lean 4 “Hard Mode” reannotation in which answer-bearing statements are rewritten so that the answer must be discovered rather than merely proved (Liu et al., 17 Apr 2026).

1. Origin in the miniF2F benchmark

MiniF2F was introduced as “a dataset of formal Olympiad-level mathematics problems statements intended to provide a unified cross-system benchmark for neural theorem proving” (Zheng et al., 2021). Its first version contains 488 statements, with 244 validation statements and 244 test statements. The distribution is stratified across source categories and, where available, difficulty metadata. In particular, the validation and test sets each contain 20 IMO problems, 15 AIME problems, 45 AMC problems, MATH-derived algebra and number theory problems at levels 1–5, and custom problems in algebra, number theory, and induction (Zheng et al., 2021).

Within that original design, hardness is present only implicitly. For MATH-derived problems, the benchmark inherits MATH’s level $1,\dots,5$ labels, and the original paper reports that higher MATH levels are harder for baseline systems. For Olympiad sources, hardness is associated qualitatively with IMO, AIME, and harder AMC problems, but no official subset called MiniF2F-Hard is introduced (Zheng et al., 2021).

This historical point matters because later uses of the term “MiniF2F-Hard” are all secondary constructions built on top of miniF2F rather than part of the original benchmark specification. A plausible implication is that MiniF2F-Hard should be understood less as one canonical dataset than as a family of hard-regime views over the same underlying benchmark.

2. Competing formalizations of “hard”

Several later papers define or operationalize hardness in different ways. These constructions are compatible in spirit but not identical in granularity, semantics, or evaluation objective.

Interpretation	Operational criterion	Source
Implicit hard core in original miniF2F	Higher MATH levels and harder Olympiad-source problems; no official hard split	(Zheng et al., 2021)
Psychometric hard subset	Level 4 theorems unsolved by all four annotation LLMs; hard can also mean Levels 3–4	(Zhang et al., 2 Feb 2025)
Olympiad-level performance tier	Models with miniF2F test pass@32 $> 75\%$	(Denamganaï, 27 May 2026)
Hard Mode MiniF2F-Hard	Lean 4 reannotation of solution-style miniF2F-test problems so answers are not embedded in the theorem statement	(Liu et al., 17 Apr 2026)

In the psychometric view, “miniF2F-Graded” assigns each theorem a difficulty and discrimination score derived from model behavior. Level 4 consists of the 127 theorems unsolved by all four annotation LLMs; Levels 1–3 partition the remaining 361 theorems by normalized difficulty ranges. Hard theorems are often taken to mean Level 3 or Level 4, and especially Level 4, where the difficulty metric is exactly 1 (Zhang et al., 2 Feb 2025).

In the performance-based view, the hard regime is not an item-level subset but a model-level threshold. One paper treats the “Olympiad-level tier” as the regime where miniF2F test pass@32 exceeds 75%, motivated by the fact that 80 of the 244 test problems are AMC, AIME, or IMO problems (Denamganaï, 27 May 2026). This use of “hard” is therefore performance-relative: it identifies models that must be solving a substantial fraction of the Olympiad tail, rather than naming the tail directly.
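The arithmetic behind this threshold can be checked directly. The sketch below assumes that a pass rate is simply the fraction of the 244 test problems solved, and that in the worst case a model solves every non-competition problem before any competition problem; the function name `min_competition_solved` is illustrative and does not come from the cited paper.

```python
import math

TEST_SIZE = 244      # miniF2F test problems
COMPETITION = 80     # AMC + AIME + IMO problems in the test split
THRESHOLD = 0.75     # cutoff for the "Olympiad-level tier" on pass@32


def min_competition_solved(total: int, competition: int, threshold: float) -> int:
    """Fewest competition problems a model can solve while strictly exceeding `threshold`."""
    # Smallest integer count strictly greater than threshold * total.
    min_solved = math.floor(threshold * total) + 1
    easier_bulk = total - competition
    # Worst case (assumption): all non-competition problems are solved first.
    return max(0, min_solved - easier_bulk)


competition_share = COMPETITION / TEST_SIZE
lower_bound = min_competition_solved(TEST_SIZE, COMPETITION, THRESHOLD)

print(f"competition share of test split: {competition_share:.1%}")
print(f"minimum competition problems solved above 75%: {lower_bound}")
```

Under these assumptions, a model above the threshold solves at least 184 problems, so even if all 164 non-competition problems are among them, at least 20 of the 80 competition problems must be solved as well.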

The Lean 4 Hard Mode view is different again. There, MiniF2F-Hard is a benchmark transformation: answer-bearing statements are rewritten so that the answer is hidden behind an abbrev with a sorry, and the prover must first discover the answer and then prove the theorem. This is a semantic redesign of the task rather than a difficulty ranking over fixed statements (Liu et al., 17 Apr 2026).

3. Hard Mode MiniF2F-Hard in Lean 4

The most explicit dataset named MiniF2F-Hard is introduced in the Lean 4 Hard Mode literature. That work distinguishes Easy Mode, where the final answer is already embedded in the formal theorem statement, from Hard Mode, where any quantity a human competitor must derive is not supplied in the premises or as a concrete constant in the goal (Liu et al., 17 Apr 2026). In Hard Mode, solution-style problems are encoded with two placeholders: one for the answer and one for the proof. A typical pattern is an abbrev solution : ... := sorry, followed by a theorem referring to that symbol.
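The contrast between the two modes can be illustrated with a toy statement. The following Lean 4 sketch is not taken from MiniF2F-Hard: the problem and the names `toy_easy`, `toy_solution`, and `toy_hard` are hypothetical, and it assumes a recent Lean 4 toolchain in which the `omega` tactic is available without extra imports.

```lean
-- Easy Mode (hypothetical toy problem): the answer 4 appears in the goal,
-- so the prover only has to verify it.
theorem toy_easy (x : Nat) (h : x + 3 = 7) : x = 4 := by
  omega

-- Hard Mode: the answer is a placeholder that must be filled in first.
abbrev toy_solution : Nat := sorry

-- The theorem refers to the placeholder instead of a concrete constant;
-- the proof is the second placeholder.
theorem toy_hard (x : Nat) (h : x + 3 = 7) : x = toy_solution := by
  sorry
```

In this form a submission has to replace both occurrences of `sorry`: first the value of `toy_solution`, then the proof that uses it.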

This reannotation is applied to miniF2F-test. The paper states that MiniF2F-Hard contains 244 total problems and that 194 of them are Hard Mode, while another table uses 197 as the Hard Mode count; the paper does not resolve this inconsistency explicitly (Liu et al., 17 Apr 2026). The remaining problems are proof-style problems that already lack a separate answer-discovery phase and therefore remain essentially in Easy Mode.

The annotation process follows three explicit principles: Semantic Accuracy, Interpretability, and Consistency. The authors manually re-examine each statement, avoid auto-formalization as the source of ground truth, and use Lean experts with more than one year of Lean-related experience. For MiniF2F and FIMO, each problem is independently annotated by two experts. The work also reports that approximately 15 errors in MiniF2F and approximately 20 in FIMO were fixed during reannotation (Liu et al., 17 Apr 2026).

A characteristic example is mathd_algebra_320. In the original Easy Mode formalization, the answer 26 is embedded in the goal and an additional premise c = 2 leaks part of the derivation. The Hard Mode version introduces mathd_algebra_320_solution : ℕ := sorry, removes the leaked value of c, and instead makes canonicality conditions explicit through hypotheses such as Nat.gcd a c = 1 and Squarefree b (Liu et al., 17 Apr 2026). The resulting statement more closely matches the original contest task: the system must derive the simplified representation and then the final value.

This suggests that the Hard Mode version of MiniF2F-Hard is not merely harder computationally. It is also intended to be more faithful to the semantics of competition mathematics, where answer discovery is part of the problem rather than side information furnished by the theorem statement.

4. Psychometric, capability-based, and saturation views of the hard tail

One line of work reconstructs hardness from model behavior rather than from source provenance. In miniF2F-Graded, theorem difficulty is computed from per-theorem success rates of four annotation models, with a correction term that penalizes theorems solved by weaker models. The resulting benchmark defines four levels: Level 1 with normalized difficulty $0 \le x \le 0.5539$ and 120 theorems, Level 2 with $0.5539 < x \le 0.7661$ and 120 theorems, Level 3 with $0.7661 < x \le 0.9864$ and 121 theorems, and Level 4 with $x = 1$ and 127 theorems, where Level 4 consists of items unsolved by all four annotation LLMs (Zhang et al., 2 Feb 2025). On all 488 theorems, the average pass@128 over evaluation models decreases monotonically from $0.7097$ on Level 1 to $0.3917$ on Level 2 and $0.1019$ on Level 3, and is lowest on Level 4 (Zhang et al., 2 Feb 2025). In this interpretation, MiniF2F-Hard is naturally approximated by Level 4, or more broadly by Levels 3–4.
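The level boundaries above amount to a simple bucketing rule. The sketch below is an illustration rather than the miniF2F-Graded implementation: the function name `graded_level` is hypothetical, and because the reported ranges leave the interval between 0.9864 and 1 unassigned, the sketch rejects scores in that gap instead of guessing a level.

```python
# Reported theorem counts per level; they should cover all 488 theorems.
LEVEL_COUNTS = {1: 120, 2: 120, 3: 121, 4: 127}


def graded_level(difficulty: float) -> int:
    """Map a normalized difficulty score in [0, 1] to a level from 1 to 4."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must lie in [0, 1]")
    if difficulty == 1.0:
        return 4  # unsolved by all four annotation LLMs
    if difficulty <= 0.5539:
        return 1
    if difficulty <= 0.7661:
        return 2
    if difficulty <= 0.9864:
        return 3
    # The reported ranges do not assign scores in (0.9864, 1) to any level.
    raise ValueError("score falls in the unassigned gap (0.9864, 1)")


total_theorems = sum(LEVEL_COUNTS.values())
hard_broad = LEVEL_COUNTS[3] + LEVEL_COUNTS[4]  # Levels 3-4 reading of "hard"

print(total_theorems, hard_broad)
```

The counts confirm that the four levels cover the full benchmark, and that the broader Levels 3–4 reading of "hard" spans 248 theorems, roughly half of it.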

A second line of work treats the hard regime as a capability threshold. In a study of Compositional Learning Behaviours (CLB), miniF2F serves as a whole-proof benchmark over the 244-problem test split, of which 80 problems (32.8%) come from AMC, AIME, and IMO. The paper defines the “Olympiad-level tier” operationally as miniF2F test pass@32 $> 75\%$, arguing that below that range a model may still be covering mostly the easier bulk, whereas above it the model must be solving a substantial fraction of the 80 competition problems (Denamganaï, 27 May 2026). Across ten Lean 4 provers, the five models that cross the 75% threshold are exactly the five highest CLB scorers, a split that the paper assesses with an exact partition test (Denamganaï, 27 May 2026). The authors therefore describe CLB competency as necessary but not sufficient for the hard tail.
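One way to see how strong this coincidence is: count the ways five of ten provers could cross the threshold. The sketch below assumes a null hypothesis in which every choice of five provers is equally likely; this is one plausible reading of an exact partition test, not a confirmed description of the paper's procedure, and the resulting figure is not taken from the paper.

```python
import math

N_PROVERS = 10   # Lean 4 provers in the study
N_ABOVE = 5      # provers crossing the 75% pass@32 threshold

# Number of ways to choose which five provers cross the threshold.
n_partitions = math.comb(N_PROVERS, N_ABOVE)

# Assumption: under the null, each choice is equally likely, so the chance
# that the crossers are exactly the five highest CLB scorers is 1 / n_partitions.
p_exact = 1 / n_partitions

print(f"possible partitions: {n_partitions}")
print(f"probability of a perfect split under the null: {p_exact:.4f}")
```

With only ten provers, a perfect split is the most extreme outcome such a test can register, so the evidence is bounded by the sample size rather than by the strength of the association.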

A third perspective arises once top provers approach saturation on the standard benchmark. Using four strong provers, one study reports that 83.20% of MiniF2F-Test problems are solved by all four models, that only 41 of 244 problems have at least one model failing, and that only 22 of 244 problems are failed by all four models (Leang et al., 10 Jun 2026). Within these failure sets, the composition is heavily skewed toward IMO and some AMC problems: in the all-models-wrong subset, IMO is approximately 50% and AMC approximately 27% (Leang et al., 10 Jun 2026). This effectively shrinks “MiniF2F-Hard” to a residual unsolved tail rather than a broad named subset.
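The three tiers in this analysis are set operations over per-model results. The sketch below uses hypothetical problem identifiers and model names as toy data; only the final consistency check uses figures reported above.

```python
def failure_tiers(all_problems: set, solved_by_model: dict) -> dict:
    """Split problems into tiers by how many models solve them."""
    solved_sets = list(solved_by_model.values())
    solved_by_all = set.intersection(*solved_sets) & all_problems
    solved_by_any = set.union(*solved_sets) & all_problems
    return {
        "solved_by_all": solved_by_all,
        "some_model_fails": all_problems - solved_by_all,
        "all_models_fail": all_problems - solved_by_any,
    }


# Toy data (hypothetical identifiers, not real benchmark results).
problems = {"p1", "p2", "p3", "p4", "p5"}
results = {
    "model_a": {"p1", "p2", "p3"},
    "model_b": {"p1", "p2", "p4"},
    "model_c": {"p1", "p2"},
}
tiers = failure_tiers(problems, results)
print({name: sorted(ids) for name, ids in tiers.items()})

# Consistency check on the reported figures: 41 of 244 problems have at
# least one failing model, so 203 are solved by all four.
solved_by_all_reported = 244 - 41
share = round(100 * solved_by_all_reported / 244, 2)
print(solved_by_all_reported, share)
```

The reported numbers are mutually consistent: 203 of 244 is 83.20%, and the 22 problems failed by every model form a subset of the 41 with at least one failure.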

Taken together, these views show that hardness in MiniF2F has at least three non-equivalent operational meanings: psychometric rarity, capability threshold, and residual unsolved tail. This suggests that reported progress on “MiniF2F-Hard” is only comparable when the paper specifies which of these meanings it is using.

5. Cross-system and cross-paradigm extensions

MiniF2F-Hard has also become relevant in cross-assistant and cross-paradigm settings. In Rocq, the entire 488-theorem MiniF2F corpus is treated as a translation target for theorem statements, not proofs. The translation uses three inputs for each theorem (the natural-language description, the Lean formalization, and the Isabelle formalization) and seeks a Rocq theorem statement that parses and type-checks. The reported result is 478 successful translations out of 488, leaving 10 unresolved (Viennot et al., 11 Feb 2025). The paper does not explicitly reintroduce the MiniF2F-Hard name or report per-hard-split statistics. Because the translation targets the full corpus, whatever portion of MiniF2F a given definition classifies as hard is present in the Rocq corpus, except for any hard problems among the 10 unresolved translations (Viennot et al., 11 Feb 2025). Here, hard problems matter as difficult cases in cross-assistant statement alignment, particularly for complex numbers, finite sums and products, primes, floor functions, and typing issues.

A different extension recasts miniF2F in Dafny, the first translation of the benchmark to an auto-active verifier rather than an interactive theorem prover. That benchmark covers the full 488 problems. On the empty-proof baseline, Dafny verifies 99 of 244 test problems, or 40.6%, and 109 of 244 validation problems, or 44.7%, with {} as the proof body (Baksys et al., 11 Dec 2025). The remaining 145 test problems and 135 validation problems form the “hard” portion in the Dafny sense: they are the problems for which automation alone fails and LLM-generated proof hints are required (Baksys et al., 11 Dec 2025). The best reported model achieves 55.7% pass@4 on the full test set with iterative error correction, which corresponds to roughly a quarter of the hard test problems being recovered by LLM hints beyond the empty-proof baseline (Baksys et al., 11 Dec 2025).
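The "roughly a quarter" estimate follows from the reported figures. The sketch below assumes that every problem verified by the empty-proof baseline is also solved by the best model, so that the whole gain over the baseline comes from the hard portion; the source does not state this explicitly.

```python
TEST_SIZE = 244
EMPTY_PROOF_VERIFIED = 99    # test problems Dafny verifies with an empty proof body
BEST_PASS_AT_4 = 0.557       # best reported pass@4 on the full test set

# Problems where automation alone fails: the "hard" portion in the Dafny sense.
hard_test = TEST_SIZE - EMPTY_PROOF_VERIFIED

# Convert the reported rate back to a problem count.
best_solved = round(BEST_PASS_AT_4 * TEST_SIZE)

# Assumption: the baseline-verified problems are a subset of those solved by
# the best model, so the surplus is recovered from the hard portion.
recovered = best_solved - EMPTY_PROOF_VERIFIED
recovered_share = recovered / hard_test

print(f"hard test problems: {hard_test}")
print(f"recovered by LLM hints: {recovered} ({recovered_share:.1%})")
```

Under that assumption, about 37 of the 145 hard test problems are recovered, or roughly 25.5%.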

These extensions are significant because they decouple different aspects of hardness. In Rocq, the challenge is faithful re-expression of hard statements across proof assistants. In Dafny, the challenge is determining which hard statements become easy for SMT-backed automation and which still require nontrivial proof guidance. A plausible implication is that MiniF2F-Hard is not invariant under assistant choice: the same mathematical problem can shift between hard and easy depending on library coverage, automation strength, and representation.

6. Proof availability, benchmark quality, and the role of MiniF2F-Hard

The IMO slice of miniF2F has often functioned as an informal hard core. One Lean-focused study emphasizes that the miniF2F test set contains 20 IMO problems, yet before that work only 7 had public Lean proofs, with 3 of those written by mathematicians rather than automated systems (Yousefzadeh et al., 2024). The authors then provide complete formal proofs for the remaining miniF2F IMO problems and decompose 12 IMO problems into 907 nontrivial lemmas with 25,480 lines of Lean 4 code (Yousefzadeh et al., 2024). This turns the hardest Olympiad-style portion of miniF2F into a finer-grained diagnostic resource for training and evaluating systems below the full-problem level.

A separate benchmark-quality analysis argues that the original miniF2F often failed to represent Olympiad difficulty faithfully. That work reports discrepancies between formal and informal statements for more than half of the problems in miniF2F and states that about 40% of formal statements have errors or misalignments (Ospanov et al., 5 Nov 2025). It introduces miniF2F-v2 with corrected formal and informal statements and proofs, and distinguishes a simplified aligned version, miniF2F-v2s, from a competition-style version, miniF2F-v2c, in which multiple-choice structure, answer discovery, and other contest semantics are restored (Ospanov et al., 5 Nov 2025). In end-to-end informal-to-formal-to-proof evaluation, the best pipeline reaches about 36% on the original miniF2F but 70% on miniF2F-v2, while the competition-style v2c remains harder than v2s and exposes greater prover difficulty on IMO, AMC, and AIME subsets (Ospanov et al., 5 Nov 2025).

This benchmark-quality work reframes MiniF2F-Hard in an important way. Hardness is not only a matter of source difficulty or prover pass rates; it is also a matter of semantic fidelity. If a formal statement embeds the answer, weakens the goal, or introduces extra premises, then a theorem prover may obtain credit on a task that is substantially easier than the original Olympiad problem. The Hard Mode reannotation and the v2c competition-style corrections can therefore be read as two complementary attempts to restore the intended meaning of “hard” in MiniF2F-Hard.

In contemporary usage, MiniF2F-Hard is best understood as a moving interface between benchmark design and prover capability. In one direction, it names the competition-level, semantically faithful tasks that preserve answer discovery and full problem scope. In another, it names the residual set of statements, lemmas, or theorem-proving regimes that remain difficult after strong models nearly saturate the standard benchmark. The literature converges on the importance of that hard regime, but not on a single canonical construction of it.