IndiMathBench: Automated Math Benchmarks
- IndiMathBench is a dual benchmark framework combining a 2023 SMT-based inductive theorem proving dataset and a 2025 human-verified Lean 4 Olympiad corpus.
- It combines neural-guided program synthesis with a human–AI hybrid autoformalization pipeline, yielding 29,687 conjectures and 312 formal theorems across diverse mathematical domains.
- The benchmarks provide actionable metrics—including compile success, BEq rates, and proof synthesis efficacy—to drive improvements in automated reasoning systems.
IndiMathBench refers to two established, rigorously documented mathematical reasoning benchmarks, each addressing different facets of automated reasoning and theorem proving. The first, introduced in 2023, is a large-scale SMT-LIB dataset focused on inductive theorem proving for integer sequences, leveraging program equivalence over recursive, looping constructs. The second, presented in 2025, comprises a human-verified Lean 4 corpus of Olympiad-level theorems from the Indian Mathematical Olympiad system, targeting autoformalization and autonomous proof synthesis. Both benchmarks serve as critical testbeds for evaluating and advancing the state of automated, neural, and human-in-the-loop mathematical reasoning systems (Gauthier et al., 2023, Biyani et al., 30 Nov 2025).
1. Benchmark Compositions and Origins
1.1 Inductive Theorem Proving Benchmark (2023)
Composed of 29,687 conjectures, this IndiMathBench variant is constructed from the On-Line Encyclopedia of Integer Sequences (OEIS). Each conjecture asserts the equivalence ∀x ∈ ℕ, f(x) = g(x), where f and g are distinct programs—each generating the same OEIS sequence via recursion and looping—discovered by a neural-guided program synthesis system operating over a domain-specific language with explicit looping operators. The benchmark emphasizes recursive reasoning on ℕ and is intended to stress-test current and future inductive theorem provers (Gauthier et al., 2023).
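For intuition, the following Lean sketch (illustrative only; the actual conjectures are expressed in the benchmark's DSL and shipped as SMT-LIB files, and the program names smallProg/fastProg are hypothetical) shows the shape of such a pair: two syntactically distinct programs computing the same sequence, together with an equivalence whose proof requires induction.
```lean
import Mathlib

-- Illustrative pair, not an actual benchmark instance: a recursive program
-- and a closed-form program for the squares (sum of the first n odd numbers).
def smallProg : ℕ → ℕ
  | 0     => 0
  | n + 1 => smallProg n + (2 * n + 1)

def fastProg (n : ℕ) : ℕ := n * n

-- The equivalence conjecture ∀ n, smallProg n = fastProg n; its proof
-- genuinely needs induction on n.
example (n : ℕ) : smallProg n = fastProg n := by
  induction n with
  | zero => simp [smallProg, fastProg]
  | succ k ih =>
      simp only [smallProg, fastProg] at *
      rw [ih]
      ring
```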
1.2 Olympiad-level Lean 4 Benchmark (2025)
The 2025 IndiMathBench instance encompasses 312 formal Lean 4 theorems, each paired with its corresponding natural-language statement. All problems originate from Indian Mathematical Olympiad contests—specifically the RMO and INMO—and are rigorously verified by expert annotators following LLM-accelerated autoformalization. The theorem set spans geometry, algebra, number theory, and combinatorics, with nontrivial geometry and parity challenges and coverage of both elementary and deep results (Biyani et al., 30 Nov 2025).
2. Dataset Structure and Formal Problem Representation
2.1 Recurrence-based Equivalence Problems
Problem statements are synthesized over a well-specified grammar: looping operators encode single and mutual recursion, and the compr operator supports searching for roots. Each program is interpreted as a total function from ℕ to ℤ, with concrete operational semantics covering arithmetic, case analysis, and recursion (see the full rules in (Gauthier et al., 2023), §2.3).
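A minimal Lean sketch of the single-recursion looping semantics follows, under the assumed reading that loop f a b applies a step function f a times starting from b while threading the iteration index; the names loopOp and factProg are hypothetical, and the authoritative operational rules remain those of (Gauthier et al., 2023), §2.3.
```lean
import Mathlib

-- Assumed reading of the single-recursion looping operator:
--   loopOp f 0 b       = b
--   loopOp f (a + 1) b = f (loopOp f a b) (a + 1)
-- The mutual-recursion operator and compr are analogous but omitted here.
def loopOp (f : ℤ → ℕ → ℤ) : ℕ → ℤ → ℤ
  | 0,     b => b
  | a + 1, b => f (loopOp f a b) (a + 1)

-- Factorial expressed through the operator: u 0 = 1, u i = u (i - 1) * i.
def factProg (n : ℕ) : ℤ := loopOp (fun acc i => acc * (i : ℤ)) n 1

#eval factProg 5  -- 120
```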
2.2 Olympiad Autoformalization and Lean 4 Encodings
Each Olympiad problem is presented both in LaTeX (informal) and validated Lean 4 (formal) format. For example, a parity problem on integer floors is encoded as:
```lean
import Mathlib

theorem inmo_2014_2 (n : ℕ) :
    Even ((Finset.sum (Finset.range n) fun i =>
        Int.floor ((n : ℝ) / (i + 1 : ℝ))) +
      Int.floor (Real.sqrt n)) := by
  sorry
```
3. Construction and Formalization Pipelines
3.1 OEIS-based Inductive Problems
A neural-guided synthesis engine iterates over the OEIS for 209 epochs. For each integer sequence, two distinct programs are selected: (a) the “smallest” and (b) the “fastest”, subject to coverage and equality constraints on up to 100 terms. The resulting conjectures exhibit wide structural and arithmetic variety and are filtered for full evaluation feasibility in SMT workflows (Gauthier et al., 2023).
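A minimal sketch of the equality constraint is shown below, with a hypothetical helper agreeUpTo that is not part of the benchmark tooling: a candidate pair is retained only if both programs produce the same values on an initial segment of the sequence.
```lean
import Mathlib

-- Hypothetical helper illustrating the prefix-agreement filter:
-- keep a program pair only if both compute the same first k terms.
def agreeUpTo (f g : ℕ → ℤ) (k : ℕ) : Bool :=
  (List.range k).all fun n => f n == g n

-- Example: a closed-form and a summation program for the triangular
-- numbers agree on their first 100 terms.
#eval agreeUpTo
  (fun n => (n : ℤ) * (n + 1) / 2)
  (fun n => ((List.range (n + 1)).map Int.ofNat).sum)
  100  -- true
```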
3.2 Human–AI Hybrid Autoformalization of Olympiad Problems
A three-stage pipeline accelerates annotation:
- Preprocessing: Problems are categorized (geometry/algebra/etc.), with retrieval agents extracting Mathlib context.
- Formalization: LLMs generate Lean 4 code from informal text and context; iterative Lean compiler feedback guides up to six correction cycles per problem:
  ```
  procedure Formalize(p, Model):
      ctxt ← p.Category.Context
      f ← Model(p, ctxt)
      for i in 1..6:
          errors ← ValidateInLean(f)
          if errors.empty: break
          feedback ← ParseErrors(errors)
          f ← Model(p, ctxt, f, feedback)
      return f
  ```
- Dashboard Verification: Multiple model generations are ranked using GPT-5-based summaries for syntactic validity and faithfulness; annotators finalize theorems by merging and refining fragments via a VS Code extension. This achieves a 3.5× speedup over purely manual annotation (Biyani et al., 30 Nov 2025).
4. Evaluation Protocols and Metrics
4.1 Inductive SMT Benchmark
Problems are packaged as SMT-LIB v2.6 files, each encoding program definitions, axioms, and a negated conjecture. A typical template is:
```smt2
; declare f_small, f_fast
(declare-fun small (Int) Int)
(declare-fun fast (Int) Int)
; ... axioms ...
(assert (exists ((c Int))
  (and (>= c 0)
       (not (= (small c) (fast c))))))
(check-sat)
```
4.2 Autoformalization and Theorem Proving for Lean 4
Multiple LLMs are evaluated along four axes:
- Compile Success: Fraction of outputs passing Lean type-checking.
- Semantic Correctness (BEq): Bidirectional Extended Definitional Equivalence between candidate and ground-truth theorems (a toy illustration appears at the end of this subsection).
- Structural Similarity (GTED): Generalized Tree Edit Distance between Lean expressions.
- Proof Synthesis: Fraction of theorems whose sorry placeholder can be replaced with a complete, machine-verifiable proof, in both single-turn and multi-turn agentic settings.
Results are aggregated per model and domain, with geometry and combinatorics showing the lowest autoformalization and proving success rates (Biyani et al., 30 Nov 2025).
| Model | BEq (/312) | Mean GTED | Compile success (/312) |
|---|---|---|---|
| Claude Opus 4 | 67 | 0.51 | 243 |
| Claude Sonnet 4 | 54 | 0.42 | 215 |
| GPT-5 | 38 | 0.48 | 235 |
| Gemini 2.5 Pro | 47 | 0.24 | 151 |
Only about 160/312 problems were BEq-solved by any model; geometry is particularly challenging.
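To make the BEq criterion concrete, the following toy sketch (an assumption-laden illustration, not taken from the benchmark or its evaluation harness) shows the underlying idea: a candidate formalization is accepted when each statement can be derived from the other inside Lean.
```lean
import Mathlib

-- Toy reference and candidate formalizations of the same informal claim
-- about a sequence a; neither is provable outright, but each is derivable
-- from the other, which is the spirit of a BEq-style equivalence check.
section
variable (a : ℕ → ℤ)

-- Reference: consecutive terms differ by 2 (stated additively).
-- Candidate: the same claim stated via subtraction.
theorem candidate_of_reference (h : ∀ n, a (n + 1) = a n + 2) :
    ∀ n, a (n + 1) - a n = 2 := by
  intro n
  rw [h n]
  ring

theorem reference_of_candidate (h : ∀ n, a (n + 1) - a n = 2) :
    ∀ n, a (n + 1) = a n + 2 := by
  intro n
  have := h n
  omega

end
```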
5. Problem Taxonomy and Difficulty
5.1 Inductive Categories
Difficulty stratification is heuristic rather than absolute and follows three tiers:
- Easy: No genuine looping, or only loops with constant (input-independent) bounds, reducing to basic arithmetic and conditionals (6,524 problems; 22%).
- Medium: At least one loop with input-dependent bound, but susceptible to finite unrolling or symbolic inlining (6,966 problems; 23%).
- Hard: Require authentic (often mutual) induction, resisting both syntactic and semantic simplifications (16,197 problems; 55%).
Mathematical domains are distributed as 40% combinatorics, 30% number theory, 10% recursive/factorial types, and 20% other areas (Gauthier et al., 2023).
5.2 Olympiad Theorem Diversity
Subjects cover geometry (31.4%), algebra (29.5%), number theory (24.7%), and combinatorics/set theory (14.4%). Problems are selected for their diversity and absence from Western-centric training corpora, supporting genuine assessment of generalization (Biyani et al., 30 Nov 2025).
6. Empirical Findings, Limitations, and Frontiers
6.1 Inductive Theorem Provers
State-of-the-art solvers (Z3, Vampire, CVC5) were evaluated; the table below reports CVC5's results by problem subset:
| Subset | CVC5 Solved | % Success |
|---|---|---|
| All | 2,428 | 8.2% |
| Strengthened C1 | 3,793 | 12.8% |
| Syntactic Loop | 2,547 | 11.0% |
| Semantic Induct | 2,059 | 12.7% |
Mutual recurrence and compr-based (e.g., prime search) problems are solved in under 5% of cases. Common failure modes include missing lemma generation (e.g., exponentiation identities), intractable arithmetic from nested recursion, and divergent behavior in div/mod (Gauthier et al., 2023).
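As an illustration of the missing-lemma failure mode, the following is the kind of auxiliary identity a prover typically needs as an explicit stepping stone before the main induction goes through (stated here in Lean purely for illustration; the benchmark itself poses these problems in SMT-LIB).
```lean
import Mathlib

-- An exponentiation identity of the sort whose absence blocks induction:
-- a solver that cannot invent 2 ^ (n + 1) = 2 * 2 ^ n as a lemma tends to
-- get stuck on recurrences involving powers of two.
example (n : ℕ) : 2 ^ (n + 1) = 2 * 2 ^ n := by
  ring
```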
6.2 Autoformalization Gaps
Despite high syntactic validity (up to ~78%), BEq rates are much lower, reflecting a semantic gap in model understanding of Mathlib and Lean's mathematical APIs. Only 51.3% of problems are BEq-solved by at least one model. Multi-turn proof synthesis remains limited, with GPT-5 solving 36/312 problems (≈11.5%) within 10 turns and success on geometry close to nil. An observed model “refusal mode” reflects agentic awareness of proof difficulty but does not yield solutions (Biyani et al., 30 Nov 2025).
6.3 Pipeline and Community Impact
The human–AI annotation and dashboard workflow achieves a 3.5-fold speedup over manual formalization, yet the need for more robust semantic validation tools is acute. IndiMathBench’s construction from fresh, regional Olympiad sources mitigates dataset contamination, distinguishing it from benchmarks such as miniF2F and PutnamBench (Biyani et al., 30 Nov 2025).
7. Open Challenges and Prospects
- Semantic autoformalization: Achieving full semantic alignment between informal and formal theorem statements requires advances in model reading comprehension and API mastery.
- Geometry and combinatorics: Model limitations and gaps in Mathlib’s geometry infrastructure impede autoformalization and automated proving in these domains.
- Heuristic and lemma discovery: Automated lemma generation reduces induction depth but expands the search space. Combining efficient retrieval, proof planning (e.g., rippling), and LLM–ATP synergy is a current research frontier.
- Evaluation methodology: Metrics like BEq and GTED, as well as pass@k for proof synthesis, provide quantitative grounding but do not assess deeper mathematical insight.
A plausible implication is that IndiMathBench will continue to drive improvements in human–AI collaboration tools, specialized theorem-proving tactics (notably for geometry/parity), and tightly integrated LLM-ATP proof search routines.
IndiMathBench, in both its inductive SMT and Olympiad Lean 4 instantiations, provides a foundational reference point for the ongoing advancement of mathematical autoformalization and neural theorem proving research (Gauthier et al., 2023, Biyani et al., 30 Nov 2025).