
IndiMathBench: Automated Math Benchmarks

Updated 7 December 2025
  • IndiMathBench is a dual benchmark framework combining a 2023 SMT-based inductive theorem proving dataset and a 2025 human-verified Lean 4 Olympiad corpus.
  • It leverages neural-guided synthesis and a human–AI hybrid autoformalization pipeline to generate nearly 30K conjectures and 312 formal theorems across diverse mathematical domains.
  • The benchmarks provide actionable metrics—including compile success, BEq rates, and proof synthesis efficacy—to drive improvements in automated reasoning systems.

IndiMathBench refers to two established, rigorously documented mathematical reasoning benchmarks, each addressing different facets of automated reasoning and theorem proving. The first, introduced in 2023, is a large-scale SMT-LIB dataset focused on inductive theorem proving for integer sequences, leveraging program equivalence over recursive, looping constructs. The second, presented in 2025, comprises a human-verified Lean 4 corpus of Olympiad-level theorems from the Indian Mathematical Olympiad system, targeting autoformalization and autonomous proof synthesis. Both benchmarks serve as critical testbeds for evaluating and advancing the state of automated, neural, and human-in-the-loop mathematical reasoning systems (Gauthier et al., 2023, Biyani et al., 30 Nov 2025).

1. Benchmark Compositions and Origins

1.1 Inductive Theorem Proving Benchmark (2023)

Composed of 29,687 conjectures, this IndiMathBench variant is constructed from the On-Line Encyclopedia of Integer Sequences (OEIS). Each conjecture specifies the equivalence $\forall n \in \mathbb{N},\ f_1(n) = f_2(n)$, where $f_1$ and $f_2$ are distinct programs—each synthesizing the same OEIS sequence via recursion and looping—discovered using a neural-guided program synthesis system in a domain-specific language with explicit looping operators. The benchmark emphasizes recursive reasoning on ℕ and is intended to stress-test current and future inductive theorem provers (Gauthier et al., 2023).
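
To make the conjecture format concrete, the following Python sketch (not the benchmark's DSL) shows two syntactically different programs for OEIS A000217, the triangular numbers, agreeing on an initial segment; the benchmark asks provers to establish the universal statement, not merely prefix agreement.

# Illustration only (ordinary Python, not the benchmark DSL): two distinct
# programs generating OEIS A000217 (triangular numbers).

def f1(n: int) -> int:
    # closed-form program
    return n * (n + 1) // 2

def f2(n: int) -> int:
    # loop-based program (accumulator style)
    acc = 0
    for i in range(1, n + 1):
        acc += i
    return acc

# Empirical agreement on the first 100 terms mirrors the selection filter;
# the conjecture itself is the universal claim: forall n in N, f1(n) = f2(n).
assert all(f1(n) == f2(n) for n in range(100))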

1.2 Olympiad-level Lean 4 Benchmark (2025)

The 2025 IndiMathBench instance encompasses 312 formal Lean 4 theorems, each paired with its corresponding natural-language statement. All problems originate from Indian Mathematical Olympiad contests—specifically the RMO and INMO—and are rigorously verified by expert annotators following LLM-accelerated autoformalization. The theorem set spans geometry, algebra, number theory, and combinatorics, with nontrivial geometry and parity challenges and coverage of both elementary and deep results (Biyani et al., 30 Nov 2025).

2. Dataset Structure and Formal Problem Representation

2.1 Recurrence-based Equivalence Problems

Problem statements are synthesized over a well-specified grammar:

$$\begin{aligned}
P ::= {} & 0 \mid 1 \mid 2 \mid X \mid Y \mid P+P \mid P-P \mid P\times P \mid P\ \mathrm{div}\ P \mid P\ \mathrm{mod}\ P \mid \mathrm{cond}(P,P,P) \\
& {} \mid \mathrm{loop}(F,A,B) \mid \mathrm{loop2}(F,G,A,B,C) \mid \mathrm{compr}(F,A)
\end{aligned}$$

Looping operators encode single and mutual recursion, and compr supports searching for roots. Each program is interpreted as a total function $f_P : \mathbb{Z}^2 \rightarrow \mathbb{Z}$, with concrete operational semantics reflecting arithmetic, case analysis, and recursion (see full rules in (Gauthier et al., 2023), §2.3).
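
To suggest how such terms evaluate, the sketch below interprets a fragment of the grammar in Python. The loop and compr semantics shown are simplified assumptions made for illustration; the authoritative operational rules are given in (Gauthier et al., 2023), §2.3.

# Illustrative interpreter for a fragment of the grammar.
# Assumed (simplified) semantics, for illustration only:
#   loop(F, A, B): apply F a total of A times, starting from B (F also sees the index)
#   compr(F, A):   the (A+1)-th integer n >= 0 with F(n) <= 0

def loop(F, A, B):
    x = B
    for i in range(1, max(A, 0) + 1):
        x = F(x, i)
    return x

def compr(F, A):
    hits, n = 0, 0
    while True:
        if F(n) <= 0:
            if hits == A:
                return n
            hits += 1
        n += 1

# Example: factorial expressed as loop((x, i) -> x * i, n, 1)
assert loop(lambda x, i: x * i, 5, 1) == 120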

2.2 Olympiad Autoformalization and Lean 4 Encodings

Each Olympiad problem is presented both in LaTeX (informal) and validated Lean 4 (formal) format. For example, a parity problem on integer floors is encoded as:

import Mathlib

-- Parity statement: ⌊n/1⌋ + ⌊n/2⌋ + ⋯ + ⌊n/n⌋ + ⌊√n⌋ is even (proof left as `sorry`)
theorem inmo_2014_2 (n : ℕ) :
  Even ((Finset.sum (Finset.range n) fun i =>
    Int.floor ((n : ℝ) / (i + 1 : ℝ))) +
    Int.floor (Real.sqrt n)) := by
  sorry

Every problem includes rigorous mathematical context, exploits Mathlib, and is verified to type-check in Lean 4 (Biyani et al., 30 Nov 2025).

3. Construction and Formalization Pipelines

3.1 OEIS-based Inductive Problems

A neural-guided synthesis engine iterates over the OEIS for 209 epochs. For each integer sequence $s$, two distinct programs are selected: (a) the “smallest” and (b) the “fastest”, subject to coverage and agreement constraints on up to 100 terms. The resulting conjectures exhibit wide structural and arithmetic variety and are filtered for full evaluation feasibility in SMT workflows (Gauthier et al., 2023).
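
The selection step can be pictured roughly as follows. This is a hedged sketch rather than the paper's implementation; it assumes each candidate program exposes a size, a measured runtime, and the terms it generates.

# Hedged sketch of per-sequence program selection (not the paper's code).
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Candidate:
    source: str       # program text in the DSL
    size: int         # program size (term count)
    runtime: float    # measured evaluation cost
    terms: List[int]  # terms the program generates

def select_pair(oeis_terms: List[int], candidates: List[Candidate],
                k: int = 100) -> Optional[Tuple[Candidate, Candidate]]:
    prefix = oeis_terms[:k]
    # keep only programs that reproduce the available OEIS prefix
    matching = [c for c in candidates if c.terms[:len(prefix)] == prefix]
    if len(matching) < 2:
        return None
    smallest = min(matching, key=lambda c: c.size)
    fastest = min(matching, key=lambda c: c.runtime)
    if smallest.source == fastest.source:
        return None   # identical programs yield no interesting conjecture
    # emitted conjecture: forall n, smallest(n) = fastest(n)
    return smallest, fastest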

3.2 Human–AI Hybrid Autoformalization of Olympiad Problems

A three-stage pipeline accelerates annotation:

  • Preprocessing: Problems are categorized (geometry/algebra/etc.), with retrieval agents extracting Mathlib context.
  • Formalization: LLMs generate Lean 4 code from informal text and context; iterative Lean compiler feedback guides up to six correction cycles per problem:
    procedure Formalize(p, Model):
      ctxt ← p.Category.Context            # retrieved Mathlib context for p's category
      f ← Model(p, ctxt)                   # initial LLM formalization attempt
      for i in 1..6:                       # up to six compiler-feedback repair cycles
        errors ← ValidateInLean(f)
        if errors.empty: break             # candidate type-checks: done
        feedback ← ParseErrors(errors)
        f ← Model(p, ctxt, f, feedback)    # repair attempt conditioned on diagnostics
      return f
  • Dashboard Verification: Multiple model generations are ranked using GPT-5-based summaries for syntactic validity and faithfulness; annotators finalize theorems by merging and refining fragments via a VS Code extension. This achieves a 3.5× speedup over pure manual annotation (Biyani et al., 30 Nov 2025).

4. Evaluation Protocols and Metrics

4.1 Inductive SMT Benchmark

Problems are packaged as SMT-LIB v2.6 files, each encoding program definitions, axioms, and a negated conjecture. A typical template is:

; declare the two candidate programs, small and fast
(declare-fun small (Int) Int)
(declare-fun fast  (Int) Int)
; ... axioms ...
(assert (exists ((c Int))
               (and (>= c 0)
                    (not (= (small c) (fast c))))))
(check-sat)

Success is defined as an “unsat” return (i.e., proof of equivalence) within a 60 s timeout. Metrics include proof time, induction depth, clause/backtrack counts, and recorded induction schema (simple, strong, mutual) as applicable (Gauthier et al., 2023).
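
A minimal way to reproduce this protocol, assuming a local z3 binary (other SMT-LIB v2.6 solvers can be substituted with their own timeout options), is sketched below; the file name is hypothetical.

# Hedged sketch: run one benchmark file through z3 with a 60 s hard timeout
# and count the problem as solved when the solver reports "unsat"
# (i.e., no counterexample c exists, so the two programs are equivalent).
import subprocess

def solved(path: str, timeout_s: int = 60) -> bool:
    result = subprocess.run(
        ["z3", f"-T:{timeout_s}", path],  # -T sets z3's hard timeout in seconds
        capture_output=True, text=True,
    )
    return result.stdout.strip().startswith("unsat")

# Example (hypothetical file name):
# print(solved("oeis_equiv_000042.smt2"))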

4.2 Autoformalization and Theorem Proving for Lean 4

Multiple LLMs are evaluated along four axes:

  • Compile Success: Fraction of outputs passing Lean type-checking.
  • Semantic Correctness (BEq): Bidirectional Extended Definitional Equivalence between candidate and ground-truth theorems.
  • Structural Similarity (GTED): Generalized Tree Edit Distance between Lean expressions.
  • Proof Synthesis: Fraction of problems where the sorry placeholder can be replaced with a full machine-verifiable proof (evaluated in both single-turn and multi-turn agentic settings).

Results are aggregated per model and domain, with geometry and combinatorics showing the lowest autoformalization and proving success rates (Biyani et al., 30 Nov 2025).

Model           | BEq (/312) | GTED (mean) | Compile Success (/312)
Claude Opus 4   | 67         | 0.51        | 243
Claude Sonnet 4 | 54         | 0.42        | 215
GPT-5           | 38         | 0.48        | 235
Gemini 2.5 Pro  | 47         | 0.24        | 151

Only about 160/312 problems were BEq-solved by any model; geometry is particularly challenging.
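
To ground the Compile Success metric, here is a sketch of how a candidate formalization can be type-checked; it assumes the candidate is written into a Lean 4 project that already depends on Mathlib, and the benchmark's actual harness may differ.

# Hedged sketch of a compile-success check: write the candidate Lean code into
# a Mathlib-enabled Lean 4 project and let `lake env lean` elaborate it.
import subprocess
from pathlib import Path

def compiles(candidate_lean: str, project_dir: str) -> bool:
    target = Path(project_dir) / "Candidate.lean"   # hypothetical file name
    target.write_text(candidate_lean)
    result = subprocess.run(
        ["lake", "env", "lean", str(target)],
        cwd=project_dir, capture_output=True, text=True,
    )
    # exit code 0 means the file type-checks (a `sorry` produces only a warning)
    return result.returncode == 0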

5. Problem Taxonomy and Difficulty

5.1 Inductive Categories

Difficulty stratification is approximate and follows three tiers (a rough syntactic classifier is sketched after the list):

  • Easy: No proper looping or bounded loops, reducing to basic arithmetic/conditionals (6,524 problems; 22%).
  • Medium: At least one loop with input-dependent bound, but susceptible to finite unrolling or symbolic inlining (6,966 problems; 23%).
  • Hard: Require authentic (often mutual) induction, resisting both syntactic and semantic simplifications (16,197 problems; 55%).
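
The classifier sketched below is only a rough syntactic approximation of this stratification; the benchmark's actual criteria also involve the semantic simplification checks described in the paper.

# Rough, illustrative heuristic only: classify a program term (represented as a
# nested tuple such as ("loop", F, A, B)) by the bounds of its loop operators.

def classify(term) -> str:
    bounds = []

    def collect(t):
        if isinstance(t, tuple):
            op, *args = t
            if op in ("loop", "loop2"):
                # positional convention assumed: loop(F, A, B), loop2(F, G, A, B, C)
                bounds.append(args[1] if op == "loop" else args[2])
            for a in args:
                collect(a)

    collect(term)
    if not bounds or all(isinstance(b, int) for b in bounds):
        return "easy"            # no loops, or constant bounds that can be unrolled
    # Input-dependent bounds: "medium" if unrolling/inlining still succeeds,
    # "hard" if genuine (possibly mutual) induction is needed; separating the
    # two requires the semantic checks from the paper, not attempted here.
    return "medium-or-hard"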

Mathematical domains are distributed as 40% combinatorics, 30% number theory, 10% recursive/factorial types, and 20% other areas (Gauthier et al., 2023).

5.2 Olympiad Theorem Diversity

Subjects cover geometry (31.4%), algebra (29.5%), number theory (24.7%), and combinatorics/set theory (14.4%). Problems are selected for their diversity and absence from Western-centric training corpora, supporting genuine assessment of generalization (Biyani et al., 30 Nov 2025).

6. Empirical Findings, Limitations, and Frontiers

6.1 Inductive Theorem Provers

State-of-the-art solvers (Z3, Vampire, CVC5) were evaluated; CVC5 results are summarized below:

Subset          | CVC5 Solved | % Success
All             | 2,428       | 8.2%
Strengthened C1 | 3,793       | 12.8%
Syntactic Loop  | 2,547       | 11.0%
Semantic Induct | 2,059       | 12.7%

Mutual recurrence and compr-based (e.g., prime search) problems are solved in under 5% of cases. Common failure modes include missing lemma generation (e.g., exponentiation identities), intractable arithmetic from nested recursion, and divergent behavior in div/mod (Gauthier et al., 2023).
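
As an illustration of the lemma-generation gap, the exponentiation identity below is the kind of auxiliary fact an inductive prover must invent on its own; in Lean with Mathlib it is one tactic call (the example is illustrative, not drawn from the benchmark).

import Mathlib

-- Illustrative auxiliary lemma: an exponentiation identity of the sort an
-- SMT-style inductive prover must conjecture before the main induction goes
-- through. Trivial for Mathlib's `ring`, hard to discover automatically.
theorem two_pow_succ (n : ℕ) : 2 ^ (n + 1) = 2 * 2 ^ n := by
  ring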

6.2 Autoformalization Gaps

Despite high syntactic validity (up to ~78%), BEq rates are much lower, reflecting a semantic gap in model understanding of Mathlib and Lean’s mathematical APIs. Only 51.3% of problems are BEq-solved by at least one model. Multi-turn proof synthesis remains limited, with GPT-5 solving 11% of IndiMathBench within 10 turns (36/312) and success on geometry problems close to nil. An observed model “refusal mode” highlights agentic awareness of proof difficulty but does not yield solutions (Biyani et al., 30 Nov 2025).

6.3 Pipeline and Community Impact

The human–AI annotation and dashboard workflow achieves a 3.5-fold speedup over manual formalization, yet the need for more robust semantic validation tools is acute. IndiMathBench’s construction from fresh, regional Olympiad sources mitigates dataset contamination, distinguishing it from benchmarks such as miniF2F and PutnamBench (Biyani et al., 30 Nov 2025).

7. Open Challenges and Prospects

  • Semantic autoformalization: Achieving full semantic alignment between informal and formal theorem statements requires advances in model reading comprehension and API mastery.
  • Geometry and combinatorics: Model limitations and gaps in Mathlib’s geometry infrastructure impede autoformalization and automated proving in these domains.
  • Heuristic and lemma discovery: Automated lemma generation reduces induction depth but incurs search-space complexity. Combining efficient retrieval, proof-planning (e.g., rippling), and LLM–ATP synergy is a current research frontier.
  • Evaluation methodology: Metrics like BEq and GTED, as well as pass@k for proof synthesis, provide quantitative grounding but do not assess deeper mathematical insight (the standard pass@k estimator is recalled below).
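
For reference, the pass@$k$ figures mentioned above are conventionally computed with the standard unbiased estimator (a general convention, not one specific to IndiMathBench), where $n$ proof attempts are sampled per problem and $c$ of them verify:

$$\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right]$$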

A plausible implication is that IndiMathBench will continue to drive improvements in human–AI collaboration tools, specialized theorem-proving tactics (notably for geometry/parity), and tightly integrated LLM-ATP proof search routines.

IndiMathBench, in both its inductive SMT and Olympiad Lean 4 instantiations, provides a foundational reference point for the ongoing advancement of mathematical autoformalization and neural theorem proving research (Gauthier et al., 2023, Biyani et al., 30 Nov 2025).
