
PeanoBench: Dataset for Math Proof Evaluation

Updated 30 June 2025
  • PeanoBench is a detailed dataset pairing natural language proof steps with Lean tactics to support research on formal proof translation.
  • It includes 371 annotated proofs segmented into fine-grained steps covering fundamental Peano Arithmetic theorems and common Lean tactics.
  • Integrated with the LeanTutor system, PeanoBench enables precise evaluation of autoformalization, error detection, and pedagogically informed feedback.

PeanoBench is a human-authored dataset designed for the rigorous evaluation of automated systems supporting mathematical proof tasks in Peano Arithmetic, particularly within the context of formalization and educational feedback. Developed as a core component of the LeanTutor project, PeanoBench systematically aligns human-written natural language proof steps with their precise, logically equivalent counterparts in the Lean theorem prover. PeanoBench’s scope, construction, and evaluative significance make it central to contemporary research at the intersection of formal mathematics, AI-guided tutoring, and natural language processing.

1. Dataset Construction and Structure

PeanoBench consists of 371 annotated proofs, each relating to fundamental theorems in Peano Arithmetic and derived from the Natural Numbers Game 4 (NNG4). Each proof is segmented into fine-grained steps, with every step paired in two modalities:

  • Human-written natural language (NL), capturing the informal proof intent in terms reflecting authentic student reasoning and verbalization.
  • Formal Lean syntax (FL), representing the exact tactic or sequence of tactics required for formal verification within Lean.

To capture genuine educational diversity, each theorem in PeanoBench is annotated with several “personas,” such as equation-based and justification-based phrasings, thus reflecting a spectrum of student linguistic styles. Additionally, the dataset includes algorithmically generated incorrect proofs, produced by systematically skipping or modifying steps in correct proofs, and marks the first location of error for each, enabling targeted evaluation of error detection capabilities.
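
For instance, a corrupted variant of the NNG4 proof of 0 + n = n, produced by skipping the appeal to the inductive hypothesis, might look like the following sketch (the comment annotations are illustrative, not the dataset's actual serialization):

induction n with d hd
-- Base case: 0 + 0 = 0
rw [add_zero]
rfl
-- Inductive step: 0 + succ d = succ d
rw [add_succ]
-- `rw [hd]` has been skipped: the final `rfl` below fails on the goal
-- `succ (0 + d) = succ d`, so this is marked as the first erroneous step
rfl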

2. Peano Arithmetic Coverage and Lean Proof Techniques

PeanoBench covers a broad foundational subset of Peano Arithmetic, including theorems and exercises on addition, multiplication, zero, successor, and structural induction over natural numbers. All formulas and corresponding Lean tactics are tightly coupled to canonical themes from NNG4.

Typical Lean tactics exemplified in PeanoBench include:

  • Induction: induction n with d hd
  • Rewrite: rw [add_zero], rw [add_succ]
  • Application of hypothesis: rw [hd]
  • Reflexivity: rfl
  • Case analysis: cases n with d
  • Introduction of witnesses: use d

For example, the proof of commutativity for addition,

$\forall a, b \in \mathbb{N},\ a + b = b + a$

would be formalized in Lean as follows:

induction b with d hd
rw [add_zero]
rw [zero_add]
rfl
rw [add_succ]
rw [succ_add]
rw [hd]
rfl

Thus, PeanoBench systematically associates such formal proofs with their corresponding stepwise natural language explanations.
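
Concretely, each of these tactics is paired with a natural-language counterpart; the alignment below is an illustrative sketch in which the NL phrasings are invented for this example rather than quoted from the dataset:

induction b with d hd   -- NL: "We prove this by induction on b."
rw [add_zero]           -- NL: "Adding zero changes nothing, so a + 0 = a."
rw [zero_add]           -- NL: "Likewise, 0 + a = a."
rfl                     -- NL: "Both sides are now identical."
rw [add_succ]           -- NL: "Unfold addition of a successor on the left."
rw [succ_add]           -- NL: "Unfold addition of a successor on the right."
rw [hd]                 -- NL: "Apply the inductive hypothesis."
rfl                     -- NL: "Both sides are now identical."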

3. Integration in the LeanTutor System

PeanoBench serves as the definitive ground truth dataset for the evaluation and training of modules in LeanTutor, a formally-verified AI tutoring system for mathematical proofs. Its integration occurs across all principal components of LeanTutor:

Autoformalizer & Proof Checker:

LeanTutor’s autoformalization module translates student-provided natural language proof steps into Lean tactics, using PeanoBench’s NL–FL pairs as exemplars for few-shot learning. Each autoformalized step is type-checked in Lean; compilation failures are immediately attributed to the corresponding proof step.
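
A minimal sketch of this translate-then-check loop is shown below; llm_translate and lean_check are hypothetical stand-ins for the few-shot model call and the Lean compilation check, not names from LeanTutor itself:

from typing import Callable

def autoformalize_proof(
    nl_steps: list[str],
    exemplars: list[tuple[str, str]],
    llm_translate: Callable[[str, list[tuple[str, str]]], str],
    lean_check: Callable[[list[str]], bool],
) -> tuple[list[str], int | None]:
    """Translate NL steps one at a time and type-check the growing script.

    exemplars are PeanoBench NL-FL pairs used for few-shot prompting;
    llm_translate and lean_check are assumed helpers, not LeanTutor APIs.
    """
    tactics: list[str] = []
    for i, step in enumerate(nl_steps):
        tactics.append(llm_translate(step, exemplars))
        if not lean_check(tactics):
            return tactics, i  # compilation failure attributed to step i
    return tactics, None       # every step type-checked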

Next-Step Generator:

The system’s next-step generation is informed by the available tactics documented in PeanoBench for the relevant theorem category (“world”) and restricted to those observed within staff solutions. Candidate tactics proposed by the model are validated against Lean’s logic and filtered using correctness and progress criteria.
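
The restriction-and-filtering step might be organized as in the following sketch; the per-world tactic whitelist and the compiles/makes_progress predicates are illustrative assumptions, not LeanTutor's actual interfaces:

def propose_next_step(
    candidates: list[str],
    world_tactics: set[str],
    compiles,
    makes_progress,
) -> str | None:
    """Return the first candidate tactic that is allowed, valid, and productive.

    world_tactics holds tactic names observed in staff solutions for the
    current "world"; compiles and makes_progress are predicates over a
    tactic applied to the current proof state (assumed for illustration).
    """
    for tactic in candidates:
        if tactic.split()[0] not in world_tactics:
            continue  # not among the tactics documented for this world
        if compiles(tactic) and makes_progress(tactic):
            return tactic
    return None  # no acceptable candidate; the caller may re-sample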

Pedagogical Feedback Generator:

LeanTutor’s feedback module leverages PeanoBench’s annotations both for accurate error identification and for the creation of pedagogically targeted hints. Feedback is grounded in authentic proof attempts and includes the following elements, assembled as sketched after this list:

  • Specification of the error type (taxonomy informed by PeanoBench annotation).
  • Context-aware guiding hints, rooted in the specific proof structure and previous student moves.
  • Generation of the next proof step in mathematical natural language, making no reference to Lean syntax or code.
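
As a sketch, these three elements might be carried in a single feedback object along these lines (all field and function names are hypothetical):

from dataclasses import dataclass

@dataclass
class Feedback:
    error_type: str    # drawn from the PeanoBench-informed taxonomy
    hint: str          # context-aware guidance, without leaking the answer
    next_step_nl: str  # next step in mathematical natural language only

def render_feedback(fb: Feedback) -> str:
    # The rendered text deliberately contains no Lean syntax or code.
    return (
        f"Error type: {fb.error_type}\n"
        f"Hint: {fb.hint}\n"
        f"Suggested next step: {fb.next_step_nl}"
    )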

4. Evaluation Metrics and Quantitative Benchmarks

PeanoBench is used as the exclusive evaluation corpus for the major performance metrics of LeanTutor:

  • Tactic-level accuracy: The proportion of NL proof steps correctly translated into the expected Lean tactic, verified via a relaxed exact-matching scheme that compares proof states (modulo variable renaming).
  • Proof-level accuracy: The percentage of entire proofs for which all steps are correctly autoformalized (see the aggregation sketch after this list).
  • Error localization accuracy: Rate at which LeanTutor identifies the first erroneous step in an incorrect proof, using ground-truth error positions from PeanoBench.
  • Feedback quality: Human raters assess the system’s natural language hints for accuracy, relevance, clarity, and avoidance of answer leakage.
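
Assuming step-level matching (the proof-state comparison) has already been computed, the first two metrics reduce to simple aggregation, as in this sketch:

def tactic_and_proof_accuracy(proofs: list[list[bool]]) -> tuple[float, float]:
    """proofs holds one list of per-step match results per proof, where True
    means the produced and expected proof states agree modulo variable
    renaming."""
    steps = [ok for proof in proofs for ok in proof]
    tactic_acc = sum(steps) / len(steps)                    # step-level accuracy
    proof_acc = sum(all(p) for p in proofs) / len(proofs)   # fully correct proofs
    return tactic_acc, proof_acc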

Reported results based strictly on PeanoBench include:

| Metric                                     | Value |
|--------------------------------------------|-------|
| Correct tactics accuracy                   | 56.8% |
| Proof-level accuracy                       | 18.0% |
| Incorrect proof error identification rate  | 30.1% |

On feedback quality, LeanTutor outperforms simple baselines in both accuracy (3.7 vs. 2.6 on a 5-point scale for error identification) and relevance, as judged on PeanoBench-derived error cases.

5. Dataset Significance in Educational and AI Research

PeanoBench addresses a gap in prior benchmarking for formally-verified AI tutoring and proof assistant systems:

  • It supports fine-grained, stepwise evaluation of NL-formal proof translation in the context of genuine mathematical content, rather than artificial or software-focused domains.
  • The step-aligned parallel annotation, including both correct and systematically flawed proofs, provides unique leverage for studying autoformalization, proof verification, error diagnosis, and NL feedback.
  • The coverage of standard Peano Arithmetic theorems ensures that the dataset is directly relevant to undergraduate-level mathematics education and proof assistants deployed in similar settings.

A plausible implication of this design is that PeanoBench enables targeted benchmarking and improvement of systems that aim to bridge the gap between formal logic and informal pedagogical guidance.

PeanoBench is referenced as a source for model prompts (“Autoformalizer prompt,” “Feedback Generation prompt”) in the LeanTutor system, dictating both input formats and context windows for few-shot learning and error detection. Statistics on proof categories are documented in the appended “Proof Breakdown by Worlds.” Experimental protocols—including the metric definitions such as proof state comparison and relaxed exact-matching—are standardized and described in the LeanTutor appendix.

Given its role in benchmarking the accuracy of translations, error localization, and pedagogical specificity in AI-driven mathematics tutors, PeanoBench stands as a central resource for subsequent research on natural language interfaces to formal proof assistants, automated feedback strategies, and educational applications of theorem proving.