PeanoBench Dataset Overview
- PeanoBench is a human-annotated dataset that aligns natural language proofs with their formal Lean code, enabling research in automated proof tutoring.
- It comprises 371 proofs organized into thematic worlds and three categories: staff solutions, correct student-style proofs, and simulated student errors.
- The dataset’s step-level pairing enables precise evaluation of autoformalization fidelity and error detection, supporting improved pedagogical feedback.
PeanoBench is a human-annotated dataset constructed to support empirical research in automated and AI-assisted mathematical proof tutoring, specifically within Peano Arithmetic. It is designed to facilitate the mapping between natural language mathematical reasoning and formal Lean code, enabling rigorous step-by-step evaluation of proof formalization, error detection, and pedagogical feedback generation.
1. Composition and Structure
PeanoBench contains 371 proofs addressing properties and theorems in Peano Arithmetic, with each proof represented in two modalities: a natural language (NL) version and an equivalent formalization in Lean. The dataset is subdivided into thematically organized “worlds,” such as Addition World and Multiplication World, preserving the progression and difficulty hierarchy inherited from the Natural Number Game 4 (NNG4).
Proofs are distributed across three principal categories:
| Group | Description | Count |
|---|---|---|
| Staff Solutions | Filtered gold-standard proofs from NNG4 | 75 |
| Correct Student Styles | Equation-based and justification-based proofs | 150 |
| Incorrect Student Proofs | Generated by omitting final proof steps | 146 |
Staff solutions serve as the reference standard, while paired student-style proofs (both equation-based and justification-based) model authentic learner submissions. Incorrect proofs are systematically derived by removing key steps, simulating typical student mistakes (73 for each of the two student personas).
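As a rough illustration of this construction, a correct step-paired proof can be truncated before its final step(s); the `(nl_step, lean_tactic)` pair format and the function name `make_incorrect` below are illustrative assumptions, not PeanoBench’s actual schema.

```python
# Hedged sketch of incorrect-proof construction: truncate a correct,
# step-paired proof before its final tactic(s), leaving a proof the
# Lean compiler can no longer close. The tuple format is assumed.
def make_incorrect(proof: list[tuple[str, str]],
                   n_omitted: int = 1) -> list[tuple[str, str]]:
    """proof: list of (nl_step, lean_tactic) pairs of a correct proof."""
    assert 0 < n_omitted < len(proof)
    return proof[:-n_omitted]
```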
2. Provenance and Role in Education
PeanoBench originates from NNG4, an educational game that teaches Peano Arithmetic proofs through a didactic, “worlds”-based progression. This provenance is retained to ensure pedagogical coherence and graded difficulty. Within the LeanTutor system, PeanoBench serves a dual purpose:
- It provides a controlled corpus for benchmarking LLM-based autoformalization, presenting both canonical formalizations and representative learner errors.
- Its granular step mapping allows evaluation of AI systems’ capacity not only to validate complete proofs but also to identify errors at the level of individual steps and to provide targeted feedback.
A plausible implication is that this dataset structure supports deployment and assessment of step-wise guidance algorithms, informing both research and instructional design.
3. Evaluation and Metrics
The associated LeanTutor system introduces custom metrics for autoformalization fidelity, operationalized in two phases:
- Exact tactic-string matching: the autoformalizer’s output is considered faithful if the produced Lean tactic exactly matches the ground-truth tactic string.
- State matching (variable-normalized): if string matching fails, the system accepts tactics that transform the proof state into one equivalent to the ground-truth state (modulo variable renaming). A sketch of this two-phase check follows this list.
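A minimal sketch of the two-phase check, assuming the resulting proof states have already been obtained from the Lean compiler upstream; the function names (`normalize`, `step_is_faithful`) and the single-letter-variable normalization are simplifying assumptions, not LeanTutor’s implementation.

```python
import re

def normalize(state: str) -> str:
    """Canonically rename single-letter variables (a simplification of
    full variable-renaming equivalence)."""
    names: dict[str, str] = {}
    def repl(m: re.Match) -> str:
        return names.setdefault(m.group(0), f"x{len(names)}")
    return re.sub(r"\b[a-z]\b", repl, state)

def step_is_faithful(pred_tactic: str, gold_tactic: str,
                     pred_state: str, gold_state: str) -> bool:
    # Phase 1: exact tactic-string matching.
    if pred_tactic.strip() == gold_tactic.strip():
        return True
    # Phase 2: state matching modulo variable renaming.
    return normalize(pred_state) == normalize(gold_state)

# Different induction variables, equivalent resulting proof states:
assert step_is_faithful("induction b", "induction n",
                        "a + d = d + a", "a + m = m + a")
```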
Accuracy is reported at both the tactic and proof level:
- For correct proofs, every tactic must be faithfully generated for the formalization to count as successful.
- For incorrect proofs, the system must formalize all correct steps and halt at the erroneous step (detected via the Lean compiler), as sketched below.
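Under the same assumptions, the proof-level criterion might look as follows; `proof_is_faithful`, `faithful`, `compiled`, and `error_index` are hypothetical names introduced for illustration, with `compiled` standing in for a per-step Lean compiler check.

```python
# Hedged sketch of the proof-level criterion; reuses step-level
# faithfulness results. error_index marks the first erroneous step
# of an incorrect proof (None for a correct proof).
def proof_is_faithful(faithful: list[bool], compiled: list[bool],
                      error_index: int | None) -> bool:
    if error_index is None:
        # Correct proof: every tactic must be faithfully generated.
        return all(faithful)
    # Incorrect proof: all steps before the error must be faithful, and
    # the Lean compiler must reject the erroneous step, halting there.
    return all(faithful[:error_index]) and not compiled[error_index]
```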
Human evaluation of natural language hints generated for erroneous proofs covers four qualitative axes: Accuracy, Relevance, Readability, and Answer Leakage. On these measures, LeanTutor outperforms a baseline in the accuracy and relevance of error identification and directed hint delivery.
4. Peano Arithmetic Scope
All proofs in PeanoBench address foundational results in Peano Arithmetic, rigorously adhering to axiomatic definitions (including operations and induction principles). The selected theorems, such as the commutativity of addition (a + b = b + a for all natural numbers a and b), provide a principled setting for testing proof formalization and inductive reasoning.
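For concreteness, a minimal Lean 4 proof of this theorem in the tactic style PeanoBench targets; the name `add_comm'` and the lemma names follow Lean’s standard library rather than NNG4, so details may differ from the dataset’s exact formalizations.

```lean
-- Commutativity of addition by induction on b; a sketch in Lean 4
-- using standard-library lemma names (NNG4's may differ).
theorem add_comm' (a b : Nat) : a + b = b + a := by
  induction b with
  | zero => rw [Nat.add_zero, Nat.zero_add]
  | succ n ih => rw [Nat.add_succ, Nat.succ_add, ih]
```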
This focus ensures that autoformalizers and pedagogical feedback models are evaluated in a mathematically structured domain where logic, induction, and equation manipulation are central. The dataset enables analysis of frequent student errors, notably in inductive proofs and arithmetic operations.
5. Step-Level Pairing and Formalization
A distinctive feature of PeanoBench is the explicit pairing of each informal natural language proof step with its semantically equivalent Lean tactic. This supports fine-grained study of translation processes and is leveraged by LeanTutor’s autoformalizer module for step-by-step NL-to-code conversion and error localization.
Lean tactics such as `induction`, `rw` (rewrite), and `rfl` (reflexivity) form the core of step-level formalizations, with tactic dictionaries supplied for in-context prompting. This structure enables systematic investigation of reasoning workflows and proof construction patterns in a well-understood arithmetic domain.
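To illustrate what such a pairing can look like, the sketch below annotates each tactic with its natural language counterpart as a comment; the theorem name `zero_add'` and the comment-based format are illustrative assumptions, not PeanoBench’s actual serialization.

```lean
-- Hypothetical step-level pairing: each NL step (comment) aligns with
-- one Lean tactic. PeanoBench's actual storage format may differ.
theorem zero_add' (n : Nat) : 0 + n = n := by
  -- NL: "We proceed by induction on n."
  induction n with
  -- NL: "Base case: 0 + 0 = 0 holds by definition of addition."
  | zero => rfl
  -- NL: "Inductive step: rewrite 0 + (d + 1) to (0 + d) + 1, then
  --      apply the inductive hypothesis."
  | succ d ih => rw [Nat.add_succ, ih]
```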
6. Mathematical Notation and Translation Dynamics
Sample proofs in PeanoBench employ standard mathematical notation, both informally (e.g., “for all natural numbers a and b, a + b = b + a”) and formally via Lean. The dataset exemplifies the translation challenge between conventional symbolic mathematics and Lean’s tactic-based paradigm, underscoring the complexity and necessity of rigorous NL-to-code alignment.
Although the dataset introduces no novel mathematical notation, its explicit pairing of informal and formal notation highlights the critical role of symbolic equivalence and notation normalization in successful autoformalization.
7. Research Significance and Applications
PeanoBench constitutes a benchmark for assessing AI-driven autoformalization, error detection, and feedback generation systems, with direct applicability in educational domains and mathematically precise reasoning systems. The combination of staff solutions, authentic student proof styles, and systematically constructed mistakes empowers research in AI tutoring, proof assistant development, and formal verification at the intersection of symbolic and natural language processing.
A plausible implication is that PeanoBench informs the design and empirical validation of step-level, instruction-focused models, supporting advances in explainable AI, formal methods, and mathematical pedagogy.