
PeanoBench Dataset Overview

Updated 30 September 2025
  • PeanoBench is a human-annotated dataset that aligns natural language proofs with their formal Lean code, enabling research in automated proof tutoring.
  • It comprises 371 proofs organized into thematic worlds and categories, including staff solutions, correct student styles, and simulated student errors.
  • The dataset’s step-level pairing facilitates precise evaluation of autoformalization fidelity and error detection, aiding in improved pedagogical feedback.

PeanoBench is a human-annotated dataset constructed to support empirical research in automated and AI-assisted mathematical proof tutoring, specifically within Peano Arithmetic. It is designed to facilitate the mapping between natural language mathematical reasoning and formal Lean code, enabling rigorous step-by-step evaluation of proof formalization, error detection, and pedagogical feedback generation.

1. Composition and Structure

PeanoBench contains 371 proofs addressing properties and theorems in Peano Arithmetic, with each proof represented in two modalities: a natural language (NL) version and an equivalent formalization in Lean. The dataset is subdivided into thematically organized “worlds,” such as Addition World and Multiplication World, preserving the progression and difficulty hierarchy inherited from the Natural Numbers Game 4 (NNG4).

Proofs are distributed across three principal categories:

Group                      Description                               Count
Staff Solutions            Filtered gold-standard proofs from NNG4      75
Correct Student Styles     Equation-based and justification-based      150
Incorrect Student Proofs   Generated by omitting final proof steps     146

Staff solutions serve as the reference standard, while paired student-style proofs (both equation-based and justification-based) model authentic learner submissions. Incorrect proofs are derived systematically by removing key steps, simulating typical student mistakes: 73 for each of the two student personas.
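A record in such a dataset can be pictured as a proof annotated with its world, category, and aligned steps. The field names below are illustrative, not the dataset's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ProofStep:
    """One aligned step: an informal sentence paired with its Lean tactic."""
    nl_text: str      # natural language proof step
    lean_tactic: str  # semantically equivalent Lean tactic

@dataclass
class PeanoBenchProof:
    """Illustrative record structure for a PeanoBench-style proof entry."""
    world: str        # thematic world, e.g. "Addition World"
    theorem: str      # theorem statement being proved
    group: str        # "staff" | "equation-based" | "justification-based"
    is_correct: bool  # False for simulated student errors
    steps: list[ProofStep] = field(default_factory=list)

# Example: the opening steps of a commutativity proof.
proof = PeanoBenchProof(
    world="Addition World",
    theorem="a + b = b + a",
    group="staff",
    is_correct=True,
    steps=[
        ProofStep("We proceed by induction on b.", "induction b with d hd"),
        ProofStep("In the base case, a + 0 = a.", "rw [add_zero]"),
    ],
)
```

An incorrect student proof would carry `is_correct=False` with the erroneous or omitted step marking where formalization should halt.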

2. Provenance and Role in Education

PeanoBench originates from NNG4, an educational resource of Peano Arithmetic proofs with a didactic focus and “worlds”-based structure. This provenance is retained to ensure pedagogical coherence and graded difficulty. Within the LeanTutor system, PeanoBench is dual-purpose:

  • It provides a controlled corpus for benchmarking LLM-based autoformalization, presenting both canonical formalizations and representative learner errors.
  • Its granular step mapping allows evaluation of AI systems’ capacity not only to validate complete proofs but also to identify errors at the level of individual steps and to provide targeted feedback.

A plausible implication is that this dataset structure supports deployment and assessment of step-wise guidance algorithms, informing both research and instructional design.

3. Evaluation and Metrics

The associated LeanTutor system introduces custom metrics for autoformalization fidelity, operationalized in two phases:

  • Exact tactic-string matching: The autoformalizer output is considered faithful if the produced Lean tactic matches the ground truth string.
  • State-matching (variable-normalized): If string matching fails, the system accepts tactics that transform proof states into those equivalent to ground truth states (modulo variable renaming).
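The two-phase check can be sketched as follows. The variable normalization here is a deliberate simplification (single-letter renaming via regex), standing in for the system's actual proof-state comparison:

```python
import re

def normalize_variables(state: str) -> str:
    """Rename variables to canonical placeholders so that proof states
    differing only in variable names compare equal. Illustrative: treats
    single lowercase letters as the renameable variables."""
    names = []
    def canon(match):
        name = match.group(0)
        if name not in names:
            names.append(name)
        return f"v{names.index(name)}"
    return re.sub(r"\b[a-z]\b", canon, state)

def is_faithful(generated_tactic, gold_tactic, generated_state, gold_state):
    """Phase 1: exact tactic-string match.
    Phase 2: fall back to variable-normalized proof-state comparison."""
    if generated_tactic.strip() == gold_tactic.strip():
        return True
    return normalize_variables(generated_state) == normalize_variables(gold_state)

# Different tactic text, but the resulting states match up to renaming:
print(is_faithful("induction n with k hk", "induction m with d hd",
                  "k : N |- a + succ k = succ (a + k)",
                  "d : N |- a + succ d = succ (a + d)"))  # True
```

Accepting state-equivalent tactics prevents the metric from penalizing formalizations that are correct but choose different hypothesis names than the ground truth.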

Accuracy is reported at both the tactic and proof level:

  • For correct proofs, all tactics must be faithfully generated to consider a formalization successful.
  • For incorrect proofs, the system must formalize all correct steps and halt at the error (detected via Lean compiler).
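Those two proof-level criteria can be restated as a small scoring function (an illustrative restatement, not the system's actual implementation):

```python
def proof_level_success(faithful, is_correct_proof,
                        halted_at=None, error_step=None):
    """Proof-level accuracy rule, restated.
    faithful        : list of booleans, one per formalized step
    halted_at       : index where formalization stopped (Lean compiler error)
    error_step      : ground-truth index of the injected student error
    """
    if is_correct_proof:
        # Every tactic must be faithfully generated.
        return all(faithful)
    # All steps before the error must be faithful, and formalization
    # must halt exactly at the erroneous step.
    return all(faithful[:error_step]) and halted_at == error_step

# A 4-step incorrect proof whose third step (index 2) is the error:
print(proof_level_success([True, True], is_correct_proof=False,
                          halted_at=2, error_step=2))  # True
```

Halting too early or too late both count as failures for incorrect proofs, which is what makes the metric sensitive to error localization rather than mere error detection.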

Human evaluation of natural language hints generated for erroneous proofs is conducted along four qualitative axes: Accuracy, Relevance, Readability, and Answer Leakage, with LeanTutor outperforming a baseline in accuracy and relevance of error identification and directed hint delivery.

4. Peano Arithmetic Scope

All proofs in PeanoBench address foundational results in Peano Arithmetic, rigorously adhering to axiomatic definitions (including operations and induction principles). The selected theorems, such as the commutativity of addition (a + b = b + a for all a, b ∈ ℕ), provide a principled setting for testing proof formalization and inductive reasoning.
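In NNG4-style Lean, the commutativity theorem and a typical inductive proof look roughly like this (a sketch following the game's lemma names, not a verbatim dataset entry):

```lean
theorem add_comm (a b : ℕ) : a + b = b + a := by
  induction b with
  | zero => rw [add_zero, zero_add]
  | succ d hd => rw [add_succ, succ_add, hd]
```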

This focus ensures that autoformalizers and pedagogical feedback models are evaluated in a mathematically structured domain where logic, induction, and equation manipulation are central. The dataset enables analysis of frequent student errors, notably in inductive proofs and arithmetic operations.

5. Step-Level Pairing and Formalization

A distinctive feature of PeanoBench is the explicit pairing of each informal natural language proof step with its semantically equivalent Lean tactic. This supports fine-grained study of the translation process and is leveraged by LeanTutor’s autoformalizer module for step-by-step NL-to-code conversion and error localization.

Lean tactics such as induction, rw (rewrite), and rfl (reflexivity) form the core of step-level formalizations, with tactic dictionaries supplied for in-context prompting. This structure enables systematic investigation of reasoning workflows and proof construction patterns in a well-understood arithmetic domain.
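The step-level pairing can be visualized by interleaving the informal sentences as comments above their corresponding tactics (an illustrative alignment, not a verbatim dataset entry):

```lean
theorem succ_add (a b : ℕ) : Nat.succ a + b = Nat.succ (a + b) := by
  -- "We proceed by induction on b."
  induction b with
  | zero =>
    -- "In the base case, both sides reduce to succ a."
    rfl
  | succ d hd =>
    -- "Unfold the successor of the sum, then apply the
    --  induction hypothesis."
    rw [add_succ, hd, add_succ]
```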

6. Mathematical Notation and Translation Dynamics

Sample proofs in PeanoBench employ standard mathematical notation, both informally (e.g., “for all a, b ∈ ℕ, a + b = b + a”) and formally via Lean. The dataset exemplifies the translation challenge between conventional symbolic mathematics and Lean’s tactic-based paradigm, underscoring the complexity and necessity of rigorous NL-to-code alignment.

Although the dataset does not contribute novel LaTeX formulas, the presence and pairing of notation highlight the critical role of symbolic equivalence and notation normalization in successful autoformalization.

7. Research Significance and Applications

PeanoBench constitutes a benchmark for assessing AI-driven autoformalization, error detection, and feedback generation systems, with direct applicability in educational domains and mathematically precise reasoning systems. The combination of staff solutions, authentic student proof styles, and systematically constructed mistakes empowers research in AI tutoring, proof assistant development, and formal verification at the intersection of symbolic and natural language processing.

A plausible implication is that PeanoBench informs the design and empirical validation of step-level, instruction-focused models, supporting advances in explainable AI, formal methods, and mathematical pedagogy.
