
LeanTutor: Verified AI Proof Tutoring

Updated 30 June 2025
  • LeanTutor is a formally verified AI tutoring system for interactive, natural-language mathematical proof learning.
  • It integrates modules for autoformalization, proof checking, next-step generation, and natural language feedback, all grounded in Lean verification.
  • Empirical evaluation on PeanoBench shows superior autoformalization accuracy and more actionable instructional guidance than LLM-only baselines.

LeanTutor is a formally verified AI tutoring system for the interactive learning of mathematical proofs in natural language. Its integrated modules translate student natural language (NL) proof steps into Lean proof scripts, verify correctness by compiling the resulting Lean code, and generate both valid next tactics and pedagogically informed feedback. Evaluated on PeanoBench, a dataset of Peano arithmetic proofs, the system empirically outperforms LLM-only baselines in both autoformalization accuracy and instructional guidance.

1. Modular System Architecture

LeanTutor consists of three principal, tightly integrated modules:

  1. Autoformalizer/Proof-Checker: Translates students' NL proof steps into Lean tactics via LLM prompting (“autoformalization”) and checks correctness by compiling the resulting Lean code in the Lean proof assistant. Each step is formalized and checked as soon as it is received. On compilation failure, whether from a tactic error or an unsolved goal, the first erroneous step is identified and further translation and verification halt at that point.
  2. Next-Step Generator: Upon identifying an incorrect or incomplete proof, this component produces valid next Lean tactics for the student. LLM-based candidate generation is combined with formal proof search, filtering out invalid directions (e.g., using the theorem being proved as a lemma or revisiting prior states). The approach involves generating up to 12 candidate tactics, ranking by likelihood, and using bounded proof search (up to depth 8) to select the next tactic that provably advances the proof.
  3. Natural Language Feedback Generator: Produces instructional feedback drawing on Lean proof state, error messages, and the recommended next tactic. It uses pedagogically motivated templates to provide error identification, guiding questions, hints, and (when explicitly requested) a natural-language explanation of the next formal step, eschewing Lean code for student accessibility.

This architecture allows LeanTutor to interact dialogically: students provide stepwise natural language arguments, each step is formally translated and checked, errors are localized with precise feedback, and next-step guidance is offered grounded in both the logical and pedagogical context.
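The dialogic loop described above can be sketched as follows. This is a minimal illustration under assumed interfaces; the function names (`autoformalize`, `check_in_lean`, and so on) are hypothetical placeholders, not an API published with the paper.

```python
# Sketch of the LeanTutor step loop (hypothetical component names;
# each component is passed in as a callable).

def tutor_step_loop(nl_steps, autoformalize, check_in_lean,
                    suggest_next_tactic, render_feedback):
    """Formalize and verify each student step; stop at the first error."""
    verified_tactics = []
    for i, nl_step in enumerate(nl_steps):
        # Translate the student's NL step into a candidate Lean tactic.
        tactic = autoformalize(nl_step, context=verified_tactics)
        # Verify by compiling the proof script extended with the new tactic.
        ok, proof_state = check_in_lean(verified_tactics + [tactic])
        if not ok:
            # First erroneous step found: halt translation and produce
            # guidance grounded in the formal proof state.
            next_tactic = suggest_next_tactic(verified_tactics, proof_state)
            return render_feedback(step_index=i, state=proof_state,
                                   suggestion=next_tactic)
        verified_tactics.append(tactic)
    return "All steps verified."
```

The key design choice this sketch captures is that verification is step-granular: the loop never formalizes past the first failing step, which is what makes precise error localization possible.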

2. Autoformalization and Verification

The autoformalizer operates through a context-rich prompting scheme:

  • Input context: Each prompt to the LLM includes the theorem statement both in NL and formal Lean, lists of all available theorems and tactics with NL explanations, and five example step-level translations (few-shot). For increased performance, staff solution proofs can be included.
  • Verifying correctness: After formalization, compiled Lean code is accepted if only unsolved goals remain; any Lean compilation error (e.g., “unknown tactic,” “unexpected identifier”) halts progress and signals an incorrect step.
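The acceptance rule above amounts to classifying the compiler's diagnostics: a step passes when Lean reports at most "unsolved goals" (the proof is merely unfinished), and fails on any genuine compilation error. A minimal sketch, assuming diagnostics arrive as message strings (the exact Lean message formats are illustrative):

```python
# Messages treated as "the proof is merely incomplete", not wrong.
ACCEPTABLE = ("unsolved goals",)

def step_is_accepted(lean_messages):
    """Return True iff every compiler message is an acceptable one."""
    return all(
        any(msg.startswith(ok) for ok in ACCEPTABLE)
        for msg in lean_messages
    )
```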

Evaluation Metrics

  • Faithful autoformalization is judged by a two-phase criterion: (1) an exact string match against the reference tactic, and (2) if the strings differ, comparison of the resulting Lean proof states for logical equivalence, disregarding variable naming.
  • Step-level evaluation enables precise error localization, which underpins targeted instructional feedback.
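The two-phase criterion can be sketched as below. The proof-state comparison here is a deliberate simplification (identifiers are renamed to canonical placeholders via a regex); the paper's actual comparison operates on Lean proof states.

```python
import re

def normalize_state(state: str) -> str:
    """Rename lowercase identifiers to canonical placeholders x0, x1, ..."""
    names = {}
    def rename(m):
        return names.setdefault(m.group(0), f"x{len(names)}")
    return re.sub(r"\b[a-z]\w*\b", rename, state)

def is_faithful(generated_tactic, gold_tactic,
                state_after_gen, state_after_gold):
    # Phase 1: exact string match on the tactic itself.
    if generated_tactic.strip() == gold_tactic.strip():
        return True
    # Phase 2: compare resulting proof states up to variable renaming.
    return normalize_state(state_after_gen) == normalize_state(state_after_gold)
```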

Empirically, including staff solution proofs (“ground truth” expert demonstrations) in the prompt improves correct tactic generation to 56.8% (±3.2%) and yields accurate identification of the incorrect step in 30.1% (±7.4%) of incorrect proofs on PeanoBench.

3. Next-Step Generation

The next-step generator supplements formal verification with LLM-guided proof search:

  • Given the current proof state up to the last correct step, the LLM generates a ranked list of candidate Lean tactics contextualized by available theorems and tactics (for the appropriate “world” or corpus).
  • Each candidate is appended and compiled in Lean; only those passing compilation and progressing towards the goal are considered.
  • Proof search is bounded both to avoid cycles (previous proof states) and forbidden actions (such as importing the very theorem being proved).
  • The first candidate leading to proof completeness is selected as the recommended next tactic.
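The search procedure above can be sketched as a depth-bounded recursion over LLM-ranked candidates. The interfaces (`rank_candidates`, `apply_tactic`, `is_solved`) are assumed; the candidate cap of 12 and depth bound of 8 are the values reported in the text.

```python
MAX_CANDIDATES = 12  # candidates per proof state, as reported
MAX_DEPTH = 8        # proof-search depth bound, as reported

def find_next_tactic(state, rank_candidates, apply_tactic, is_solved,
                     forbidden, depth=0, seen=None):
    """Return the first ranked tactic that provably leads to a finished proof."""
    seen = set() if seen is None else seen
    if depth >= MAX_DEPTH:
        return None
    for tactic in rank_candidates(state)[:MAX_CANDIDATES]:
        if tactic in forbidden:  # e.g. invoking the theorem being proved
            continue
        new_state = apply_tactic(state, tactic)
        if new_state is None or new_state in seen:  # compile failure or cycle
            continue
        if is_solved(new_state):
            return tactic
        # Recurse: keep this tactic only if the proof can be finished from here.
        if find_next_tactic(new_state, rank_candidates, apply_tactic,
                            is_solved, forbidden,
                            depth + 1, seen | {new_state}) is not None:
            return tactic
    return None
```

Because a candidate is only recommended when the bounded search completes the proof behind it, the suggested tactic is guaranteed (within the depth bound) to lie on a viable proof path.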

This design ensures next-step guidance is both linguistically and formally aligned with the exact proof trajectory, avoiding unproven or logically spurious directions.

4. Feedback and Instructional Guidance

The feedback generator leverages formal data and pedagogical patterns:

  • Error identification: Cites the specific logical mistake (e.g., neglecting to invoke the inductive hypothesis in induction proofs).
  • Guiding questions/hints: Asks leading, structure-revealing questions that foster mathematical reasoning, such as “What theorem lets you transform a + (b + succ d)?”.
  • Bottom-out hints: Upon student request, provides the next necessary transformation described in clear mathematical language.
  • Prompting constraints: Feedback is generated to avoid referencing Lean tactics or syntax, instead using mathematical equations, definitions, and pedagogically sound phrasing.
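For concreteness, the formal step behind the guiding question “What theorem lets you transform a + (b + succ d)?” can be written out in Lean. The lemma name follows Lean 4 core (`Nat.add_succ : n + succ m = succ (n + m)`); PeanoBench's exact naming conventions may differ.

```lean
-- Rewriting the inner term b + succ d to succ (b + d), then closing by rfl.
example (a b d : Nat) : a + (b + Nat.succ d) = a + Nat.succ (b + d) := by
  rw [Nat.add_succ b d]
```

The tutor's feedback would describe this same transformation in mathematical language ("adding the successor of d is the successor of adding d") rather than exposing the Lean tactic.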

In human evaluation, LeanTutor’s feedback demonstrates higher accuracy (3.7/5), relevance (3.6/5), and readability (4.7/5) than LLM-only baselines, with controlled answer leakage except where full next-step explanations are specifically requested.

5. Empirical Evaluation (PeanoBench)

LeanTutor is evaluated using PeanoBench, a dataset of 371 Peano arithmetic proofs annotated stepwise in both natural language and Lean. Main experimental findings are as follows:

| Experiment                         | Correct Tactics | Correct Proofs | Incorrect Proofs |
|------------------------------------|-----------------|----------------|------------------|
| Baseline                           | 32.9% (±3.1%)   | 6.7% (±4.0%)   | 14.4% (±5.7%)    |
| Baseline + Staff Solution          | 56.8% (±3.2%)   | 18.0% (±6.1%)  | 30.1% (±7.4%)    |
| Baseline + Staff Sol., Whole Proof | 51.8% (±3.3%)   | 26.7% (±7.0%)  | 21.9% (±6.7%)    |

Stepwise, staff-solution-guided formalization produces the most faithful Lean tactic generation and error identification.

| Feedback Type | Accuracy | Relevance | Readability | Answer Leakage |
|---------------|----------|-----------|-------------|----------------|
| Baseline      | 2.6      | 2.7       | 4.8         | 4.7            |
| LeanTutor     | 3.7      | 3.6       | 4.7         | 4.9            |

Scores are on a 5-point scale.

LeanTutor’s natural language feedback outperforms baseline LLM-generated hints in accuracy and relevance, without sacrificing readability.

6. Design Innovations and Future Directions

LeanTutor integrates LLM reasoning with theorem-prover verification at the granularity of single proof steps, providing a pipeline for pedagogically rigorous, formally reliable mathematical proof tutoring. Key innovations include:

  • Stepwise, context-rich autoformalization that balances LLM flexibility with Lean’s proof rigor.
  • Formal verification at each stage of student progress, with immediate and targeted feedback.
  • LLM-based next-step generation constrained and validated through proof search and logical admissibility.
  • Feedback that translates formal proof concepts into accessible, instructional natural language.

Future developments aim to reduce the need for staff-authored solutions (greater model autonomy), improve the robustness of autoformalization in varied domains, support small-model deployment for scalability, and expand evaluation into more complex mathematical domains and broader, less-structured proof-writing tasks.

LeanTutor thus constitutes a significant advancement in the integration of LLMs and theorem proving for education, establishing concrete benchmarks for correctness, instructional value, and ability to advance student proof construction in interactive tutoring contexts.