Autoformalization & Theorem Proving
- Autoformalization is the process of converting informal mathematical language into formal, machine-checkable code using proof assistants like Lean, Coq, or Isabelle/HOL.
- Iterative feedback from syntactic checks and semantic evaluations ensures that formalizations achieve both correctness and alignment with the original mathematical intent.
- Multi-agent architectures leveraging LLM transformations, reinforcement learning, and tool integration drive significant improvements in pass rates and verification accuracy.
Autoformalization is the automated process of translating informal mathematical language—typically natural language or LaTeX-styled mathematical text—into machine-verifiable code in the formal language of a proof assistant such as Lean, Coq, or Isabelle/HOL. When paired with theorem proving, autoformalization both produces formal theorems and provides the necessary interface for interactive and automated proof search, thereby enabling end-to-end verification of mathematical content. The rapid evolution of LLMs and integrated neuro-symbolic pipelines over the last several years has driven major advances in this area, yielding powerful systems that combine modular LLM-based transformation with rigorous symbolic verification, iterative refinement, and rich evaluation protocols.
1. Definitions and Problem Setting
Autoformalization is formally defined as the transformation

$$ A : \mathcal{I} \longrightarrow \mathcal{F}, $$

where $\mathcal{I}$ denotes the space of informal or semi-formal mathematical inputs (natural language, LaTeX, textbook statements), and $\mathcal{F}$ is the set of well-formed, machine-checkable formal outputs in a given proof assistant’s language (e.g., Lean, Isabelle/HOL, Coq) (Mensfelt et al., 11 Sep 2025). The resulting formal objects must (i) pass the syntactic checks of the proof assistant (parse, elaborate, type check) and (ii) satisfy a semantic equivalence criterion (faithful representation of the mathematical meaning).
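To make the mapping concrete, here is a minimal sketch of one input–output pair (Lean 4, assuming Mathlib; the theorem name and proof are illustrative): the informal statement must be mapped to code that both type checks and captures the intended meaning.

```lean
import Mathlib

-- Informal input in 𝓘: "The sum of two even integers is even."
-- Formal output in 𝓕: a Lean 4 statement using Mathlib's `Even`
-- (where `Even a ↔ ∃ r, a = r + r`), together with a proof.
theorem even_add_even (a b : ℤ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  obtain ⟨m, hm⟩ := ha
  obtain ⟨n, hn⟩ := hb
  exact ⟨m + n, by rw [hm, hn]; ring⟩
```

Note that even this toy case requires choices the informal statement leaves open: the ambient type (ℤ rather than ℕ or ℝ) and the library's particular definition of evenness.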
In the context of theorem proving, autoformalization sits at the front of the machine reasoning stack: producing formal definitions, statements, and goals that can be dispatched to downstream proof search, tactic generation, premise retrieval, or automated deduction engines. For maximal rigor and scalability, this translation must account for the idiosyncrasies of mathematical English, domain-specific notations, and the typical presence of implicit background assumptions (Li et al., 2024, Weng et al., 29 May 2025).
2. Architectures and Multi-Agent Systems
Classic neural autoformalization models adopted sequence-to-sequence RNN or Transformer architectures that mapped informal statements to formal code in a single pass. However, the state of the art has moved toward modular, multi-agent, iterative LLM-driven systems that decompose the transformation pipeline into specialized roles, each responsible for a distinct facet of the conversion (Zhang et al., 10 Oct 2025).
MASA (Multi-Agent System for Autoformalization) exemplifies this paradigm by orchestrating several agent types:
- AutoformalizationAgent: Generates zero- to few-shot formalization drafts from natural language.
- HardCritiqueAgent: Submits code to the theorem prover (Lean4, Isabelle/HOL), returns syntactic correctness and error details.
- FormalRefinementAgent: Uses LLMs to repair code in response to prover error messages.
- SoftCritiqueAgent: Judges semantic alignment (faithfulness, completeness) using LLM-based evaluators.
- InformalRefinementAgent: Improves semantic alignment based on soft-critique output.
- ToolAgents: Apply deterministic edits (e.g., adding missing imports).
- KnowledgeBase and Retriever: Provide retrieval-augmented context (imports/theorems).
- TheoremProver modules: Handle verification and error feedback from proof assistants.
Agents interact in a standardized, pipeline-driven protocol, passing JSON-like message records that encapsulate the informal input, the current formalization, correctness flags, and both syntactic and semantic critiques. Iteration alternates hard (syntactic) and soft (semantic) corrective feedback until a formalization passes all checks or a maximum iteration budget is reached (Zhang et al., 10 Oct 2025).
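The alternating hard/soft feedback loop can be sketched in Python. The agent names follow MASA, but the callable interfaces and record fields below are assumptions for illustration, not the system's actual API:

```python
def masa_loop(informal, agents, max_iters=3):
    """Alternate hard (prover) and soft (LLM judge) feedback until both pass.

    `agents` is a dict of callables standing in for MASA's agents; their
    signatures here are illustrative placeholders.
    """
    record = {
        "informal": informal,
        "formal": agents["autoformalize"](informal),  # zero/few-shot draft
        "syntactic_ok": False,
        "semantic_ok": False,
    }
    for _ in range(max_iters):
        # Hard critique: submit to the theorem prover, get errors back.
        ok, errors = agents["hard_critique"](record["formal"])
        record["syntactic_ok"] = ok
        if not ok:
            record["formal"] = agents["formal_refine"](record["formal"], errors)
            continue
        # Soft critique: LLM judge assesses semantic alignment.
        aligned, critique = agents["soft_critique"](informal, record["formal"])
        record["semantic_ok"] = aligned
        if aligned:
            break
        record["formal"] = agents["informal_refine"](record["formal"], critique)
    return record
```

In a real deployment the hard critique wraps a Lean 4 or Isabelle/HOL invocation and the soft critique an LLM evaluator; here both are stubbed.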
3. Iterative Refinement, Feedback, and Quality Optimization
Advanced autoformalization systems rely on iterative, feedback-driven self-improvement to achieve high-quality formalizations. Two core dimensions of feedback are exploited:
- Syntactic feedback: Provided by the proof assistant’s parser/type-checker (rejecting ill-formed or type-incorrect output).
- Semantic feedback: Provided by LLM-based judges, which assess whether the formal code faithfully captures the informal meaning, typically in terms of alignment faithfulness (AF) and formalization correctness (FC).
Monotonic Reference-Free Refinement introduces a masked composite objective over four independent quality metrics—Formal Validity (prover acceptance), Logical Preservation (semantic consistency), Mathematical Consistency, and Formal Quality—optimized by alternating generator agents (LLMs with various roles) that drive improvement on specific dimensions according to an adaptive responsiveness map (Zhang et al., 30 Jan 2026).
At each iteration the process only accepts candidate formalizations whose aggregate score (necessarily including formal validity as a hard mask) strictly improves over the previous best; this enforces certified monotonic progress and guarantees convergence. The empirical results show simultaneous increases in both syntactic pass rate (up to 93.44% on miniF2F) and composite semantic score (to 78.22%)—substantially outperforming naive single-pass or non-acceptance-gated refinement (Zhang et al., 30 Jan 2026).
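The acceptance-gated loop admits a compact sketch. The metric names follow the paper, but the scoring callables and aggregation below are illustrative placeholders, not the published objective:

```python
def monotonic_refine(candidate_fn, score_fns, x0, max_iters=10):
    """Acceptance-gated refinement: a candidate replaces the incumbent only
    if its masked aggregate score strictly improves, so the best score is
    non-decreasing by construction."""
    def aggregate(scores):
        # Formal Validity acts as a hard mask: a prover-rejected
        # formalization scores zero regardless of other metrics.
        if scores["formal_validity"] == 0:
            return 0.0
        return sum(scores.values()) / len(scores)

    best = x0
    best_score = aggregate({k: f(x0) for k, f in score_fns.items()})
    for _ in range(max_iters):
        cand = candidate_fn(best)  # generator agent proposes a revision
        s = aggregate({k: f(cand) for k, f in score_fns.items()})
        if s > best_score:  # strict improvement => certified monotonic progress
            best, best_score = cand, s
    return best, best_score
```

Because the aggregate is bounded and every accepted step strictly increases it, the loop cannot oscillate; rejected candidates simply leave the incumbent in place.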
4. Evaluation Metrics and Benchmarks
Robust evaluation in autoformalization is multi-layered, consisting of:
- Syntactic correctness (Pass Rate): Fraction of outputs accepted by the proof assistant parser/type-checker.
- n-gram-based metrics (BLEU, ChrF, RUBY): Proxy for overlap with ground-truth formalizations, used as a coarse measure of semantic similarity (Zhang et al., 10 Oct 2025, Azerbayev et al., 2023).
- LLM-based Judgment (AF/FC): Alignment faithfulness and formalization correctness rated by strong LLMs (Zhang et al., 10 Oct 2025).
- Composite/fuzzy scoring: Finer-grained metrics such as LeanScorer (Sugeno fuzzy integral) (Xuejun et al., 8 Jun 2025), which aggregate subtask and subcomponent alignment for nuanced acceptance.
- Automated proof synthesis: Full downstream proof search/validation on the generated formal statements.
- GTED (Generalized Tree Edit Distance): Tree-structural distance between formal outputs and gold benchmarks, interpolating between weak string-based and overly strict proof equivalence (Liu et al., 10 Jul 2025).
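The contrast between surface metrics and prover-based acceptance can be made concrete with a simplified sketch: a crude n-gram overlap proxy (deliberately simpler than real BLEU/ChrF, which add smoothing, brevity penalties, and character-level matching) alongside a syntactic pass rate:

```python
from collections import Counter

def ngram_overlap(candidate, reference, n=2):
    """Simplified n-gram precision between whitespace-tokenized strings —
    a coarse stand-in for BLEU/ChrF-style surface similarity."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    c, r = grams(candidate.split()), grams(reference.split())
    total = sum(c.values())
    return sum((c & r).values()) / total if total else 0.0

def pass_rate(outputs, prover_accepts):
    """Syntactic pass rate: fraction of outputs the proof assistant accepts.
    `prover_accepts` stands in for a real Lean/Isabelle check."""
    return sum(map(prover_accepts, outputs)) / len(outputs)
```

The two measures can disagree arbitrarily: a formalization can share most n-grams with the gold answer yet fail to type check, or pass the prover while stating something subtly different — which is exactly the gap LLM-based judgment and GTED aim to close.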
Key benchmarks include miniF2F (Olympiad-level, Lean/Isabelle), ProofNet (undergraduate, Lean3), and large synthetic or curriculum-augmented sets (e.g., Numina-ATF, uproof) (Azerbayev et al., 2023, Guo et al., 8 Oct 2025, Huang et al., 26 Aug 2025).
Representative results from MASA indicate a ~86% pass rate (syntactic acceptance) on miniF2F via hard critique and formal refinement. With GPT-4.1-mini, iterative self-refinement yields 61.9% of sampled outputs that are both syntactically correct and semantically aligned after three iterations, greatly exceeding open-model baselines (Zhang et al., 10 Oct 2025).
5. Autoformalization–Theorem Proving Integration
State-of-the-art systems bind autoformalization tightly to proof search and verification, closing the loop on both data generation and formal reasoning:
- Theorem Prover as a Judge (TP-as-a-Judge): The autoformalizer’s output is not only assessed by syntactic correctness, but each intermediate reasoning step is formally verified in an interactive theorem prover (e.g., Lean). This feedback is recycled for RL training, supervised fine-tuning, and preference optimization, yielding further accuracy gains (Leang et al., 18 Feb 2025).
- Autoformalizer with Tool Feedback (ATF): Integrates Lean compiler calls for syntax checking and ensemble LLMs for semantic checks. The protocol alternates generation with tool calls, refining until both checks succeed, with expert iteration and preference optimization pushing performance further (Guo et al., 8 Oct 2025).
- Reinforcement Learning (GRPO, DPO, RLTPF): Directly incorporates prover- or LLM-based binary or graded rewards for formalization, enabling data-efficient improvement even with little or no supervised data (Xuejun et al., 8 Jun 2025, Huang et al., 26 Aug 2025, Leang et al., 18 Feb 2025).
- End-to-End Proof Pipelines: Multi-stage systems (e.g., Mathesis) couple an LLM-based autoformalizer (RL-optimized) with a formal proof generator trained by expert iteration, achieving pass@k rates up to 96% on miniF2F and 71% on the challenging Gaokao-Formal dataset (Xuejun et al., 8 Jun 2025).
- Benchmark-Oriented Data Synthesis: Frameworks such as HunyuanProver leverage high-throughput autoformalization for large-scale dataset construction, then apply iterative proof search with advanced search heuristics and critics to build a self-improving prover (Li et al., 2024).
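These reward designs share a common shape: the prover supplies a hard correctness gate, optionally blended with a graded LLM semantic score. A minimal Python sketch, where the callable interfaces and the weighting scheme are illustrative rather than any single paper's formulation:

```python
def formalization_reward(formal_code, prover_check, llm_judge=None,
                         semantic_weight=0.5):
    """Prover-grounded reward for RL fine-tuning of an autoformalizer.

    `prover_check` stands in for a theorem-prover invocation (bool);
    `llm_judge`, if given, returns a semantic alignment score in [0, 1].
    """
    if not prover_check(formal_code):
        return 0.0  # hard gate: ill-typed output earns nothing
    if llm_judge is None:
        return 1.0  # binary, prover-only reward
    # Graded reward: prover acceptance plus a weighted semantic score.
    return (1 - semantic_weight) + semantic_weight * llm_judge(formal_code)
```

The hard gate is what makes the signal trustworthy: the LLM judge can only modulate the reward of formalizations the prover has already accepted, never rescue a rejected one.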
6. Challenges, Limitations, and Future Research
Despite substantial gains, several challenges persist:
- Implicit Assumptions and Context Omission: Mathematical English omits hypotheses and background, often requiring library-specific imports, typeclass arguments, and a high degree of implicit domain expertise. Library mismatches between NL and formal notation are a persistent obstacle (Mensfelt et al., 11 Sep 2025, Azerbayev et al., 2023).
- Semantic Evaluation and Trustworthiness: n-gram and pass-rate proxies are only weakly correlated with true semantic correctness. Full semantic checks require proof synthesis or robust proof-irrelevant alignment metrics (e.g. GTED), but these remain computationally expensive (Liu et al., 10 Jul 2025, Li et al., 2024).
- Scalability of Retrieval and Orchestration: As theorem libraries scale into tens of millions, efficient retrieval (for lemma and import augmentation) and dynamic agent orchestration become key bottlenecks (Zhang et al., 10 Oct 2025).
- End-to-End Automation for Research-Level Mathematics: Research-level statements often depend on complex, domain-specific definitions and extensive context, rendering direct autoformalization brittle without modularization (e.g. unlinked formalization, entity linking, type correction) (Patel et al., 2023).
- Human–AI Collaboration: Current pipelines are mostly static and non-interactive; adaptive, user-steerable autoformalization remains an open research direction (Mensfelt et al., 11 Sep 2025).
- Tool Integration and Latency: Frequent theorem prover invocation and LLM-based semantic checking introduce significant computation cost; distillation and lightweight surrogate models are under investigation (Guo et al., 8 Oct 2025).
- Robustness across Formal Systems: Most models are specific to a single proof assistant (Lean, Isabelle, Coq); generalization and transfer across systems is largely unsolved (Weng et al., 29 May 2025, Mensfelt et al., 11 Sep 2025).
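The implicit-assumption problem above shows up even in one-line statements. A minimal Lean 4 sketch (assuming Mathlib; the theorem name is illustrative): the informal claim "x² ≥ 0" is silent about which structure x inhabits, yet the formalization cannot be stated without supplying it explicitly.

```lean
import Mathlib

-- "For any x, x² ≥ 0" — true over ℝ or ℤ, but the informal statement
-- never names the ambient structure. The formal version must make the
-- ordered-ring hypothesis, implicit in mathematical English, explicit:
theorem sq_nonneg' {α : Type*} [LinearOrderedRing α] (x : α) :
    0 ≤ x ^ 2 :=
  sq_nonneg x
```

Choosing the wrong typeclass (or omitting it) produces a statement that is either unprovable or not the one the author meant — the core failure mode behind many library-mismatch errors.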
Promising research avenues include interactive refinement, cross-assistant transfer, deep reinforcement learning from tool and prover feedback, more principled composite quality scoring, and scalable retrieval augmentation.
7. Summary Table of Methodological Advances
| System / Approach | Key Innovation | Outcome/Result |
|---|---|---|
| MASA (Zhang et al., 10 Oct 2025) | Modular multi-agent pipeline | ~86% syntactic pass, 61.9% aligned (3 iters) |
| Monotonic Reference-Free Refinement (Zhang et al., 30 Jan 2026) | Acceptance-gated, multi-metric refinement | 93.4% pass, 78.2% composite (miniF2F) |
| ATF (Guo et al., 8 Oct 2025) | Compiler+multi-LLM iterative tool feedback | +9–29 pp semantic gain, human-aligned |
| Mathesis (Xuejun et al., 8 Jun 2025) | RL-trained autoformalizer + expert-iteration prover | +22% pass on Gaokao-Formal, 64% miniF2F@32 |
| TP-as-a-Judge (Leang et al., 18 Feb 2025) | Theorem prover as RL preference oracle | +20% acc. MultiArith, up to 87% Lean exec. |
| FormaRL (Huang et al., 26 Aug 2025) | RL with only unlabeled data | 26.15% pass@1 ProofNet (vs. 4.04% SFT) |
| SITA (Li et al., 13 Nov 2025) | Structure-to-instance, template-guided Lean formalization | 57.1% full file success (R1), 76.9 MV score |
References
- (Zhang et al., 10 Oct 2025) MASA: LLM-Driven Multi-Agent Systems for Autoformalization
- (Guo et al., 8 Oct 2025) Autoformalizer with Tool Feedback
- (Mensfelt et al., 11 Sep 2025) Towards a Common Framework for Autoformalization
- (Huang et al., 26 Aug 2025) FormaRL: Enhancing Autoformalization with no Labeled Data
- (Liu et al., 10 Jul 2025) Generalized Tree Edit Distance (GTED): A Faithful Evaluation Metric for Statement Autoformalization
- (Zhang et al., 30 Jan 2026) Monotonic Reference-Free Refinement for Autoformalization
- (Xuejun et al., 8 Jun 2025) Mathesis: Towards Formal Theorem Proving from Natural Languages
- (Li et al., 13 Nov 2025) SITA: A Framework for Structure-to-Instance Theorem Autoformalization
- (Li et al., 2024) HunyuanProver: A Scalable Data Synthesis Framework and Guided Tree Search
- (Azerbayev et al., 2023) ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics
- (Patel et al., 2023) A New Approach Towards Autoformalization
- (Li et al., 2024) A Survey on Deep Learning for Theorem Proving
- (Weng et al., 29 May 2025) Autoformalization in the Era of LLMs: A Survey
- (Zhou et al., 2024) Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization
- (Wu et al., 2022) Autoformalization with LLMs
This confluence of LLM-driven transformation, iterative feedback, modular agent design, and rigorous tool integration currently defines the research frontier in autoformalization and theorem proving. The field continues to advance rapidly as both data and algorithms improve.