Autoformalization Models

Updated 9 November 2025
  • Autoformalization models are systems that translate human-written informal math proofs into precise, machine-checkable formats suitable for proof assistants.
  • They combine sequence-to-sequence neural architectures with retrieval augmentation and multi-task learning to enhance the synthesis of formal statements and proofs.
  • Their application accelerates the construction of formal libraries and bolsters AI trustworthiness by ensuring semantic fidelity in mathematical reasoning.

Autoformalization models address the transformation of informal mathematical statements and proofs, typically authored in natural language by human mathematicians, into precise, machine-checkable formal representations suitable for proof assistants such as Lean, Coq, and Isabelle. This process bridges the gap between natural mathematical exposition and the requirements of automated theorem proving (ATP), and has become foundational both for scaling the construction of formal mathematical libraries and for improving the trustworthiness of AI-generated mathematical reasoning (Weng et al., 29 May 2025).

1. Core Definitions and Significance

Autoformalization is defined as the mapping

x_{\text{informal}} \mapsto y_{\text{formal}},

where $x$ is a human-written mathematical proposition or proof, and $y$ is a formal script that type-checks, and ideally proves, the stated intention within a proof assistant. Formally, for a pair $(x, y)$, $y$ must satisfy both syntactic correctness (it type-checks in the target formal language) and semantic fidelity (it captures precisely the mathematical content of $x$) (Weng et al., 29 May 2025).
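For instance, the informal statement "the sum of two even natural numbers is even" could be rendered as the Lean 4 theorem below. This is an illustrative sketch assuming Mathlib's Even predicate; the theorem name is chosen for exposition and is not drawn from a specific system.

import Mathlib

-- Informal x: "The sum of two even natural numbers is even."
-- A candidate formal y: it type-checks, and its proof certifies the claim.
theorem even_add_even {m n : ℕ} (hm : Even m) (hn : Even n) : Even (m + n) := by
  obtain ⟨a, rfl⟩ := hm
  obtain ⟨b, rfl⟩ := hn
  exact ⟨a + b, by ring⟩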

The motivation for autoformalization is multifaceted:

  • Mathematics: It accelerates creation of machine-checked libraries across core domains such as geometry, algebra, and topology, bypassing the labor-intensive process of manual encoding.
  • Artificial Intelligence: It enables LLMs to produce outputs verifiable with formal logic, serving as a mechanism to ground natural language reasoning and reduce hallucinations, thereby increasing trust in AI-generated arguments.

2. Model Architectures and Algorithmic Frameworks

Most current models adopt a sequence-to-sequence paradigm. Given tokenized input $x = (x_1, \ldots, x_m)$, the model generates output $y = (y_1, \ldots, y_T)$ according to:

p_\theta(y \mid x) = \prod_{t=1}^{T} p_\theta(y_t \mid x, y_{<t}),

with parameters $\theta$ trained to minimize cross-entropy over a parallel corpus $\{(x^{(i)}, y^{(i)})\}$:

L(\theta) = -\sum_{i} \sum_{t=1}^{T_i} \log p_\theta\!\left(y_t^{(i)} \mid x^{(i)}, y_{<t}^{(i)}\right).
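A minimal PyTorch sketch of this objective, assuming a decoder-only model whose input concatenates the informal source $x$ with the formal target $y$ and whose label tensor masks the source positions out of the loss (the function name and batch layout are illustrative, not taken from any particular system):

import torch.nn.functional as F

def autoformalization_loss(model, batch):
    # batch["input_ids"]: [B, T] token ids, informal x followed by formal y
    # batch["labels"]:    [B, T] copy of input_ids with x positions (and padding) set to -100
    logits = model(batch["input_ids"])  # [B, T, V] next-token logits
    # Shift so position t predicts token t+1, matching the factorization p(y_t | x, y_<t).
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = batch["labels"][:, 1:].reshape(-1)
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)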

Enhancements include:

  • Retrieval-Augmentation: Embedding-based retrieval of relevant formal lemmas or definitions, appended to the model context, improves both statement and proof synthesis, especially for domains with rich libraries (e.g., Mathlib for Lean) (Weng et al., 29 May 2025); a retrieval sketch appears after this list.
  • Multi-Task and Feedback Integration: Combining objectives, e.g., training on both informal→formal and formal→informal translation, or incorporating prover feedback.
  • Fine-Tuning and In-Context Learning: Domain adaptation via fine-tuning on synthetic or curated formalization corpora (e.g., FormL4, Herald), and effective use of few-shot learning, where 5–10 prompt exemplars yield substantial accuracy on Olympiad and undergraduate-level theorems (Weng et al., 29 May 2025).
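The retrieval sketch referenced above, assuming lemma statements and their embeddings have been computed offline; the embedding source and prompt layout are assumptions, not a fixed recipe:

import numpy as np

def retrieve_lemmas(query_emb, lemma_embs, lemma_texts, k=5):
    # Cosine similarity between the query embedding and each stored lemma embedding.
    q = query_emb / np.linalg.norm(query_emb)
    L = lemma_embs / np.linalg.norm(lemma_embs, axis=1, keepdims=True)
    top = np.argsort(-(L @ q))[:k]
    return [lemma_texts[i] for i in top]

def build_context(informal_statement, lemmas):
    # Prepend the retrieved formal context to the informal statement before generation.
    header = "\n".join(f"-- {s}" for s in lemmas)
    return f"{header}\n/- Informal: {informal_statement} -/\n"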

3. Data Collection, Synthetic Augmentation, and Formal Representations

Data Sources

  • Formal Corpora: Mathlib4 (Lean), Metamath, Mizar, Archive of Formal Proofs (AFP/Isabelle).
  • Human-Written Benchmarks: miniF2F (488 Olympiad-level), ProofNet (371 undergraduate theorems), PutnamBench (657 advanced competition problems), arXiv2Formal (50 research-level statements).

Synthetic Data Augmentation

  • Back-Translation: Formal statements are converted into informal English via LLMs and then re-formalized, producing additional aligned pairs (see the sketch after this list).
  • Cross-Language Pooling: Parallel corpora are synthesized by cloning statements across multiple proof assistants to encourage lexical and structural diversity.
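The back-translation sketch referenced above; informalize, formalize, and type_checks are placeholder callables standing in for an LLM informalizer, the autoformalization model, and a proof-assistant check, respectively, rather than concrete APIs of the surveyed systems:

def augment_by_back_translation(formal_statements, informalize, formalize, type_checks):
    pairs = []
    for y in formal_statements:
        x_syn = informalize(y)        # formal -> informal English via an LLM
        y_rt = formalize(x_syn)       # re-formalize the generated informal statement
        if type_checks(y_rt):         # simple filter; stricter variants also test equivalence with y
            pairs.append((x_syn, y))  # aligned pair: synthetic NL with the original formal statement
    return pairs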

Tokenization and Grammar

  • Subword Units: BPE or SentencePiece for natural language and code.
  • Grammar-Aware Tokens: Explicit marking of tactic syntax, identifiers, etc.
  • Abstract Syntax Trees (ASTs): Parsed from the target proof assistant, sometimes transformed by grammar formalizations (e.g., GF-Lean), and linearized to ensure syntactic validity (a linearization sketch follows this list).
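The linearization sketch referenced above, using a deliberately simplified node type rather than the actual Lean or GF-Lean AST:

from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                   # e.g. "app", "pow", or a leaf identifier
    children: list = field(default_factory=list)

def linearize(node):
    # Serialize the tree to a bracketed token sequence; decoding against this
    # scheme keeps generated output syntactically well-formed by construction.
    if not node.children:
        return [node.label]
    tokens = ["(", node.label]
    for child in node.children:
        tokens += linearize(child)
    return tokens + [")"]

# Toy tree for "Even (n ^ 2)":
ast = Node("app", [Node("Even"), Node("pow", [Node("n"), Node("2")])])
print(linearize(ast))  # ['(', 'app', 'Even', '(', 'pow', 'n', '2', ')', ')']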

4. Evaluation Strategies and Benchmarking

Performance is assessed across several axes:

  • Syntactic Accuracy: Exact token match, AST equivalence, and BLEU/ROUGE scores between generated/formalized and reference outputs. Metrics such as BEqL or BEq+ allow normalization by type-checking and small permissible rewrites (Weng et al., 29 May 2025).
  • Proof Success Rate: Percentage of statements that a downstream automated prover (e.g., tactic-guided BFS-Prover, DeepSeek-Prover) can prove after formalization; peak performance on miniF2F is approximately 90%, but PutnamBench remains substantially more challenging (20–30%). A small evaluation sketch follows this list.
  • Semantic Equivalence: Embedding-based or ATP-based scores confirm alignment of premises and conclusions between generated and gold versions, measuring whether $S_{\text{generated}} \vDash S_{\text{reference}}$.
  • Human Evaluation: Gold standard for high-difficulty or research-level math, where expert annotators judge whether formal statements are faithful interpretations of the original informal problem.
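The evaluation sketch referenced above; type_checks and prove are placeholders for a proof-assistant check and a downstream automated prover, not concrete tool interfaces, and exact match is a crude stand-in for BEq-style metrics:

def evaluate(generated, references, type_checks, prove):
    n = len(generated)
    # Syntactic accuracy: exact match after whitespace normalization.
    exact = sum(" ".join(g.split()) == " ".join(r.split()) for g, r in zip(generated, references))
    well_formed = sum(type_checks(g) for g in generated)
    proved = sum(type_checks(g) and prove(g) for g in generated)
    return {"exact_match": exact / n,
            "type_check_rate": well_formed / n,
            "proof_success_rate": proved / n}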

Table: Key Open-Source Datasets

Name            Size      Domain                  Difficulty
miniF2F         488       Olympiad/high-school    moderate–hard
ProofNet        371       undergraduate math      medium
PutnamBench     657       Putnam competition      hard
LeanEuclid      173       Euclidean geometry      medium
FormL4          17,137    synthetic (Mathlib4)    varied
LeanDojo        122,517   Mathlib4 corpus         assorted
Herald          624,436   Lean4/NL statements     advanced
STP_Lean_0320   3.26M     Lean sources            large

5. Representative Methods and Exemplars

Autoformalization systems now commonly deploy hybrid pipelines combining LLM sequence-to-sequence generation, retrieval augmentation, and formal grammar enforcement, as in the following high-level stages (a prompt-assembly sketch follows the stages):

  1. Few-shot prompted translation of informal to formal statements (e.g., “Prove that if $n$ is even, $n^2$ is even” → corresponding Lean theorem, see (Weng et al., 29 May 2025)).
  2. Chain-of-thought prompting: Numbered steps guide LLMs to synthesize structured proofs as sequences of tactics.
  3. Retrieval of relevant lemmas/definitions: Formal context expanded with theorems from Mathlib or other formal corpora.
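The prompt-assembly sketch referenced above; the instruction wording and formatting are illustrative assumptions about how stages 1–3 can be composed, not a prescribed template:

def assemble_prompt(informal_statement, exemplars, retrieved_lemmas):
    # exemplars: list of (informal, formal) pairs for few-shot prompting (stage 1)
    # retrieved_lemmas: formal context pulled from a library such as Mathlib (stage 3)
    parts = ["Translate each informal statement into a Lean 4 theorem.",
             "Reason in numbered steps, then state the final theorem."]  # elicits chain-of-thought (stage 2)
    for nl, formal in exemplars:
        parts.append(f"Informal: {nl}\nFormal: {formal}")
    if retrieved_lemmas:
        parts.append("Relevant lemmas:\n" + "\n".join(retrieved_lemmas))
    parts.append(f"Informal: {informal_statement}\nFormal:")
    return "\n\n".join(parts)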

Example 1 (Number Theory, Lean 4):

import Mathlib

theorem even_square {n : ℕ} (h : Even n) : Even (n ^ 2) := by
  obtain ⟨k, rfl⟩ := h
  exact ⟨k * (k + k), by ring⟩

Example 2 (Set Theory, Coq):

Require Import Ensembles.

Theorem inter_comm (U : Type) (A B : Ensemble U) : Intersection U A B = Intersection U B A.
Proof.
  apply Extensionality_Ensembles; split; intros x H; destruct H; apply Intersection_intro; assumption.
Qed.

Example 3 (Euclidean Geometry, Lean 4):

theorem isosceles_of_base_angles {A B C : Point}
    (h : angle A B C = angle A C B) : dist A B = dist A C := by
  -- uses formal library E for Euclidean axioms; proof omitted
  sorry

6. Current Limitations and Research Challenges

The autoformalization field faces several intrinsic and technical obstacles:

  • Data Scarcity: High-quality aligned NL–formal corpora are concentrated in well-studied areas, with limited coverage beyond standard undergraduate-level topics.
  • Domain Generalization: LLMs overfit to superficial cues in specific mathematical areas; cross-domain transfer and meta-learning approaches are not yet fully mature.
  • Scalability: Very large or deeply nested proofs stress inference-time memory, and sequence generation limitations hinder formalization completeness for complex arguments.
  • Verifiability and Interpretability: Achieving type-checked outputs does not guarantee semantic correctness; richer metrics and interactive evaluation tools (e.g., LeanDojo, FVEL) are needed to close the gap between syntax and genuine meaning.
  • Creativity and Abstraction: Models are currently limited to formalizing known mathematics; future directions include conjecturing new lemmas, suggesting proof strategies, and automating the discovery of generalizations.
  • Human–AI Collaboration: Combined workflows, where LLMs suggest next steps and humans provide top-level strategy or abstraction, are likely to yield the most robust formalization pipelines moving forward.

7. Outlook and Future Directions

Today’s autoformalization models, combining seq2seq LLMs, retrieval, synthetic data generation, and formal-grammar pipelines, achieve up to 90% success rates on benchmarks such as miniF2F (Weng et al., 29 May 2025). However, extending these results to broader mathematical domains and to non-mathematical formal reasoning (e.g., law or scientific protocols), and achieving true end-to-end verifiability, will require:

  • Expanding synthetic and human-in-the-loop data augmentation for rare domains.
  • Curriculum learning approaches that scaffold models from elementary to frontier mathematics.
  • Advanced interactive debugging infrastructures for error localization and correction at both syntactic and semantic levels.
  • Richer, multi-modal integration (including diagrams and geometric constructions).
  • Universal adapters and cross-domain architectures to exploit generalizable reasoning patterns.

Autoformalization stands as a foundational technology for formal mathematics and trustworthy AI; overcoming current data, domain, and semantic bottlenecks is central to its future evolution.

References

  • Weng et al., 29 May 2025.