
Full-Theorem Autoformalization

Updated 6 February 2026
  • Full-theorem autoformalization is the process of converting entire mathematical theorems and proofs from natural language into formal code, ensuring both syntactic and semantic validity.
  • Advanced pipelines decompose the task into unlinked formalization, entity linking, and type adjustment, enabling one-click verification in systems like Lean, Coq, or Isabelle.
  • The approach leverages dependency graphs, reinforcement learning, and neuro-symbolic strategies to enhance precision and minimize manual intervention in research-level mathematics.

Full-theorem autoformalization is the automatic translation of entire mathematical theorems (and in general, their proofs) from informal, human-written natural language into machine-checkable code accepted by interactive theorem provers (ITPs) such as Lean, Coq, or Isabelle. The goal is to produce, in one pass, a complete formal statement—including all quantifiers, assumptions, domain annotations, and the conclusion—as well as a verified proof script that passes both syntactic and semantic validation in the chosen formal system. This task is distinguished by its end-to-end automation, encompassing both intricate semantic mapping and rigorous type-checking, and is a foundational challenge in the advancement of automated theorem proving and AI-driven mathematics.

1. Problem Definition, Scope, and Motivation

Full-theorem autoformalization requires the system to output a formal theorem statement s and a complete formal proof p in a proof assistant, such that p is mechanically verified by the assistant and s is semantically equivalent to the original human text. The ambition is to replace the labor-intensive process of formalizing advanced mathematics—previously exemplified by multi-person, multi-year efforts (e.g., the formalization of the Liquid Tensor Experiment)—with scalable, automated methods that can efficiently process research-level mathematical text (Patel et al., 2023, Weng et al., 29 May 2025, Cabral et al., 13 Oct 2025).
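In symbols, the requirement can be stated schematically (this rendering is a restatement of the description above, not notation taken from the cited papers):

```latex
\text{Given informal text } t:\quad \text{produce a pair } (s,\,p)\ \text{such that}\quad
\mathrm{Check}_{\mathrm{ITP}}(s,\,p) = \mathsf{true}
\quad\text{and}\quad s \equiv_{\mathrm{sem}} t,
```

where Check_ITP denotes mechanical verification by the proof assistant and the second condition is the (informal) semantic-equivalence requirement.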

Key challenges include:

  • Natural language ambiguity and context dependence: Mathematical texts often rely on implicit background, context, and definitions not explicitly spelled out.
  • Scarcity of parallel data: Large, high-quality corpora of aligned informal (NL) and formal (ITP code) text at the research level are rare.
  • Entanglement of naming, linking, and typing: The process is tightly coupled; every concept must be defined, correctly linked to library definitions, and type-checked.

Full-theorem autoformalization aims for one-click import→formalize→verify, producing a certified statement and proof requiring no additional human correction (Patel et al., 2023, Weng et al., 29 May 2025).

2. Architectures and Methods for Full-Theorem Autoformalization

Several approaches have been advanced for full-theorem autoformalization, each contributing specific architectural innovations.

A representative and influential paradigm decomposes the end-to-end task into three sequential phases:

  1. Unlinked Formalization: A sequence-to-sequence model takes a (usually LaTeX) theorem statement stripped of informal context and outputs a syntactically correct formal statement in Lean, but with all non-literal terms replaced by local placeholder names. Example:

theorem thm_1 (x y : α) (hα : α ≃* real) (hxy : x > y) : x * x > y * y := by admit

Here, α is a placeholder type, not yet recognized as real.

  2. Entity Linking: This stage maps each placeholder (type, function, relation, constant) to a concrete symbol in the assistant’s standard library (e.g., mathlib for Lean), utilizing a blend of surface-form matching, embedding-based similarity, and reranking heuristics:

theorem thm_1 (x y : ℝ) (hxy : x > y) : x * x > y * y := by admit

  3. Type Adjustment: The resulting Lean (or other ITP) code is subject to type-checking; further edits (argument insertion, coercions, or fixed type signatures) are performed in a loop until compilation succeeds or a maximum iteration threshold is reached.

This pipeline converts the monolithic challenge of full-theorem autoformalization into more tractable subtasks, each amenable to separate supervision (Patel et al., 2023).
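As a hypothetical end state of the three phases, the type-adjustment loop could repair the linked statement into one that both type-checks and admits a proof. Note that the statement as linked is actually false over ℝ (take x = 0, y = −1), so a repair such as inserting a nonnegativity hypothesis is needed; the lemma name `mul_self_lt_mul_self` below is from Lean's mathlib, but this particular repaired output is illustrative, not taken from the cited paper:

```lean
-- Illustrative post-adjustment output: `x > y → x * x > y * y` is false
-- over ℝ, so the repair loop must insert `0 ≤ y` before a proof exists.
theorem thm_1 (x y : ℝ) (hy : 0 ≤ y) (hxy : x > y) : x * x > y * y := by
  exact mul_self_lt_mul_self hy hxy
```

This also illustrates why compilation success alone is not the whole story: type adjustment can silently change what is being claimed, which is exactly what the semantic-evaluation stages discussed below are meant to catch.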

Another architectural trend leverages explicit dependency graphs to mirror the logical structure of the original mathematical argument:

  • ProofFlow (Cabral et al., 13 Oct 2025) constructs a directed acyclic graph (DAG) representing theorem premises, definitions, intermediate lemmas, and solution steps. Each node is formalized independently as a Lean lemma (with “by sorry” placeholders), and proofs are completed in dependency order.
  • Aria (Wang et al., 6 Oct 2025) uses a two-phase "Graph-of-Thought" pipeline: (A) recursively decomposing a conjecture into a dependency graph of concepts, and (B) synthesizing formal code for each node in topological order, with semantic verification by grounding terms to library entries.

This structural approach prioritizes semantic and structural fidelity, enforcing that generated proofs respect both logical flow and library dependencies.
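A toy sketch of the node-per-lemma style (lemma names invented for illustration; real ProofFlow nodes are generated from the source argument and initially stubbed with `by sorry`): each DAG node becomes an independent Lean lemma, and the root theorem is assembled from its parents in dependency order.

```lean
-- Toy DAG: two leaf nodes feed one intermediate node, which proves the root.
lemma node_sq_nonneg (x : ℝ) : 0 ≤ x ^ 2 := sq_nonneg x

lemma node_add_nonneg (x y : ℝ) (hx : 0 ≤ x ^ 2) (hy : 0 ≤ y ^ 2) :
    0 ≤ x ^ 2 + y ^ 2 := add_nonneg hx hy

theorem root_goal (x y : ℝ) : 0 ≤ x ^ 2 + y ^ 2 :=
  node_add_nonneg x y (node_sq_nonneg x) (node_sq_nonneg y)
```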

Iterative reference-free refinement, as exemplified by (Zhang et al., 30 Jan 2026), reframes autoformalization as repeated hill-climbing in a masked composite objective over four axes: Formal Validity (FV), Logical Preservation (LP), Mathematical Consistency (MC), and Formal Quality (FQ). Each candidate formalization is evaluated by a mixture of theorem prover verdicts and LLM-judge soft scores. A role-specialized generator pool (one-off, repairer, recurrent) is dynamically allocated to optimize the composite objective, with proven monotonicity and convergence.
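A minimal sketch of the masked-composite idea, under stated assumptions: the weights, scorer stubs, and function names below are invented, and the paper's exact objective and judge models are not reproduced. The key mechanism is that Formal Validity acts as a hard mask (invalid code scores zero), and hill-climbing over best-so-far candidates is monotone by construction.

```python
# Sketch: masked composite scoring over (FV, LP, MC, FQ) candidate axes.
# FV is a boolean prover verdict; LP/MC/FQ are soft judge scores in [0, 1].

def composite_score(fv: bool, lp: float, mc: float, fq: float,
                    weights=(0.5, 0.3, 0.2)) -> float:
    if not fv:                      # masked: formally invalid code scores zero
        return 0.0
    w_lp, w_mc, w_fq = weights
    return w_lp * lp + w_mc * mc + w_fq * fq

def hill_climb(candidates):
    """Track the best composite score seen so far.

    Returns the best-so-far trace, which is non-decreasing by construction
    (a toy analogue of the monotonicity property claimed for the framework).
    """
    best, trace = 0.0, []
    for fv, lp, mc, fq in candidates:
        best = max(best, composite_score(fv, lp, mc, fq))
        trace.append(best)
    return trace
```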

ATF (Guo et al., 8 Oct 2025) sequentially applies Lean compiler checks and multi-LLM consistency judgments to each generated output, with each failure prompting targeted correction via model-generated revision.

These frameworks explicitly integrate both automated type-checking and multi-expert semantic evaluation into the language-model generation loop, significantly reducing hallucinations and semantic drift.
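The check-then-revise control flow shared by these loops can be sketched as follows. This is a hedged sketch, not ATF's actual API: the function names and the majority-vote rule are placeholders standing in for the compiler check and multi-LLM consistency judgment described above.

```python
# Sketch of a check-then-revise loop: a candidate must pass the compiler
# check and a majority of consistency judges, else it is sent for revision.

def verify_and_revise(candidate, compile_ok, judges, revise, max_rounds=3):
    """compile_ok: code -> bool; judges: list of (code -> bool);
    revise: (code, feedback) -> code. Returns (candidate, accepted)."""
    for _ in range(max_rounds):
        if not compile_ok(candidate):
            candidate = revise(candidate, "compile error")
            continue
        votes = [judge(candidate) for judge in judges]
        if sum(votes) * 2 > len(votes):      # strict majority of judges agree
            return candidate, True
        candidate = revise(candidate, "semantic mismatch")
    return candidate, False
```

Each failure mode produces targeted feedback ("compile error" vs. "semantic mismatch"), mirroring how these frameworks route distinct error signals to distinct correction prompts.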

3. Benchmarks, Evaluation Metrics, and Empirical Results

Quantitative evaluation in full-theorem autoformalization is conducted using a combination of established and domain-specific metrics:

  • BLEU, TER — n-gram overlap with human reference formalizations, at token level (Patel et al., 2023).
  • ProofScore — mean over syntactic correctness, semantic faithfulness, and structural fidelity per proof step (Cabral et al., 13 Oct 2025).
  • Compiler pass rate — percentage of generated code compiling under the theorem prover (Wang et al., 6 Oct 2025, Guo et al., 8 Oct 2025).
  • Semantic consistency (e.g., AriaScorer) — LLM-ensemble judgment of clause-level semantic equivalence with the source statement (Wang et al., 6 Oct 2025).
  • Final accuracy — proportion of examples compiling and scoring above a semantic threshold (Wang et al., 6 Oct 2025, Guo et al., 8 Oct 2025).
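The "final accuracy" metric combines a hard and a soft criterion, which can be made precise in a few lines. This is a minimal sketch under an assumed threshold (0.8 here is illustrative; the cited papers set their own):

```python
# Final accuracy: an example counts only if it both compiles and clears
# the semantic-score threshold.

def final_accuracy(results, threshold=0.8):
    """results: iterable of (compiles: bool, semantic_score: float)."""
    results = list(results)
    if not results:
        return 0.0
    hits = sum(1 for ok, score in results if ok and score >= threshold)
    return hits / len(results)
```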

Notable benchmarks include arXiv2Formal, ProofNet, and FATE-X.

Selected results:

  • In arXiv2Formal, in-context 10-shot GPT-3.5 yields BLEU 41.7, consistent with human translation adequacy (r ≈ 0.99) (Patel et al., 2023).
  • ProofFlow achieves ProofScore 0.545, over 4x baseline full-proof or step-wise approaches (Cabral et al., 13 Oct 2025).
  • Aria attains 68.5% final accuracy on ProofNet and reaches 44.0% on FATE-X, versus 24.0% for the best baseline (Wang et al., 6 Oct 2025).
  • ATF improves consistency check pass rate by 9%–29% across various benchmarks compared to prior models (Guo et al., 8 Oct 2025).

4. Specialized Strategies and Technical Innovations

Beyond generic LLM-driven pipelines, research advances several specialized strategies:

  • Retrieval-Augmented Formalization: Integration of definition-level retrieval (CRAMF (Lu et al., 9 Aug 2025)) or example retrieval via joint NL–formal embedding spaces (ProofBridge (Jana et al., 17 Oct 2025)) enhances concept grounding and semantic precision.
  • Template-Guided Instantiation: SITA (Li et al., 13 Nov 2025) formalizes concrete instances by instantiating abstract structure-theorem templates, using type-class mechanisms for modularity and reuse in Lean.
  • Neuro-symbolic Hybridization: In geometry, (Murphy et al., 2024) couples LLM-generated explicit proof scripts with a domain-axiomatized SMT engine for diagrammatic gap-filling and semantic equivalence checking.
  • Reinforcement Learning for Formalization: Mathesis (Xuejun et al., 8 Jun 2025), FormaRL (Huang et al., 26 Aug 2025), and ATF (Guo et al., 8 Oct 2025) optimize generation using a blend of compiler-based rewards and LLM-judged semantic rewards, with variants of GRPO and DPO used to bias generation toward both syntactic soundness and semantic faithfulness.
  • Backtranslation and Data Amplification: High-quality backtranslation (formal–informal–formal loops), with few-shot or line-by-line prompting, drastically improves model sample efficiency and formalization accuracy, outperforming even large but unfocused multilingual pretraining (Chan et al., 18 Feb 2025).
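The reward blend used in the RL-based approaches above can be sketched simply. The weighting and the 0/1 compiler signal are assumptions for illustration; the cited papers differ in their exact reward shaping and in how the semantic judge is scored.

```python
# Illustrative blended RL reward: the compiler signal gates the reward,
# and an LLM-judged semantic score adds a weighted bonus on top.

def formalization_reward(compiles: bool, judge_score: float,
                         lam: float = 0.5) -> float:
    """judge_score in [0, 1]; uncompilable code earns zero reward."""
    if not compiles:
        return 0.0
    return 1.0 + lam * judge_score   # base reward for compiling + semantic bonus
```

Gating (rather than summing) the compiler term biases policy optimization toward syntactic soundness first, matching the observation that semantic faithfulness is only meaningful once the code type-checks.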

5. Limitations, Challenges, and Future Directions

Current limitations and open challenges center on:

  • Scale and Domain Generalization: Existing datasets, such as arXiv2Formal (50 theorems), trail the desired research-mathematics scale. Extension to thousands of theorems, advanced domains, and multi-library coverage remains ongoing (Patel et al., 2023, Weng et al., 29 May 2025).
  • End-to-End Integration: Most pipelines treat sub-tasks (statement, proof, entity linking) as sequential, missing opportunities for feedback loops or joint optimization (Patel et al., 2023, Zhang et al., 30 Jan 2026).
  • Structural and Semantic Gaps: Multipart theorems, long-chain dependency structures, and proof gap filling—especially outside of algebra or analysis—are not fully automated (Patel et al., 2023, Cabral et al., 13 Oct 2025).
  • Evaluation Robustness: Dependence on LLM semantic judges introduces bias; uncertainty calibration is an active area (Zhang et al., 30 Jan 2026).
  • Interactive Feedback: Human-in-the-loop systems or dialogue-based incremental refinement could address ambiguous or under-specified inputs (Mensfelt et al., 11 Sep 2025).

Emerging directions include tightly integrated joint-inference pipelines, expanded retrieval to include proof-level retrieval, reinforcement learning with formal-verification rewards, and architecture-agnostic approaches adaptable to multiple proof assistants.

6. Theoretical and Practical Impact

Full-theorem autoformalization represents the convergence of advances in LLMs, semantic parsing, symbolic reasoning, and formal methods infrastructure, and underpins future AI-powered mathematical research.

As methods scale toward larger corpora, deeper proof dependencies, and richer mathematical domains, full-theorem autoformalization is expected to play a transformative role in computational mathematics and formal verification at the research frontier.
