Autoformalization with Large Language Models
- Autoformalization with Large Language Models is the automated conversion of natural language statements into formal representations used in rigorous machine reasoning.
- It leverages modular pipelines that combine LLM-driven translation, rule-based parsing, and iterative verification to bridge the semantic gap.
- This approach enhances scalability and precision in formalizing mathematical, logical, and scientific content while enabling robust error correction.
Autoformalization with LLMs is the automated translation of informal mathematical, logical, or scientific statements—typically expressed in everyday language or domain text—into formal representations suitable for machine reasoning, verification, or symbolic manipulation. The rapid development of LLMs has fundamentally changed the landscape of autoformalization across mathematics, logic programming, game theory, and applied domains. Modern autoformalization systems are built as modular pipelines incorporating advanced neural models, rule-based systems, feedback-driven refinement, and formal verification stages. This field is characterized by highly technical workflows, diverse application areas, and evolving methodologies tailored to the precision, scalability, and semantic fidelity demands of formal reasoning.
1. Conceptual Foundations and Definitions
Autoformalization is formally defined as the process whereby a computational system, typically an LLM, realizes a translation function from informal language (e.g., English, LaTeX-math, requirement specifications) into a formal target language (e.g., Lean, Isabelle/HOL, Prolog, PDDL, STL) such that the semantic meaning of the informal input is preserved in the formal output, up to an equivalence criterion of the target language (Mensfelt et al., 11 Sep 2025). Thus, autoformalization encompasses a spectrum of tasks, including mathematical theorem statement conversion, logic program synthesis, planning domain translation, and knowledge graph construction.
Autoformalizers are typically organized as pipelines with key stages:
- Preprocessing/Parsing: Extraction of variables, concepts, and structures from raw input.
- Translation: LLM-based mapping from NL text to formal syntax using prompt engineering or fine-tuning.
- Verification: Postprocessing with type-checkers, automated theorem provers, or property checkers.
- Iterative Correction: Feedback integration via error traces, compilation failures, or semantic mismatches.
The defining challenge is the semantic gap: mapping nuanced, context-dependent NL expressions to precise, unambiguous formal code, often requiring deep domain knowledge and reasoning skills.
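The four stages above can be sketched as a minimal pipeline. This is an illustrative skeleton only: `extract_symbols`, `llm_translate`, and `type_check` are hypothetical stand-ins for an LLM call and a proof-assistant checker, not any system's actual API.

```python
# Minimal sketch of the four pipeline stages; all helpers are stubs.

def extract_symbols(text: str) -> list:
    # Preprocessing/parsing: pull out candidate single-letter variables.
    return sorted({w for w in text.split() if w.isalpha() and len(w) == 1})

def llm_translate(text: str, symbols: list) -> str:
    # Translation: a real system would prompt an LLM here.
    return f"theorem t ({' '.join(symbols)} : Nat) : True := trivial"

def type_check(formal: str) -> list:
    # Verification: a real system would invoke a Lean/Isabelle checker.
    return [] if formal.startswith("theorem") else ["does not parse"]

def autoformalize(text: str) -> tuple:
    symbols = extract_symbols(text)        # 1. preprocessing/parsing
    formal = llm_translate(text, symbols)  # 2. translation
    errors = type_check(formal)            # 3. verification
    return formal, errors                  # 4. errors feed the next round
```

In a real pipeline the error list from stage 3 would be folded back into the next prompt, which is where the iterative-correction stage does its work.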
2. Methodological Architectures and Pipelines
State-of-the-art autoformalization leverages both neural and symbolic components. Key architectures include:
A. Neuro-symbolic Hybrid Pipelines
KELPS exemplifies a three-stage workflow:
- Semantic parsing to an intermediate "Knowledge Equation" (KE), grounded in Assertional Logic.
- Syntactic alignment via deterministic rule-based translation from KE to Lean, Coq, Isabelle, guaranteeing compositional preservation of meaning.
- Automated validation: grammar check, compiler type-check, and semantic scoring using LLM judges (Zhang et al., 11 Jul 2025).
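The syntactic-alignment stage can be illustrated with a toy rule table. This is a drastic simplification of KELPS' Assertional-Logic KEs, not its actual representation: the "knowledge equation" here is just a nested tuple, compiled to Lean syntax by deterministic rules so that meaning is preserved compositionally.

```python
# Toy deterministic KE -> Lean translation; the KE encoding is invented
# for illustration (KELPS' real intermediate language is richer).

RULES = {
    "eq":     lambda a, b: f"{a} = {b}",
    "add":    lambda a, b: f"({a} + {b})",
    "forall": lambda v, body: f"∀ {v} : ℕ, {body}",
}

def ke_to_lean(ke):
    # Leaves are variable names or literals; nodes are (op, *args).
    if not isinstance(ke, tuple):
        return str(ke)
    op, *args = ke
    return RULES[op](*(ke_to_lean(a) for a in args))

ke = ("forall", "n", ("eq", ("add", "n", 0), "n"))
lean = ke_to_lean(ke)   # "∀ n : ℕ, (n + 0) = n"
```

Because each rule is a fixed, local rewrite, swapping the rule table retargets the same KE to Coq or Isabelle, which is the point of decoupling parsing from alignment.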
B. Feedback-Guided Refinement
Pipelines such as the game description formalizer or FMC integrate a formal solver (Prolog or Lean REPL) for syntax validation, with iterative error feedback enabling the LLM to self-correct malformed code (Mensfelt et al., 18 Sep 2024, Xie et al., 15 Jul 2025).
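The repair loop in such pipelines can be sketched as follows. Both `call_llm` and `solver_check` are stubs standing in for an LLM endpoint and a Prolog/Lean checker; the point is only the control flow, in which solver error traces are appended to the prompt until the code checks.

```python
# Sketch of a feedback-guided repair loop; helpers are hypothetical stubs.

def call_llm(prompt: str) -> str:
    # Stub LLM: emits a fixed clause, "repaired" once it sees an error.
    return "p :- q." if "error" in prompt else "p :- q"  # missing '.'

def solver_check(code: str):
    # Stub Prolog-style syntax check: clauses must end with a period.
    return None if code.rstrip().endswith(".") else "error: operator expected"

def formalize_with_repair(nl: str, max_rounds: int = 3) -> str:
    prompt = f"Translate to Prolog:\n{nl}"
    code = call_llm(prompt)
    for _ in range(max_rounds):
        err = solver_check(code)
        if err is None:
            return code
        # Fold the error trace back into the prompt for self-correction.
        prompt += f"\nPrevious attempt:\n{code}\nSolver said: {err}"
        code = call_llm(prompt)
    return code
```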
C. Multi-Agent Architectures
MASA demonstrates modular multi-agent designs with specialized agents: an AutoformalizationAgent performs the NL→formal mapping, CritiqueAgents invoke theorem provers and LLM judges, and RefinementAgents apply provable or semantic corrections, all orchestrated via iterative communication loops and extensible tooling (Zhang et al., 10 Oct 2025).
D. Process-Level Supervision
Process-driven autoformalization (FormL4+PSV) uses stepwise compiler traces, labeling each tactic as correct/incorrect. The Process-Supervised Verifier (PSV) filters and ranks candidate formalizations, improving accuracy and sample efficiency (Lu et al., 4 Jun 2024).
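A toy version of process-level reranking makes the idea concrete: label each tactic step against stepwise checker feedback and prefer candidates with more verified steps. The tactic checker below is a stand-in, not the FormL4/PSV implementation.

```python
# Toy process-supervised reranking; the step checker is a stub.

def check_tactic(tactic: str) -> bool:
    # Stub stepwise compiler feedback: 'sorry' counts as a failed step.
    return "sorry" not in tactic

def process_score(proof: list) -> float:
    # Fraction of tactic steps labeled correct.
    labels = [check_tactic(t) for t in proof]
    return sum(labels) / len(labels)

def rank_candidates(candidates: list) -> list:
    # PSV-style reranking: candidates with more verified steps first.
    return sorted(candidates, key=process_score, reverse=True)

good = ["intro n", "simp"]
bad = ["intro n", "sorry"]
best = rank_candidates([bad, good])[0]
```

Outcome-level supervision would score only whether the whole proof closes; the stepwise labels are what give the verifier its sample efficiency.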
E. Data-Centric Methods
High-fidelity, backtranslated paired datasets significantly outperform large but less curated corpora. On-the-fly and distilled backtranslation, as well as line-by-line proof state analysis, have demonstrated superior results for Lean4 formalization (Chan et al., 18 Feb 2025).
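The backtranslation recipe can be sketched in a few lines: start from formal statements known to compile, generate informal descriptions, and keep only the resulting pairs. `informalize` and `compiles` are hypothetical stubs for an LLM call and a Lean 4 compile check.

```python
# Sketch of distilled backtranslation for data curation; helpers are stubs.

def informalize(formal: str) -> str:
    # Stub: a real pipeline prompts an LLM to describe the statement.
    return f"In plain language: {formal}"

def compiles(formal: str) -> bool:
    # Stub Lean 4 compile check.
    return formal.startswith("theorem")

def backtranslate_corpus(formals: list) -> list:
    pairs = []
    for f in formals:
        if compiles(f):                        # formal side is fixed and checked
            pairs.append((informalize(f), f))  # high-fidelity paired example
    return pairs
```

Because the formal side is verified before pairing, every training example has a guaranteed-valid target, which is what lets small curated corpora beat larger noisy ones.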
3. Formal Representation Strategies
Target formal languages and their encoding schemes vary by domain:
- Mathematics: Situation Calculus style, Higher-Order Logic (Lean, Isabelle, Coq), assertional logic (KE).
- Planning: PDDL and temporal extensions, mapping actions, preconditions, effects from NL descriptions (Mensfelt et al., 11 Sep 2025).
- Game Theory: Prolog predicates for roles, moves, payoffs; normal-form games specified as (players, strategies, payoffs) tuples; extensive-form games via situation calculus in Prolog (Mensfelt et al., 18 Sep 2024).
- PDEs/Physics: Differential Game Logic (dGL), Signal Temporal Logic (STL), hybrid automata; formulae specifying ODEs, constraints, and control objectives (2502.00963, Kabra et al., 26 Sep 2025).
- Requirements Verification: NL requirements autoformalized to propositional Lean4 code, equivalence proofs attempted between requirement and LLM output (Gupte et al., 14 Nov 2025).
- Other Domains: OWL, RDF(S), custom knowledge graphs; declarative rules, ontological assertion lists.
Intermediate languages (KEs, ASTs, process graphs) decouple semantic parsing from syntactic alignment, enabling multi-target translation and robust validation (Zhang et al., 11 Jul 2025).
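To make the mathematics target concrete, here is an illustrative example of what an autoformalizer aims to produce: the informal statement "the sum of two even natural numbers is even" rendered as a Lean 4 theorem. The statement, names, and proof are chosen for this example (and assume Mathlib for the `ℕ` notation and the `ring` tactic); they are not drawn from any of the cited systems.

```lean
-- Illustrative target output for: "the sum of two even naturals is even".
theorem even_add_even (m n : ℕ)
    (hm : ∃ k, m = 2 * k) (hn : ∃ k, n = 2 * k) :
    ∃ k, m + n = 2 * k := by
  obtain ⟨a, ha⟩ := hm
  obtain ⟨b, hb⟩ := hn
  exact ⟨a + b, by rw [ha, hb]; ring⟩
```

An autoformalizer must both choose a faithful logical rendering of "even" and produce code the compiler accepts, which is exactly the syntactic/semantic split the evaluation metrics in the next section measure.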
4. Evaluation Benchmarks and Metrics
Autoformalization performance is assessed using rigorous benchmarks:
- Syntactic correctness: Fraction of outputs compiling in the target system (Lean4, Isabelle/HOL, Prolog), often measured as pass@k (Mensfelt et al., 18 Sep 2024, Zhang et al., 11 Jul 2025, Xie et al., 15 Jul 2025).
- Semantic accuracy: Expert annotation of payoff, role, and proof alignment; BLEU/CodeBLEU, alignment faithfulness, and semantic scoring by LLM judges (Mensfelt et al., 18 Sep 2024, Zhang et al., 11 Jul 2025, Xie et al., 15 Jul 2025, Zhang et al., 10 Oct 2025).
- Correction effort: 0–4 scale reflecting edits needed for Lean4 statements (Gulati et al., 1 Jun 2024).
- Proof success rate: Proportion of theorems correctly proven by ATP after autoformalization (Wu et al., 2022, Weng et al., 29 May 2025).
- File-level correctness: Fraction of autoformalized files compiling without errors and containing no unresolved subgoals (Li et al., 13 Nov 2025).
- Process-level precision/recall: Matching tactic-level labels to stepwise compiler feedback (Lu et al., 4 Jun 2024).
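The pass@k metric above is usually computed with the unbiased estimator of Chen et al. (2021): given n sampled formalizations per problem, of which c pass the checker, estimate the probability that at least one of k randomly drawn samples passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failures exist, so any k-sample must contain a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples per problem, 3 compile: pass@1 = 0.3
```

Averaging this quantity over all benchmark problems gives the reported pass@k score.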
Key datasets include MiniF2F (488 Olympiad-level problems), ProofNet (Lean4 undergraduate statements), FMC (3,922 Olympiad problems aligned in Lean), FormL4 (17,137 informalized Lean4 statements/proofs), SITA Opt-bench (42 optimization problems), and domain-specific suites (kinematics, game theory) (Mensfelt et al., 18 Sep 2024, Zhang et al., 11 Jul 2025, Xie et al., 15 Jul 2025, Lu et al., 4 Jun 2024, Li et al., 13 Nov 2025, Zuo et al., 28 Sep 2025, Kabra et al., 26 Sep 2025).
Recent systems report syntactic correctness approaching 98% on structured game descriptions (Mensfelt et al., 18 Sep 2024) and 88.9% on MiniF2F with KELPS (Zhang et al., 11 Jul 2025), with semantic accuracy varying from 81%–88% depending on domain and evaluation protocol.
5. Robustness, Limitations, and Error Analysis
Autoformalization remains challenged by the following:
- Sensitivity to paraphrasing: LLM output varies substantially under semantically equivalent NL rewrites, affecting both compilation pass rates and semantic validity. Data augmentation with paraphrases, contrastive fine-tuning, and prompt engineering can mitigate—but not eliminate—such sensitivity (Moore et al., 16 Nov 2025).
- Semantic ambiguity and under-specification: Non-standard or ambiguous NL inputs produce payoff misorderings, missing premises, or misaligned definitions, particularly in qualitative or metaphorical scenarios (Mensfelt et al., 18 Sep 2024, Moore et al., 16 Nov 2025, Gulati et al., 1 Jun 2024).
- Scalability bottlenecks: Manual semantic validation limits throughput; file-level formalization and proof completion are limited by current LLM capabilities (Li et al., 13 Nov 2025, Zhang et al., 11 Jul 2025).
- Tool-specific failures: Domain solvers (e.g. dGL for non-polynomial ODEs) and model checkers may fail on rare or structurally novel problems, requiring solver extension or logic lifting (Kabra et al., 26 Sep 2025).
- Interactive limitations: One-shot prompting can miss critical context or fall short in higher-order proof structure or multi-step reasoning. Interactive, feedback-driven workflows (self-correction, repair loops, process-level supervision) are required for robust automation (Mensfelt et al., 18 Sep 2024, Lu et al., 4 Jun 2024, Zuo et al., 28 Sep 2025, Zhang et al., 10 Oct 2025).
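The paraphrase-sensitivity issue listed first above suggests a simple robustness probe: formalize several semantically equivalent rewrites of the same statement and report the spread of compile success. The sketch below uses a stub `formalize_ok` in place of a real formalize-and-compile call.

```python
from statistics import mean, pstdev

def formalize_ok(text: str) -> bool:
    # Stub: pretend only phrasings containing "every" yield compiling code.
    return "every" in text

def robustness(paraphrases: list) -> tuple:
    # Pass rate and its spread across equivalent rewrites.
    outcomes = [1.0 if formalize_ok(p) else 0.0 for p in paraphrases]
    return mean(outcomes), pstdev(outcomes)

rate, spread = robustness([
    "every even number plus an even number is even",
    "for all even m, n the sum m + n is even",
    "adding two evens yields an even",
])
```

A robust formalizer would show a high rate with near-zero spread; a large spread signals exactly the paraphrase sensitivity described above.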
6. Applications, Case Studies, and Cross-Domain Expansion
Autoformalization technologies increasingly serve as foundation models for automated reasoning, certified verification, and intelligent planning:
- Game theory: Enables the translation of real-world strategic scenarios to Prolog-based solvers, supporting formal deduction of equilibria and strategic analysis (Mensfelt et al., 18 Sep 2024).
- Mathematical competition: Automated Lean formalization of Olympiad problems for advancing neural theorem provers and establishing difficulty benchmarks (Xie et al., 15 Jul 2025, Wu et al., 2022).
- Requirement specification and verification: Certifies logical equivalence and consistency of LLM-generated outputs with NL requirements, using formal proof obligations to detect mismatches (Gupte et al., 14 Nov 2025).
- Model checking and formal verification: End-to-end construction of executable, correct-by-construction CSP# models with interactive repair and verification procedures (Zuo et al., 28 Sep 2025).
- Physics and control: Formal modeling of kinematics and PDE-driven control problems, translating NL system descriptions to hybrid game logics and constraint programs (2502.00963, Kabra et al., 26 Sep 2025).
- Structural theorem instantiation: SITA's abstract template instantiation enables research-level formalization of optimization algorithms, certifying correctness under reusable, abstract interfaces (Li et al., 13 Nov 2025).
- Multi-agent collaboration: MASA's modular agents leverage LLMs, theorem provers, and knowledge bases for collaborative, iterative formalization, promoting extensibility and reliability in practical deployments (Zhang et al., 10 Oct 2025).
7. Future Directions and Open Challenges
Current research trajectories focus on:
- Semantic verification advancements: Automated grounding and logic equivalence checking, meta-reasoning modules, and scaling up semantic judges to reduce human oversight (Mensfelt et al., 11 Sep 2025, Gupte et al., 14 Nov 2025).
- Domain expansion: Extending KELPS-style intermediate representations and compositional pipelines to geometry, topology, hybrid systems, and higher algebra (Zhang et al., 11 Jul 2025, 2502.00963).
- Interactive formalization assistants: Incorporating dialogue, incremental refinement, context retrieval, and clarification querying as core mechanisms (Mensfelt et al., 18 Sep 2024, Mensfelt et al., 11 Sep 2025).
- Hybrid symbolic-neural architectures: Combining LLMs with SMT/ATP solvers, type-checking refinement, and process-level supervision for rigorous verification (Weng et al., 29 May 2025, Lu et al., 4 Jun 2024).
- Benchmarks and metrics standardization: Defining cross-domain evaluation protocols capturing semantic fidelity, process correctness, and robustness to NL variability (Mensfelt et al., 11 Sep 2025, Moore et al., 16 Nov 2025).
- Data quality optimization: Prioritizing small, high-fidelity paired datasets (backtranslation, proof state alignment) over massive, noisy multilingual corpora (Chan et al., 18 Feb 2025).
Autoformalization with LLMs stands as a crucial bridge from informal domain expertise to machine-checkable, verifiable reasoning, making formal methods accessible for mathematics, scientific planning, engineering, and policy research.