Mathesis-Autoformalizer: Lean 4 Translation
- The paper introduces Mathesis-Autoformalizer, a framework that automatically translates informal math problems into formal Lean 4 statements using reinforcement learning and neural modeling.
- It integrates syntactic checks with Lean 4 compilers and semantic validation via ensemble LLMs, ensuring high-fidelity translation through rigorous tool feedback.
- The system sets state-of-the-art benchmarks on Gaokao-Formal and MiniF2F, achieving notable pass rate improvements and demonstrating scalable performance.
Mathesis-Autoformalizer is an end-to-end automated framework for translating informal mathematical problems, typically specified in natural language, into formal statements in Lean 4 that are suitable for downstream formal theorem proving. This system establishes a methodological and empirical foundation for high-fidelity autoformalization by integrating data-driven neural modeling, reinforcement learning, proof-assistant tool feedback, and robust semantic validation regimes. Mathesis-Autoformalizer positions itself as state-of-the-art on challenging benchmarks such as Gaokao-Formal and MiniF2F, setting new standards for reliability and scalability in formal mathematical translation (Xuejun et al., 8 Jun 2025, Guo et al., 8 Oct 2025).
1. Core Architecture and Workflow
Mathesis-Autoformalizer consists of a multi-stage pipeline structured around three principal components:
- Autoformalizer Module: Given a natural language (NL) math problem $x$, an LLM-based policy $\pi_\theta$ generates a set of candidate formal Lean 4 statements $\{o_i\}$. This policy is trained via reinforcement learning, specifically Group Relative Policy Optimization (GRPO) and Hierarchical Preference Optimization (HPO), to maximize a formalization-quality reward. The optimization incorporates both syntactic correctness (Lean compilation) and semantic appropriateness (LeanScorer rating).
- Validation and Ranking: Each candidate undergoes Lean 4 syntactic verification (compiler check) and semantic scoring via the LeanScorer framework, which decomposes the NL problem into subtasks, assigns a rating to each translation, and aggregates them using the Sugeno fuzzy integral:
$S = \max_{i}\,\min\bigl(h(x_{(i)}),\, g(A_{(i)})\bigr)$
where the subtask ratings $h(x_{(i)})$ are sorted in descending order, with numeric values assigned to the ratings perfect, minor issue, and major issue; $A_{(i)} = \{x_{(i)}, \dots, x_{(n)}\}$ is the set of the $i$ highest-rated subtasks; and $g$ is a custom fuzzy measure.
- Downstream Proving: The best (highest-scoring, compilable formalization) is passed to Mathesis-Prover, an LLM+search system that constructs a formal Lean proof.
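The Sugeno aggregation used by LeanScorer can be sketched as follows. The specific rating values and the cardinality-based fuzzy measure here are illustrative assumptions, not the paper's exact parameters:

```python
import math

# Sketch of Sugeno fuzzy-integral aggregation of subtask ratings.
# Rating values and the fuzzy measure g are illustrative assumptions.

def sugeno_integral(ratings, measure):
    """Aggregate subtask ratings h(x_i) via the Sugeno integral:
    S = max_i min(h(x_(i)), g(A_(i))), with ratings sorted descending."""
    n = len(ratings)
    order = sorted(range(n), key=lambda i: ratings[i], reverse=True)
    best = 0.0
    for rank, idx in enumerate(order):
        # A_(rank) = the rank+1 subtasks with the highest ratings.
        subset = frozenset(order[: rank + 1])
        best = max(best, min(ratings[idx], measure(subset)))
    return best

RATING = {"perfect": 1.0, "minor": 0.5, "major": 0.0}  # assumed mapping

def cardinality_measure(subset, total=3):
    # Simplest possible fuzzy measure: proportional to subset size.
    return len(subset) / total

ratings = [RATING["perfect"], RATING["minor"], RATING["perfect"]]
score = sugeno_integral(ratings, cardinality_measure)
```

With two perfect subtasks and one minor issue, the integral here settles at 2/3 — a single minor flaw drags the aggregate below "perfect" without zeroing it out, which is the behavior the fuzzy aggregation is chosen for.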
This pipeline advances previous work by performing autoformalization with domain-aligned RL, integrated semantic/syntactic tool feedback, and scalable neural architectures.
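To make the pipeline's input/output concrete, a hypothetical (not drawn from the paper's datasets) NL problem such as "prove that for every natural number n, n + 0 = n" would be autoformalized into a Lean 4 statement of roughly this shape, with the proof body left as a placeholder for Mathesis-Prover to fill:

```lean
-- Hypothetical autoformalization output; the theorem name is illustrative.
-- Downstream, Mathesis-Prover replaces `sorry` with an actual proof.
theorem add_zero_example (n : Nat) : n + 0 = n := by
  sorry
```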
2. Optimization Paradigms for Autoformalization
The autoformalizer's training objective blends composite reward signals:
- $r_{\mathrm{sem}}$: semantic fit between input and formalization (the translation is rated "Appropriate" by LeanScorer).
- $r_{\mathrm{syn}}$: syntactic validity (Lean 4 compiler success).
Candidates are sampled in groups, and the GRPO objective encourages the model to raise the log-probability of higher-reward candidates within each batch while regularizing towards a supervised reference policy: $L_{\mathrm{GRPO}}(\theta) = -\,\mathbb{E}_{x}\,\mathbb{E}_{i,j}\bigl[\ell(r_i,r_j)\cdot\log\pi_\theta(o_i\mid x)\bigr] + \beta\,\mathrm{KL}[\pi_\theta \,\|\, \pi_{\mathrm{ref}}]$, where $\ell(r_i,r_j)$ compares candidate rewards within a group and $\beta$ weights the KL regularizer.
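A minimal sketch of a group-relative update of this kind is below. The standardization of rewards within the group and the log-prob-gap KL approximation are common GRPO conventions, assumed here rather than taken from the paper:

```python
import math

# Sketch of a GRPO-style group-relative policy-gradient loss (illustrative;
# reward normalization and KL details are assumptions, not the paper's).

def grpo_loss(log_probs, rewards, ref_log_probs, beta=0.04):
    """log_probs: log pi_theta(o_i | x) for each candidate in the group.
    rewards: scalar r_i per candidate. The gradient of the returned loss
    raises the likelihood of above-average candidates within the group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n) or 1.0
    advantages = [(r - mean) / std for r in rewards]
    pg_term = -sum(a * lp for a, lp in zip(advantages, log_probs)) / n
    # KL to the reference policy, approximated by the mean log-prob gap.
    kl_term = sum(lp - rlp for lp, rlp in zip(log_probs, ref_log_probs)) / n
    return pg_term + beta * kl_term

loss = grpo_loss(
    log_probs=[-1.2, -0.8, -2.0],
    rewards=[1.0, 1.0, 0.0],   # e.g. two candidates compile, one fails
    ref_log_probs=[-1.0, -1.0, -1.0],
)
```

Note that the reward here is only group-relative: a candidate is pushed up or down according to how it compares to its sampled siblings, not to an absolute threshold.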
Hierarchical Preference Optimization (HPO) further refines the model with a DPO-style objective, directly ranking pairs of candidates according to end-to-end proof success.
This approach systematically exploits group-wise reward variance and pairwise preference signals, shown to improve success-rate@k by large margins—particularly on the Gaokao-Formal benchmark.
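The pairwise preference stage can be sketched as a standard DPO loss over a (winner, loser) candidate pair. The value of beta and the example log-probabilities are illustrative assumptions:

```python
import math

# Sketch of a DPO-style pairwise preference loss over candidate
# formalizations (beta and all log-prob values are illustrative).

def dpo_loss(lp_w, lp_l, ref_lp_w, ref_lp_l, beta=0.1):
    """lp_w / lp_l: policy log-probs of the preferred ("winner") and
    dispreferred ("loser") candidates; ref_*: reference-model log-probs."""
    margin = beta * ((lp_w - ref_lp_w) - (lp_l - ref_lp_l))
    return math.log(1.0 + math.exp(-margin))  # == -log sigmoid(margin)

# Preferring a candidate whose downstream proof succeeded over one that failed:
loss = dpo_loss(lp_w=-0.9, lp_l=-2.1, ref_lp_w=-1.0, ref_lp_l=-1.0)
```

The loss shrinks as the policy assigns relatively more probability mass to the candidate that led to a successful proof, which is exactly the end-to-end signal the preference stage exploits.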
3. Tool Feedback and Consistency Checking
Mathesis-Autoformalizer tightly integrates proof-assistant routines during both training and inference. Model outputs undergo:
- Lean 4 compiler tool-call: Each candidate is wrapped as a `<tool_call name="syntax_check" ... />` API call, and the compiler's error messages are encoded in context for further revision.
- Multi-LLMs-as-judge semantic check: Ensemble LLMs (e.g., Qwen3-32B/QWQ-32B) act as consistency judges for semantic validation, with a majority vote used to reduce false positives.
- Tool feedback masking: In the loss function, cross-entropy is masked on tool-result spans so that the model learns to respond to feedback rather than simply regurgitate it.
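The majority-vote step can be sketched as follows; `query_judge` is a hypothetical stand-in for the actual LLM-judge interface, not a real API:

```python
from collections import Counter

# Sketch of the multi-LLM majority-vote consistency check. The judge call
# is stubbed out; `query_judge` is a hypothetical interface.

def query_judge(judge_name, nl_problem, lean_statement):
    # Placeholder: a real system would prompt an LLM (e.g. Qwen3-32B)
    # and parse a "consistent"/"inconsistent" verdict from its reply.
    raise NotImplementedError

def majority_vote(verdicts):
    """Accept a formalization only if a strict majority of judges
    deems it consistent with the NL problem."""
    counts = Counter(verdicts)
    return counts["consistent"] > len(verdicts) / 2

ok = majority_vote(["consistent", "consistent", "inconsistent"])
```

Requiring a strict majority rather than a single positive verdict is what suppresses individual judges' false positives.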
Empirically, ablations show that complete tool feedback (both syntactic and semantic) lifts consistency rates by more than 40 percentage points compared to models trained without tool integration (Guo et al., 8 Oct 2025).
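The tool-feedback masking described above follows the usual supervised-fine-tuning convention of an ignore label in the cross-entropy loss; the token ids and span representation below are illustrative assumptions:

```python
# Sketch of tool-feedback masking: tokens inside tool-result spans get
# label -100 so cross-entropy ignores them, and the model is trained to
# respond to feedback rather than to reproduce it.

IGNORE_INDEX = -100  # conventional "ignore" label in cross-entropy losses

def mask_tool_results(token_ids, tool_spans):
    """Return labels equal to token_ids, except inside tool-result spans
    [start, end), where the label is IGNORE_INDEX."""
    labels = list(token_ids)
    for start, end in tool_spans:
        for i in range(start, min(end, len(labels))):
            labels[i] = IGNORE_INDEX
    return labels

labels = mask_tool_results([10, 11, 12, 13, 14], tool_spans=[(1, 3)])
```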
4. Data Foundations and Dataset Engineering
Mathesis-Autoformalizer leverages and extends several large-scale datasets designed explicitly for NL–formal alignment:
- NuminaMath-1.5: ~178k competition-level math problems.
- Numina-ATF: ~750k synthetic formalizations passing both compiler and judge checks.
- Gaokao-Formal: 488 human-curated problems from China's national college exam, with expert-verified Lean 4 statements and solutions.
- MathLibForm/ATLAS: Hundreds of thousands of concept-driven, compiler-validated NL–Lean pairs produced via expert iteration, distillation, and automated augmentation (contraposition, proof-based goal extraction) (Liu et al., 8 Feb 2025).
By grounding formalization in these datasets, the system avoids "unknown definition" failures, improves generalizability, and ensures high coverage of mathematical concepts.
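The dual-filter construction of synthetic pairs (keep only what both compiles and passes the judge) can be sketched as below; the two check functions are hypothetical stand-ins for the Lean tool call and the ensemble judge:

```python
# Sketch of the two-stage filter used when building synthetic NL-Lean pairs:
# a pair survives only if it compiles AND passes the semantic judge.

def compiles(lean_statement):
    raise NotImplementedError  # would invoke the Lean 4 compiler

def judged_consistent(nl_problem, lean_statement):
    raise NotImplementedError  # would query the ensemble LLM judges

def filter_pairs(pairs, compile_fn, judge_fn):
    """pairs: iterable of (nl_problem, lean_statement) tuples."""
    return [(nl, st) for nl, st in pairs
            if compile_fn(st) and judge_fn(nl, st)]

# With stubbed checks, only the compilable, consistent pair survives:
kept = filter_pairs(
    [("p1", "ok"), ("p2", "bad")],
    compile_fn=lambda st: st == "ok",
    judge_fn=lambda nl, st: True,
)
```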
5. Evaluation, Benchmarks, and Quantitative Results
Mathesis-Autoformalizer establishes new state-of-the-art performance on flagship benchmarks:
- Gaokao-Formal (k=6): Combined syntactic+semantic pass rate = 71% (GRPO + DPO), compared to 49% for Kimina baseline (absolute gain +22 pp, relative +45%).
- MiniF2F: Combined pass rate = 96% at k=6 and 79% at k=1.
- ATF-32B (Tool Feedback variant): Pass@1 consistency = 65.38% on CombiBench, outperforming prior Goedel-32B by +29.13 pp (Guo et al., 8 Oct 2025).
- Ablations: DPO preference tuning added +4 pp, and LeanScorer filtering reduced false positives (F1=0.92 vs 0.85 for binary LLM judge).
The pipeline shows incremental gains with increased group size and revision count; inference-time scaling with k→32 yields 100% pass@k on several datasets.
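Pass@k figures like those above are typically computed with the standard unbiased estimator: from n sampled candidates of which c pass, estimate the probability that at least one of k draws passes. The sample counts below are illustrative:

```python
from math import comb

# Standard unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
# where n = samples drawn, c = samples that pass, k = budget.

def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

p = pass_at_k(n=32, c=8, k=6)  # illustrative counts
```

This is why inference-time scaling of k closes the gap so quickly: even a modest per-sample success rate compounds toward 1.0 as k grows.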
6. Methodological Extensions, Limitations, and Future Directions
Noted limitations include:
- Occasional superficial Lean tricks (e.g., reducing a goal to `True := by sorry`) that bypass semantic filtering.
- Reliance on a single curated solution per problem in small theorem domains (e.g., Peano arithmetic), which may limit out-of-distribution robustness.
- Geometry problems require additional explicit constraint extraction in formalization, often omitted by LLMs (Xie et al., 15 Jul 2025).
- Current autoformalization focuses on statements, with proof-level translation and checking still under active development.
Opportunities for future work comprise:
- Expanding concept coverage to graduate-level mathematics and more diverse formal libraries.
- Integrating proof-search and tactic retrieval, enabling the generation of full working formal proofs, beyond "by sorry" placeholders.
- Better semantic scoring (embedding-based, back-translation, round-trip validation).
- Curriculum learning, adaptive sampling of weakly-performing concepts, and contrastive fine-tuning on near-miss formalizations.
7. Comparative Insights and Context
Mathesis-Autoformalizer is situated within a spectrum of autoformalization systems:
- Baselines using naïve prompting and few-shot LLM generation prove less robust, showing pass@1 rates in the 20–40% range.
- Retrieval-augmented generation (MS-RAG), denoising, and syntax error feedback steps enhance consistency and terminological match, seen in both Mathesis and sibling systems (Zhang et al., 5 Oct 2024).
- Post-generation re-ranking with symbolic equivalence and semantic consistency methods further close the pass@1 → pass@k gap (Li et al., 28 Oct 2024).
- Integrated pipelines with error-feedback, multi-pass sampling, and combined syntactic/semantic verification as in FMC and ATLAS further raise performance (Liu et al., 8 Feb 2025, Xie et al., 15 Jul 2025).
Mathesis-Autoformalizer unifies these advances by coupling functional RL-based optimization with tool integration and dataset engineering, achieving reliability and scalability on real-world NL-to-formal translation benchmarks.
Mathesis-Autoformalizer thus represents the current apex of end-to-end autoformalization technology, combining advances in LLM pretraining, data curation, reinforcement learning, and proof-assistant tool feedback to deliver robust formal translations of complex mathematical statements. It sets a foundation for fully automated formal theorem proving from natural language sources.