Consistent Autoformalization
- Consistent autoformalization is a method that systematically translates informal mathematical statements into formal languages while preserving syntactic, terminological, and semantic consistency.
- It integrates techniques like retrieval-augmented generation, denoising operations, and iterative syntax error feedback to markedly improve formal proof generation rates.
- Employing methods such as MS-RAG and Auto-SEF, the approach enhances library-scale formal verification without needing LLM retraining and mitigates style drift.
Consistent autoformalization is the design and implementation of automatic, machine-driven pipelines for translating informal mathematical or logical statements into formal languages—while systematically preserving syntactic, terminological, and, critically, semantic consistency across instances, domains, and evolving mathematical libraries. Unlike naive, direct LLM-based translation, consistent autoformalization augments LLM generation with explicit quality control mechanisms. These include similarity-based retrieval, denoising operations, syntax-aware feedback loops, and multi-stage error correction, so that resulting machine-verifiable outputs maintain uniform notation, structure, and meaning, even as the corpus and target library grow or evolve. This approach is particularly salient in the context of proof assistants such as Isabelle/ZF, Lean 4, and Coq, where the compositionality and reuse of formal snippets require both robust local correctness and global stylistic coherence.
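Viewed schematically, the pipeline composes a retriever, an LLM generator, a denoiser, and a prover-driven correction loop. The Python skeleton below is a minimal sketch of that staged architecture; the function names, signatures, and the iteration bound are illustrative assumptions rather than the published implementation.

```python
from typing import Callable, List, Tuple

# An exemplar is an (informal, formal) pair drawn from the knowledge base.
Exemplar = Tuple[str, str]

def consistent_autoformalize(
    statement: str,
    retrieve: Callable[[str, int], List[Exemplar]],  # MS-RAG retriever over the KB
    generate: Callable[[str, List[Exemplar]], str],  # LLM translation with in-context exemplars
    denoise: Callable[[str, List[Exemplar]], str],   # code- and/or prompt-based clean-up
    check: Callable[[str], List[str]],               # prover syntax check, returns error messages
    k: int = 3,
    max_rounds: int = 5,
) -> str:
    """Translate an informal statement into formal code with consistency controls."""
    exemplars = retrieve(statement, k)           # 1. most-similar retrieval
    candidate = generate(statement, exemplars)   # 2. retrieval-augmented generation
    candidate = denoise(candidate, exemplars)    # 3. denoising
    for _ in range(max_rounds):                  # 4. Auto-SEF-style correction loop
        errors = check(candidate)
        if not errors:
            break
        candidate = generate(
            f"Fix this syntax error: {errors[0]}\n{candidate}", exemplars
        )
    return candidate
```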
1. Motivating Principles and Challenges
The central challenge motivating consistent autoformalization lies in the inherent variability and ambiguity of informal mathematical language relative to the strict, machine-readable requirements of formal systems. Although LLMs have shown substantial improvement in open-domain translation, this progress has not automatically yielded formal representations that exhibit the uniformity and reliability needed for large-scale mathematical library construction or systematic formal verification. Key sources of inconsistency include:
- Syntactic drift: LLM outputs often produce statements with syntactic errors, ad-hoc notation, or unbalanced delimiters, leading to parse failures downstream.
- Terminological and style drift: The absence of global context results in inconsistent naming (e.g., “union” vs. “∪”, “neighborhood system” vs. “nhds”), mismatched operator usage, and insufficient alignment with established library idioms.
- Semantic underspecification: Missing assumptions, incomplete hypotheses, or unfaithful renderings limit the trustworthiness of machine-checked outputs.
- Scaling mismatch: As the formal knowledge base expands, naive few-shot or zero-shot prompting cannot ensure reuse and style coordination across thousands of objects, lemmas, and definitions.
Addressing these issues requires a move beyond pure LLM-driven solutions towards a more architected approach that integrates retrieval, feedback, and iterative correction mechanisms.
2. Most-Similar Retrieval-Augmented Generation (MS-RAG)
The cornerstone of consistent autoformalization in library scenarios is Most-Similar Retrieval-Augmented Generation (MS-RAG). Instead of prompting the LLM with arbitrary or fixed in-context examples, MS-RAG retrieves from a corpus or knowledge base (KB) the k most similar previously formalized (informal, formal) pairs. Similarity is measured—typically via BM25 or related term-weighting schemes—between the new informal input and stored examples, ensuring topical as well as notational proximity.
Let the KB be represented as a collection of (informal, formal) pairs {(Iᵢ, Fᵢ)}; for a new informal statement I*, the retriever returns the k most similar pairs. These k pairs are injected into the LLM prompt alongside the target I*. This strategy confines the LLM’s stylistic and terminological choices to the local conventions of the most relevant part of the formal library, thereby promoting consistency at both the syntactic and semantic levels.
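A minimal retrieval sketch is shown below, assuming the third-party rank_bm25 package as the term-weighting backend and a toy in-memory KB; the prompt template and helper names are illustrative.

```python
from rank_bm25 import BM25Okapi  # pip install rank_bm25

# Toy knowledge base of (informal, formal) pairs; a real KB holds library-scale entries.
kb = [
    ("The union of two open sets is open.",
     'lemma union_open: assumes "U ∈ T" "V ∈ T" shows "U ∪ V ∈ T"'),
    ("If a is an integer and b is a positive integer, then a + b is an integer.",
     'lemma int_add_closed: assumes "a ∈ int" "b ∈ int⁺" shows "a + b ∈ int"'),
]

tokenized = [informal.lower().split() for informal, _ in kb]
bm25 = BM25Okapi(tokenized)

def retrieve_most_similar(statement: str, k: int = 3):
    """Return the k (informal, formal) pairs whose informal side is most similar to `statement`."""
    scores = bm25.get_scores(statement.lower().split())
    ranked = sorted(range(len(kb)), key=lambda i: scores[i], reverse=True)
    return [kb[i] for i in ranked[:k]]

def build_prompt(statement: str, exemplars) -> str:
    """Inject the retrieved pairs as in-context examples ahead of the target statement."""
    shots = "\n\n".join(f"Informal: {i}\nFormal: {f}" for i, f in exemplars)
    return f"{shots}\n\nInformal: {statement}\nFormal:"
```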
Empirical results (e.g., Table A) demonstrate that, on MathLibForm with Mistral 7B, MS-RAG increases syntax pass rates from 0.0% (zero-shot) and 5.47% (fixed 3-shot) to 17.15%. BLEU-2, ChrF, RUBY, and CodeBERTScore all increase markedly, indicating improvement on both surface-form and semantic embedding metrics.
3. Denoising: Code- and Prompt-Based Normalization
MS-RAG addresses similarity and style but does not guarantee that generated formal code is parsable by the target theorem prover. To that end, denoising procedures are employed:
Code-Based Denoising (CBD):
A suite of regex- or AST-based rewrite rules is applied to LLM-generated outputs to strip out extraneous narratives, spurious explanations, and non-standard code artifacts, thereby isolating only the required lemma/definition header and statement.
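A minimal code-based denoising sketch follows; the regular expressions target typical LLM artifacts (markdown fences, prose preambles, trailing proof bodies) and are illustrative assumptions rather than the paper's exact rule set.

```python
import re

def code_based_denoise(raw: str) -> str:
    """Strip non-code artifacts from an LLM output, keeping only the lemma/definition statement."""
    text = raw
    # Remove markdown code fences such as ```isabelle ... ```
    text = re.sub(r"```[a-zA-Z]*\n?", "", text)
    # Drop any narrative preamble before the first statement-opening keyword.
    match = re.search(r"\b(lemma|theorem|definition|corollary)\b", text)
    if match:
        text = text[match.start():]
    # Cut off any proof body: the statement ends where the proof begins.
    text = re.split(r"\bproof\b|\bby\b|\busing\b", text)[0]
    # Normalize whitespace.
    return re.sub(r"\s+", " ", text).strip()
```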
Prompt-Based Denoising (PBD):
Through dedicated “clean-up” prompts, which reuse the same set of retrieved examples and emphasize style and structure alignment, the LLM is asked to re-render the candidate code, removing proofs and standardizing notation. For example, PBD ensures “∈” is used instead of “in” or “element of,” and that operator precedence matches conventions in prior formalizations.
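Prompt-based denoising can be sketched as a second LLM call built around the same retrieved exemplars; the clean-up instruction wording below is an assumption, and `llm` stands in for whatever completion interface is available.

```python
from typing import Callable, List, Tuple

def prompt_based_denoise(
    candidate: str,
    exemplars: List[Tuple[str, str]],
    llm: Callable[[str], str],
) -> str:
    """Ask the model to re-render the candidate in the style of the retrieved formalizations."""
    style_examples = "\n".join(formal for _, formal in exemplars)
    prompt = (
        "Rewrite the following Isabelle statement so that it matches the notation and "
        "style of the reference statements. Remove any proof, explanation, or narrative; "
        "use symbols such as ∈ instead of words like 'in' or 'element of'.\n\n"
        f"Reference statements:\n{style_examples}\n\n"
        f"Statement to clean up:\n{candidate}\n\nCleaned statement:"
    )
    return llm(prompt).strip()
```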
Combining CBD and PBD generally achieves the highest gains: for Mistral 7B, pass rates improve from 17.15% with MS-RAG alone, to 21.53% after CBD, and to 28.10% with CBD and PBD combined (see Table B).
4. Auto-SEF: Auto-Correction with Syntax Error Feedback
Denoising alone cannot resolve deeper syntactic errors, especially as LLMs are prone to generating outputs that superficially resemble correct code but still fail to parse or type-check. Auto-SEF introduces iterative correction by leveraging the target prover’s syntax checker. Given a (potentially denoised) candidate, the loop proceeds as follows:
- Errors are reported by the prover.
- The first (or most critical) error is formatted and injected into a targeted correction prompt to the LLM, along with the original MS-RAG exemplars and current code.
- The LLM revises only the region implicated by the error, producing an updated candidate.
- This process repeats up to K times or until no errors remain.
This auto-correction loop is lightweight (no model retraining is required) and yields a further gain of roughly 5 percentage points in pass rate after five iterations (see Table C).
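The loop can be sketched as follows; `check_syntax` is a hypothetical wrapper around the prover's client interface (e.g., an Isabelle server session) and `llm` is again an arbitrary completion function.

```python
from typing import Callable, List, Tuple

def auto_sef(
    candidate: str,
    exemplars: List[Tuple[str, str]],
    llm: Callable[[str], str],
    check_syntax: Callable[[str], List[str]],  # hypothetical: returns prover error messages
    max_iters: int = 5,
) -> str:
    """Iteratively repair syntax errors reported by the theorem prover."""
    shots = "\n\n".join(f"Informal: {i}\nFormal: {f}" for i, f in exemplars)
    for _ in range(max_iters):
        errors = check_syntax(candidate)
        if not errors:
            break  # the statement parses; stop early
        prompt = (
            f"{shots}\n\n"
            "The following Isabelle statement fails to parse with this error:\n"
            f"{errors[0]}\n\n"
            f"Statement:\n{candidate}\n\n"
            "Return a corrected statement, changing only what the error requires:"
        )
        candidate = llm(prompt).strip()
    return candidate
```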
A representative example:
- Input NL: "Given a,b ∈ ℤ, with b ∈ ℤ⁺, then a ≤ a+b, a ≠ a+b, and a+b ∈ ℤ."
- Final Auto-SEF output:
lemma Int_ZF_1_5_L7A: assumes "a ∈ int" "b ∈ int⁺"
  shows "a ≤ a + b"  "a ≠ a + b"  "a + b ∈ int"
5. Evaluation, Scalability, and Limitations
Evaluation leverages a suite of established string-based and semantic metrics (BLEU-2, ChrF, RUBY, CodeBERTScore) as well as syntactic pass rates computed via the theorem prover’s client interface (the Isabelle client for Isabelle/ZF in the cited case). On the MathLibForm test set, MS-RAG with denoising and Auto-SEF increases pass rates for Mistral 7B from 0% (zero-shot) to 33.58% after all refinement steps.
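For the string-based metrics, off-the-shelf implementations suffice; the sketch below assumes the nltk and sacrebleu packages and covers only BLEU-2 and ChrF (RUBY and CodeBERTScore require their own tooling).

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction  # pip install nltk
from sacrebleu.metrics import CHRF                                      # pip install sacrebleu

def bleu2(reference: str, hypothesis: str) -> float:
    """Bigram BLEU between a reference formalization and a generated one."""
    smooth = SmoothingFunction().method1
    return sentence_bleu(
        [reference.split()], hypothesis.split(),
        weights=(0.5, 0.5), smoothing_function=smooth,
    )

def chrf(reference: str, hypothesis: str) -> float:
    """Character n-gram F-score via sacrebleu."""
    return CHRF().sentence_score(hypothesis, [reference]).score

# Usage on a toy pair:
ref = 'lemma add_int: assumes "a ∈ int" "b ∈ int" shows "a + b ∈ int"'
hyp = 'lemma add_int: assumes "a ∈ int" and "b ∈ int" shows "a + b ∈ int"'
print(bleu2(ref, hyp), chrf(ref, hyp))
```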
Because retrieval (here, BM25) operates efficiently over large KBs, and denoising/feedback are prompt-only (not parameter updates), the pipeline remains tractable as libraries scale. However, limitations persist: extremely informal or highly novel statements may not retrieve suitable exemplars, and LLMs struggle to preserve deeper semantic relationships while fixing syntax in multi-iteration error correction, causing diminishing returns after ~5-6 cycles. Formatting and style alignment can sometimes come at the cost of semantic precision, especially if LLMs overfit to the retrieved context.
6. Representative Examples and Impact
Examples drawn from the dataset illustrate the effect of each mechanism. For instance, baseline 3-shot prompting without retrieval may hallucinate new object names (“NHS(X)”), while MS-RAG roots the generated code in the precise library context (“{V ∈ Pow(⋃ T). ∃U ∈ T. x ∈ U ∧ U ⊆ V}”). Auto-SEF resolves lexical and structural parse errors (such as unmatched “{” or missing “∧”) that are otherwise a major barrier to aggregating LLM-produced code into a coherent library.
The approach supports library-scale autoformalization without retraining, mitigates the risk of style drift, and enables meaningful reuse of notation, naming, and terminology—key for sustaining large collaborative formal mathematics efforts.
7. Conclusion and Directions
Consistent autoformalization, as instantiated by the coordinated use of MS-RAG, denoising, and auto-correction via syntax error feedback, delivers marked improvements in the syntactic, stylistic, and semantic quality of LLM-generated formal mathematics. The method avoids the need for model retraining to achieve library-aware consistency, is modular enough for straightforward extension, and achieves notable pass rates and alignment metrics on MathLibForm and related testbeds (Zhang et al., 5 Oct 2024). Remaining challenges concern deeper semantic preservation, scalability to extremely large or heterogeneous corpora, and the full integration of semantic embedding-based retrieval and ranking in place of term-based methods. The paradigm has broad potential for future extension into interactive proof environments and cross-domain formal knowledge bases, advancing the robustness and adoption of autoformalization methods for rigorous mathematical infrastructure.