- The paper shows that successful Lean 4 autoformalizations remain semantically invariant under paraphrasing, with compile-boundary failures driving inconsistencies.
- It employs 60 deterministic surface perturbations on theorem statements and evaluates outputs using BEq+ and GTED across various models.
- The findings suggest that targeted training interventions to improve compile consistency may enhance the robustness of autoformalization systems.
Problem Statement and Motivation
The paper analyzes a central robustness question in Lean 4 autoformalization: when meaning-preserving paraphrases of a theorem statement yield divergent formal outputs, do those outputs genuinely disagree semantically, or do they merely fail at the compilation level? Most benchmarks evaluate LLM-based formalizers on a single phrasing per theorem, so reported scores may fluctuate under paraphrasing without revealing which of the two effects is responsible.
Methodological Approach
The investigation employs a suite of 60 deterministic surface perturbation rules that alter the phrasing of theorem statements (e.g., conditional restructuring, concept renaming, quantifier variation), applied to the ProofNet# and miniF2F datasets. Each rule is triggered by a regular expression, protected by guards, and drawn from textbook phrasing so that the meaning of the statement is preserved. Mathematical spans are masked during paraphrasing so that they pass through unchanged. Four GPT-family models and three open-weight 7B autoformalizers are evaluated.
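The rule mechanism described above (regex trigger, guard, math masking) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `$...$` math delimiter, the `@@i@@` placeholder, the single "If A, then B." rule, and the guard condition are all hypothetical.

```python
import re

# Assumption: math spans are written inline as $...$.
MATH_SPAN = re.compile(r"\$[^$]*\$")

def mask_math(text):
    """Replace each math span with a placeholder; return masked text and the spans."""
    spans = []
    def stash(match):
        spans.append(match.group(0))
        return f"@@{len(spans) - 1}@@"
    return MATH_SPAN.sub(stash, text), spans

def unmask_math(text, spans):
    """Put the original math spans back in place of their placeholders."""
    return re.sub(r"@@(\d+)@@", lambda m: spans[int(m.group(1))], text)

# Hypothetical rule (conditional restructuring): "If A, then B." -> "B whenever A."
RULE_TRIGGER = re.compile(r"^If (?P<hyp>[^,]+), then (?P<concl>.+)\.$")

def guard(match):
    """Hypothetical guard: refuse to fire on a nested conditional in the hypothesis."""
    return " if " not in match.group("hyp")

def perturb(statement):
    """Apply the rule deterministically; return the input unchanged if it does not fire."""
    masked, spans = mask_math(statement)
    m = RULE_TRIGGER.match(masked)
    if m is None or not guard(m):
        return statement
    concl = m.group("concl")
    rewritten = f"{concl[0].upper()}{concl[1:]} whenever {m.group('hyp')}."
    return unmask_math(rewritten, spans)

print(perturb("If $f(x, y) > 0$, then $f$ is positive."))
```

Masking before matching is what lets the trigger ignore the comma inside `$f(x, y) > 0$`; without it, the pattern would split the hypothesis in the wrong place.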
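To adjudicate semantic equivalence, two metrics are used:
- BEq+: checks whether two Lean statements are semantically equivalent via bidirectional proof search.
- GTED: measures the structural similarity of the two outputs' ASTs.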
Compile-boundary failures (i.e., Lean output not accepted due to syntax/elaboration/identifier errors) are isolated to separate semantic agreement among successful formalizations from failures caused by syntactic or API problems.
Main Results
Semantic Robustness among Compiling Outputs
When both baseline and perturbed outputs compile, all N=602 paired predictions are semantically equivalent under BEq+ and structurally near-identical under GTED (median AST similarity 1.0). No exceptions were found in this regime across four GPT-family models and the datasets, despite low absolute accuracy (4.9–11.0%) versus reference formalizations.
Thus, paraphrase sensitivity arises almost solely from compile-boundary failures rather than any semantic divergence within successful generations.
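The compile-conditional analysis can be illustrated with a small bookkeeping sketch. The `Pair` record and the `summarize` function are hypothetical names introduced here; the equivalence verdict is assumed to come from an external checker such as BEq+ and is only consulted when both outputs compile.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Pair:
    base_compiles: bool               # baseline output accepted by Lean
    pert_compiles: bool               # perturbed output accepted by Lean
    equivalent: Optional[bool] = None # external verdict, meaningful only if both compile

def summarize(pairs):
    """Partition pairs by compile status and report equivalence within the both-compile regime."""
    both = [p for p in pairs if p.base_compiles and p.pert_compiles]
    boundary = [p for p in pairs if p.base_compiles != p.pert_compiles]
    neither = [p for p in pairs if not p.base_compiles and not p.pert_compiles]
    cond_equiv = (
        sum(p.equivalent is True for p in both) / len(both) if both else None
    )
    return {
        "n_pairs": len(pairs),
        "n_both_compile": len(both),
        "n_boundary": len(boundary),
        "n_neither": len(neither),
        "compile_conditional_equivalence": cond_equiv,
    }

# Toy data, not results from the paper.
pairs = [Pair(True, True, True), Pair(True, True, True), Pair(True, False), Pair(False, False)]
print(summarize(pairs))
```

Keeping the boundary pairs in their own bucket is the point of the design: they are counted as compile failures and never enter the semantic-equivalence rate.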
Surface Consistency and Compilation Failures
Across models, surface consistency under paraphrasing is highly variable: agreement rates for GPT models range from 19.1% to 49.5% on ProofNet# and from 44.5% to 55.5% on miniF2F, and the open-weight 7B models show similarly low consistency (19.8–55.6%). Compile rates per direction are only 11–24%, and most inconsistencies occur where at least one output fails to compile.
The taxonomy of compile failures is dataset-specific:
- ProofNet#: Dominated by unknown identifiers (34–50%); failures stem from misnavigating Mathlib API mappings due to input perturbations.
- miniF2F: Syntax and elaboration errors predominate (47–70% jointly); perturbations destabilize Lean code construction or vocabulary mapping.
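A failure taxonomy of this kind could be computed by bucketing compiler messages, as in the sketch below. The bucket names and the message fragments matched here are assumptions chosen for illustration; the paper's actual classification rules are not given in this summary.

```python
import re
from collections import Counter

# Hypothetical buckets keyed on fragments of Lean error messages.
CATEGORIES = [
    ("unknown_identifier", re.compile(r"unknown (identifier|constant)")),
    ("syntax", re.compile(r"unexpected token")),
    ("elaboration", re.compile(r"type mismatch|failed to synthesize")),
]

def classify(message):
    """Return the first bucket whose pattern occurs in the message, else 'other'."""
    for name, pattern in CATEGORIES:
        if pattern.search(message):
            return name
    return "other"

def failure_shares(messages):
    """Fraction of messages falling in each bucket."""
    counts = Counter(classify(m) for m in messages)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}

# Toy messages, not taken from the paper's logs.
messages = [
    "unknown identifier 'Nat.foo'",
    "unknown constant 'Real.bar'",
    "unexpected token ':='; expected term",
    "maximum recursion depth has been reached",
]
print(failure_shares(messages))
```

Running the same tally per dataset is what would surface a contrast like the one reported above, with identifier errors dominating on one benchmark and syntax or elaboration errors on the other.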
Destabilizing axes (e.g., concept renaming, conditional restructuring) differ by dataset. For instance, concept renaming is especially detrimental on miniF2F (5% consistency), but less so on ProofNet#.
Model Generalization and Memorization Checks
Open-weight autoformalizers display the same compile-boundary failure pattern as closed GPT models. Memorization is not a dominant factor, as measured by n-gram similarity audits.
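One simple form an n-gram similarity audit could take is sketched below. Whitespace tokenization, trigrams, and the overlap ratio are assumptions made for illustration; the summary does not specify the audit's exact procedure.

```python
def ngrams(tokens, n):
    """Set of contiguous n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(output, reference, n=3):
    """Fraction of the output's n-grams that also appear in the reference."""
    out_grams = ngrams(output.split(), n)
    ref_grams = ngrams(reference.split(), n)
    if not out_grams:
        return 0.0
    return len(out_grams & ref_grams) / len(out_grams)

# Toy strings: a high overlap with a benchmark reference would hint at memorization.
print(ngram_overlap("a b c d", "a b c e"))
```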
Equivalence Metrics: BEq+ Completeness
The equivalence claim is contingent on BEq+. Although BEq+ is sound, coverage is incomplete; rare cases where bidirectional proof search cannot verify equivalence are documented, especially in semantically correct but structurally divergent outputs.
Practical and Theoretical Implications
This dissociation between compile-boundary failures and semantic variance reorients both evaluation and intervention strategies:
- Benchmarks should report surface consistency (across paraphrasing) separately from compile-conditional semantic equivalence.
- Compile-boundary failures are automatic, dataset-dependent signals suitable for targeted training interventions: e.g., retrieval-augmented identifier resolution, constrained decoding over Mathlib API for ProofNet#, and compiler-feedback-based fine-tuning for miniF2F.
- Closing the compile boundary gap would increase robustness to superficial linguistic variation, revealing true semantic capability under meaning-preserving transformations.
The findings indicate that end-to-end autoformalization robustness is currently bottlenecked at the interface between linguistic triggers and code-level syntax/API mapping, not at semantic modeling by the LLM. This argues for focusing training and evaluation on the syntactic/compilation layer.
Limitations and Future Directions
- The perturbation suite is focused narrowly on conventional English via deterministic rules and does not address notational/multilingual edits.
- Larger and differently trained autoformalizers (beyond 7B) have not been systematically probed.
- BEq+ completeness is a limiting factor, and only partial GTED cross-checking was performed.
Future research should target compiler-aware training mechanisms and expand evaluation to broader linguistic, notational, and cross-language scenarios, along with further refinement of semantic equivalence checking.
Conclusion
"Surface Sensitivity in Lean 4 Autoformalization" rigorously localizes robustness failures to the compilation boundary, demonstrating that autoformalizers are semantically invariant under paraphrasing conditional on successful compilation. This insight shifts both evaluation methodology and intervention focus toward compile-boundary issues, with direct implications for advancing robustness and semantic fidelity in the automatic formalization of mathematics.