FormaRL: RL-Driven Autoformalization for Lean 4
- FormaRL is a reinforcement learning framework that autoformalizes natural language mathematics into Lean 4 code using a dual reward system for syntax and semantic accuracy.
- It employs a hybrid reward mechanism combining Lean compiler checks and large language model comparisons to ensure both syntactic correctness and semantic fidelity.
- Empirical results show significant improvements in pass rates (pass@1 and pass@16), outperforming supervised approaches with only a few hundred unlabeled examples.
FormaRL is a reinforcement learning–based framework for advancing autoformalization: the translation of natural language mathematics into formal language constructs, with a focus on Lean 4. Departing from supervised fine-tuning approaches reliant on large quantities of paired human annotations, FormaRL utilizes a hybrid reward mechanism—comprising Lean compiler syntax checking and semantic consistency assessment via LLM comparison—to drive policy improvement with minimal unlabeled data. The framework is further evaluated and benchmarked on both standard and advanced datasets, including the newly introduced uproof collection of undergraduate-level proof problems.
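To make the task concrete, the sketch below shows what an autoformalization input/output pair can look like: a short natural-language statement rendered as a Lean 4 theorem over Mathlib. The statement and identifier names are illustrative, not drawn from FormaRL's datasets, and whether placeholder proofs such as `sorry` are permitted by a given checking setup is an assumption here.

```lean
import Mathlib

-- Natural language: "The sum of two even integers is even."
-- For autoformalization only the *statement* must be faithful; the
-- proof body may be left open (here with `sorry`, which compiles with
-- a warning) or discharged, e.g. via `exact ha.add hb`.
theorem sum_of_two_even_is_even (a b : ℤ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  sorry
```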
1. Framework Architecture and Objectives
FormaRL aims to enhance the autoformalization capabilities of LLMs with minimal data, using only a few hundred unlabeled mathematical problems. The formalizer model, which produces Lean 4 formalizations, learns through reinforcement learning rather than supervised training on annotated pairs. The central innovation is the dual-reward mechanism. Each candidate formalization is subject to:
- Syntax Check (SC): Validation by compiling the output in Lean 4 (with Mathlib4 and relevant libraries), ensuring syntactic correctness and acceptance by the formal system.
- Consistency Check (CC): An LLM is prompted to compare the original natural language problem with the formal output, evaluating whether the essential content (conditions, variables, conclusion) is preserved faithfully.
Only when both SC and CC evaluate positively is a reward signal of 1 assigned; otherwise, the reward is 0. This reward directly addresses both the formal language constraints and semantic adequacy, preventing “reward hacking” where trivial or syntactically correct but semantically vacuous outputs would otherwise be incentivized.
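A minimal Python sketch of this conjunctive reward rule, assuming two boolean-valued helpers, `syntax_check` (wrapping the Lean 4 compiler) and `consistency_check` (wrapping an LLM judge); both names are placeholders rather than the paper's API, and concrete sketches of both checks appear in Section 3.

```python
def compute_reward(nl_problem: str, lean_candidate: str,
                   syntax_check, consistency_check) -> float:
    """Conjunctive reward: 1.0 only if the candidate both compiles (SC)
    and is judged semantically faithful to the source problem (CC)."""
    if not syntax_check(lean_candidate):                   # Lean 4 + Mathlib4 compile
        return 0.0
    if not consistency_check(nl_problem, lean_candidate):  # LLM comparison
        return 0.0
    return 1.0
```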
2. Reinforcement Learning Methodology
The policy is trained using a streamlined variant of Group Relative Policy Optimization (GRPO). For each natural language input $x$, a group of $G$ candidate outputs $\{y_1, \dots, y_G\}$ is sampled under the current policy $\pi_{\theta_{\text{old}}}$. The training objective aggregates token-level policy updates across candidates in a group, driven by group-wise normalized advantage scores:

$$\mathcal{J}(\theta) = \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\!\Big( r_{i,t}(\theta)\,\hat{A}_i,\; \mathrm{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i \Big) \right],$$

where $r_{i,t}(\theta) = \pi_\theta(y_{i,t} \mid x, y_{i,<t}) / \pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})$ is the token-level importance ratio and $\hat{A}_i = \big(R_i - \mathrm{mean}(\{R_j\}_{j=1}^{G})\big) / \mathrm{std}(\{R_j\}_{j=1}^{G})$ is the group-normalized advantage computed from the binary rewards $R_i$.
Notably, the FormaRL implementation omits the explicit KL divergence penalty term used in standard GRPO; empirical results indicate that this simplification improves both training efficiency and final autoformalization accuracy (Huang et al., 26 Aug 2025).
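The sketch below computes this objective as a loss under the definitions above, assuming per-token log-probabilities from the current and sampling policies have already been gathered; tensor names, shapes, and the clipping constant are illustrative, not FormaRL's implementation.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, mask, clip_eps=0.2):
    """Clipped GRPO surrogate without a KL penalty term.

    logp_new, logp_old: (G, T) per-token log-probs under the current
        and sampling policies for G candidates of (padded) length T.
    rewards: (G,) binary rewards from the syntax + consistency checks.
    mask: (G, T) 1.0 for real tokens, 0.0 for padding.
    """
    # Group-normalized advantage: one scalar per candidate.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)
    adv = adv.unsqueeze(-1)                                     # (G, 1)

    # Token-level importance ratios with PPO-style clipping.
    ratio = torch.exp(logp_new - logp_old)                      # (G, T)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    surrogate = torch.min(unclipped, clipped)

    # Length-normalize per candidate, then average over the group.
    per_cand = (surrogate * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    return -per_cand.mean()   # negate: optimizers minimize
```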
3. Reward Calculation and Checking Procedures
The reward pipeline is composed as follows:
| Component | Function | Output |
| --- | --- | --- |
| Syntax Check (SC) | Lean 4 compiler with Mathlib4 and relevant libraries (fixed environment) | Pass if the candidate compiles |
| Consistency Check (CC) | LLM prompt comparing the source problem and its formalization | Pass if judged semantically aligned |
| Final Reward | Conjunction of SC and CC | 1 if both pass; 0 otherwise |
This strict conjunctive requirement raises the bar for valid formalizations and enforces both the technical and interpretive fidelity of model output.
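The following sketch shows one way the two checks could be wired up: the syntax check shells out to a fixed Lean 4 project containing Mathlib4, and the consistency check renders a comparison prompt for an LLM judge. The `lake env lean` invocation, the prompt wording, and the `llm_judge` callable are assumptions for illustration, not the paper's exact procedure.

```python
import subprocess
from pathlib import Path

def syntax_check(lean_code: str, project_dir: str = "lean_env") -> bool:
    """Compile the candidate inside a fixed Lean 4 + Mathlib4 project.
    `lake env lean <file>` is one common invocation; the exact command
    depends on how the checking environment is packaged."""
    src = Path(project_dir) / "Candidate.lean"
    src.write_text(lean_code)
    try:
        proc = subprocess.run(
            ["lake", "env", "lean", src.name],
            cwd=project_dir, capture_output=True, text=True, timeout=300,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0

CC_PROMPT = """You are checking an autoformalization.
Natural-language problem:
{problem}

Lean 4 formalization:
{formal}

Do the hypotheses, variables, and conclusion of the Lean statement
faithfully match the problem? Answer YES or NO."""

def consistency_check(problem: str, formal: str, llm_judge) -> bool:
    """`llm_judge` is any callable mapping a prompt string to a reply."""
    reply = llm_judge(CC_PROMPT.format(problem=problem, formal=formal))
    return reply.strip().upper().startswith("YES")
```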
4. Dataset Construction and Evaluation Protocols
FormaRL’s benchmarking employs a variety of datasets, including:
- miniF2F / ProofNet: Established collections of math problems for autoformalization.
- uproof: A newly curated dataset of over 5,000 undergraduate-level proof problems extracted from 14 major mathematics textbooks, spanning analysis, algebra, topology, probability, and more. This resource supports robust evaluation of out-of-distribution generalization and advanced mathematical translation.
Experiments leverage as few as 859 unlabeled problems from these datasets for RL training—a dramatic reduction compared to prior works that utilize 25k–243k paired examples in supervised setups.
5. Empirical Performance and Comparative Results
Evaluation uses pass-rate metrics: pass@1 (the fraction of problems for which a single sampled output passes both checks) and pass@16 (the fraction of problems for which at least one of 16 sampled candidates passes both checks); a pass@k estimator sketch closes this section:
- Qwen2.5-Coder-7B-Instruct: Pass@1 accuracy improved from 7.5% (SFT baseline) to 9.6% (FormaRL) on uproof (+2.1 percentage points). Pass@16 accuracy increased from 21.2% to 33.6% (+12.4 percentage points).
- ProofNet: The model’s pass@1 autoformalization accuracy improved by a factor of 4–6 (from 4.04% to 26.15%) with only 859 unlabeled training examples.
- Comparison to State-of-the-Art (SOTA): FormaRL’s RL-trained formalizer outperforms open-source SOTA autoformalizers such as DeepSeek-Math-7B in both in-distribution and out-of-distribution benchmarks (Huang et al., 26 Aug 2025).
These results demonstrate substantial advancements in sample efficiency and generalization, affirming the efficacy of the dual-reward mechanism and RL policy optimization.
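For reproducing such evaluations, the standard unbiased pass@k estimator (Chen et al., 2021) is a natural choice; whether FormaRL reports this estimator or a simple empirical fraction is an assumption here, so treat the sketch as illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n generated candidates, of which c pass both
    checks, is successful."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 candidates for one problem, 3 of which pass both SC and CC.
print(pass_at_k(16, 3, 1))   # ≈ 0.1875 (per-problem pass@1)
print(pass_at_k(16, 3, 16))  # 1.0 (per-problem pass@16)
# Averaging these per-problem values over a benchmark gives the
# reported pass@1 and pass@16 figures.
```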
6. Implications and Future Directions
The reduced dependence on annotated data and the direct alignment with formal verification criteria enable scalable construction of vetted formal corpora, which are critical for formal mathematics and automated theorem proving. FormaRL’s data-efficient paradigm suggests a shift toward reinforcement-based training as a viable pathway for domains with scarce expert annotations.
Potential extensions highlighted include:
- Incorporation of dependency retrieval augmentation, providing richer contextual signals from related definitions or theorems during checking.
- Integration of bidirectional extended definitional equivalence in the reward function, further tightening semantic fidelity.
- Expansion of the underlying datasets to include research-level mathematics, challenging current architectures and fostering continued innovation.
- Application of RL-based verification strategies to downstream tasks in automated theorem proving, potentially encompassing proof tactic synthesis or multi-stage formal verification pipelines.
7. Context within Autoformalization Research
FormaRL joins a landscape of recent RL-based autoformalization systems, distinguished by its exclusive reliance on formal syntax and semantic checks without human-labeled training data. In contrast, models such as StepFun-Formalizer (Wu et al., 6 Aug 2025) and Re:Form (Yan et al., 22 Jul 2025) employ supervised fine-tuning, curated dual-dataset strategies (formal knowledge, reasoning trajectories), and additional equivalence-based verification signals (BEq), demonstrating varied approaches along the spectrum of RL and SFT for formal translation. The introduction of uproof and the integration of automated checking modules reinforce current trends toward robust, data-efficient, and scalable formalization solutions.
In summary, FormaRL represents a targeted advancement for autoformalization under data-constrained settings. By combining Lean compiler validation and LLM-based semantic assessment, and optimizing via GRPO, it yields substantial performance gains with minimal supervision, advancing the reach of formal verification and theorem proving systems.