FormaRL: RL-Driven Autoformalization for Lean 4

Updated 28 August 2025
  • FormaRL is a reinforcement learning framework that autoformalizes natural language mathematics into Lean 4 code using a dual reward system for syntax and semantic accuracy.
  • It employs a hybrid reward mechanism combining Lean compiler checks and large language model comparisons to ensure both syntactic correctness and semantic fidelity.
  • Empirical results show significant improvements in pass rates (pass@1 and pass@16), outperforming supervised approaches with only a few hundred unlabeled examples.

FormaRL is a reinforcement learning–based framework for advancing autoformalization: the translation of natural language mathematics into formal language constructs, with a focus on Lean 4. Departing from supervised fine-tuning approaches reliant on large quantities of paired human annotations, FormaRL utilizes a hybrid reward mechanism—comprising Lean compiler syntax checking and semantic consistency assessment via LLM comparison—to drive policy improvement with minimal unlabeled data. The framework is further evaluated and benchmarked on both standard and advanced datasets, including the newly introduced uproof collection of undergraduate-level proof problems.
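
To make the task concrete, the snippet below is a hypothetical input–output pair of the kind a formalizer handles (it is illustrative and not drawn from the paper): an informal undergraduate-style prompt and a candidate Lean 4 statement whose proof body is left as `sorry`, since autoformalization targets the statement rather than the proof.

```lean
import Mathlib

/-- Informal problem: "Show that the sum of two even integers is even."
    A candidate formalization only needs to state the claim faithfully;
    whether `sorry` bodies are permitted during checking is an assumption here. -/
theorem sum_of_two_evens_is_even (a b : ℤ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  sorry
```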

1. Framework Architecture and Objectives

FormaRL aims to enhance the autoformalization capabilities of LLMs under minimal data requirements, using only a few hundred unlabeled mathematical problems. The formalizer model, which produces Lean 4 formalizations, learns through reinforcement learning rather than supervised learning on annotated data. The central innovation is the dual-reward mechanism, under which each candidate formalization is subject to:

  • Syntax Check (SC): Validation by compiling the output in Lean 4 (with Mathlib4 and relevant libraries), ensuring syntactic correctness and acceptance by the formal system.
  • Consistency Check (CC): An LLM is prompted to compare the original natural language problem against the formal output, evaluating whether its conditions, variables, and conclusions are faithfully preserved.

Only when both SC and CC evaluate positively is a reward signal of 1 assigned; otherwise, the reward is 0. This reward directly addresses both the formal language constraints and semantic adequacy, preventing “reward hacking” where trivial or syntactically correct but semantically vacuous outputs would otherwise be incentivized.
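
A minimal sketch of this conjunctive reward, with the two checks left as stubs (the actual compiler invocation and LLM prompt used by FormaRL are not reproduced here; a compiler-based syntax check is sketched in Section 3):

```python
def syntax_check(lean_code: str) -> bool:
    """Stub: compile the candidate in a fixed Lean 4 + Mathlib4 environment and
    return True iff it elaborates without errors."""
    raise NotImplementedError

def consistency_check(problem: str, lean_code: str) -> bool:
    """Stub: prompt an LLM to judge whether the conditions, variables, and
    conclusion of the formalization match the natural language problem."""
    raise NotImplementedError

def reward(problem: str, lean_code: str) -> float:
    # 1 only when BOTH checks pass; the conjunction is what blocks reward
    # hacking via outputs that compile but miss the problem's meaning.
    return 1.0 if syntax_check(lean_code) and consistency_check(problem, lean_code) else 0.0
```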

2. Reinforcement Learning Methodology

The policy is trained using a streamlined variant of Group Relative Policy Optimization (GRPO). For each natural language input $q$, a group of candidate outputs $\{o_1, o_2, \ldots, o_G\}$ is sampled under the current policy $\pi_{\theta_\text{old}}$. The training objective aggregates token-level policy updates across candidates in a group, driven by group-wise normalized advantage scores:

$$J_\text{GRPO}(\theta) = \mathbb{E}_{q,\{o_i\}}\left[\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\left\{ \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t}\mid q, o_{i,<t})}\, \hat{A}_{i,t},\ \operatorname{clip}\!\left(\frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t}\mid q, o_{i,<t})},\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_{i,t}\right\}\right]$$

Notably, the FormaRL implementation omits an explicit KL divergence term between new and old policies; empirical results indicate that this simplification enhances both training efficiency and final autoformalization accuracy (Huang et al., 26 Aug 2025).
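
A minimal PyTorch-style sketch of this objective, under stated assumptions: per-token log-probabilities for each candidate are precomputed, the old-policy values carry no gradient, and no KL penalty is added, matching the simplification above. Names and tensor shapes are illustrative rather than the paper's implementation.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, eps=0.2):
    """Minimal GRPO-style surrogate without a KL term.

    logp_new, logp_old: lists of G 1-D tensors; entry i holds per-token
        log-probabilities of candidate o_i under the new / old policy
        (old-policy values assumed computed without gradients).
    rewards: tensor of shape (G,) holding the 0/1 rewards of the group.
    """
    # Group-wise normalized advantage; the same scalar is broadcast to
    # every token of candidate o_i.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    per_candidate = []
    for i, (lp_new, lp_old) in enumerate(zip(logp_new, logp_old)):
        ratio = torch.exp(lp_new - lp_old)                        # pi_theta / pi_theta_old, per token
        unclipped = ratio * adv[i]
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv[i]
        # (1 / |o_i|) * sum_t min(unclipped, clipped) for this candidate.
        per_candidate.append(torch.min(unclipped, clipped).mean())

    # (1 / G) * sum_i ...; negated because optimizers minimize while J is maximized.
    return -torch.stack(per_candidate).mean()
```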

3. Reward Calculation and Checking Procedures

The reward pipeline is composed as follows:

| Component | Function | Output |
| --- | --- | --- |
| Syntax Check (SC) | Lean 4 compiler with Mathlib4 and relevant libraries (fixed environment) | Pass if the output compiles (syntactically correct) |
| Consistency Check (CC) | LLM prompt comparing the source problem with its formalization | Pass if the semantics align |
| Final Reward | Requires BOTH syntax and consistency checks to pass | 1 if both pass; 0 otherwise |

This strict conjunctive requirement raises the bar for valid formalizations and enforces both the technical and interpretive fidelity of model output.
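
The Syntax Check row can be approximated by compiling each candidate inside a fixed, pre-built Mathlib4 Lake project. The sketch below assumes such a project directory exists and uses the standard `lake env lean` invocation; FormaRL's actual checking harness may differ.

```python
import pathlib
import subprocess
import tempfile

LEAN_PROJECT = pathlib.Path("lean_project")  # assumed: a pre-built Lake project with Mathlib4

def syntax_check(lean_code: str, timeout_s: int = 120) -> bool:
    """Return True iff the candidate elaborates without errors in the fixed environment."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", dir=LEAN_PROJECT, delete=False) as f:
        f.write(lean_code)
        path = f.name
    try:
        proc = subprocess.run(
            ["lake", "env", "lean", path],   # compile with the project's toolchain and dependencies
            cwd=LEAN_PROJECT,
            capture_output=True,
            timeout=timeout_s,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        pathlib.Path(path).unlink(missing_ok=True)
```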

4. Dataset Construction and Evaluation Protocols

FormaRL’s benchmarking employs a variety of datasets, including:

  • miniF2F / ProofNet: Established collections of math problems for autoformalization.
  • uproof: A newly curated dataset of over 5,000 undergraduate-level proof problems extracted from 14 major mathematics textbooks, spanning analysis, algebra, topology, probability, and more. This resource supports robust evaluation of out-of-distribution generalization and advanced mathematical translation.

Experiments leverage as few as 859 unlabeled problems from these datasets for RL training—a dramatic reduction compared to prior works that utilize 25k–243k paired examples in supervised setups.
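
Because RL training consumes only unlabeled problem statements, the training set can be as lightweight as a list of natural language problems. The JSONL layout and loader below are purely illustrative, not the released data format:

```python
import json

# Assumed layout, one problem per line, e.g.:
# {"id": "uproof-0001", "problem": "Show that the sum of two even integers is even."}

def load_unlabeled_problems(path: str) -> list[str]:
    """Read unlabeled natural language problems to use as RL prompts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["problem"] for line in f if line.strip()]
```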

5. Empirical Performance and Comparative Results

Evaluation uses pass-rate metrics: pass@1 (the percentage of problems for which a single sampled output passes both checks) and pass@16 (the percentage of problems for which at least one of 16 sampled candidates passes both checks); a sketch of the pass@k computation follows at the end of this section:

  • Qwen2.5-Coder-7B-Instruct: pass@1 on uproof improved from 7.5% (SFT baseline) to 9.6% with FormaRL (+2.1 percentage points), and pass@16 from 21.2% to 33.6% (+12.4 points).
  • ProofNet: pass@1 autoformalization accuracy improved more than sixfold (from 4.04% to 26.15%) with only 859 unlabeled training examples.
  • Comparison to State-of-the-Art (SOTA): FormaRL’s RL-trained formalizer outperforms open-source SOTA autoformalizers such as DeepSeek-Math-7B in both in-distribution and out-of-distribution benchmarks (Huang et al., 26 Aug 2025).

These results demonstrate substantial advancements in sample efficiency and generalization, affirming the efficacy of the dual-reward mechanism and RL policy optimization.
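
As a reference for how these pass rates can be computed from sampled outputs, the sketch below uses the standard unbiased pass@k estimator over per-problem sample outcomes; the paper's exact evaluation script is not reproduced, so treat this as an assumption about how the metric is aggregated.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem given n sampled candidates,
    c of which passed both checks: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """results: (n_samples, n_passed) per problem; returns the benchmark-level mean."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# With 16 samples per problem, pass@1 averages the per-sample success rate,
# and pass@16 reduces to "at least one of the 16 candidates passed".
```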

6. Implications and Future Directions

The reduced dependence on annotated data and the direct alignment with formal verification criteria enable scalable construction of vetted formal corpora, which are critical for formal mathematics and automated theorem proving. FormaRL’s data-efficient paradigm suggests a shift toward reinforcement-based training as a viable pathway for domains with scarce expert annotations.

Potential extensions highlighted include:

  • Incorporation of dependency retrieval augmentation, providing richer contextual signals from related definitions or theorems during checking.
  • Integration of bidirectional extended definitional equivalence in the reward function, further tightening semantic fidelity.
  • Expansion of the underlying datasets to include research-level mathematics, challenging current architectures and fostering continued innovation.
  • Application of RL-based verification strategies to downstream tasks in automated theorem proving, potentially encompassing proof tactic synthesis or multi-stage formal verification pipelines.

7. Context within Autoformalization Research

FormaRL joins a landscape of recent RL-based autoformalization systems, distinguished by its exclusive reliance on formal syntax and semantic checks without human-labeled training data. In contrast, models such as StepFun-Formalizer (Wu et al., 6 Aug 2025) and Re:Form (Yan et al., 22 Jul 2025) employ supervised fine-tuning, curated dual-dataset strategies (formal knowledge, reasoning trajectories), and additional equivalence-based verification signals (BEq), demonstrating varied approaches along the spectrum of RL and SFT for formal translation. The introduction of uproof and the integration of automated checking modules reinforce current trends toward robust, data-efficient, and scalable formalization solutions.

In summary, FormaRL represents a targeted advancement for autoformalization under data-constrained settings. By combining Lean compiler validation and LLM-based semantic assessment, and optimizing via GRPO, it yields substantial performance gains with minimal supervision, advancing the reach of formal verification and theorem proving systems.
