CriticLean: Critic-Guided Autoformalization in Lean 4
- CriticLean is a critic-guided reinforcement learning framework that translates natural language mathematical statements into semantically faithful Lean 4 code.
- It leverages gradient-based feedback through CriticLeanGPT models, combining supervised fine-tuning with RL to correct subtle translation errors.
- The framework achieves up to 84% end-to-end correctness and is validated via specialized benchmarks and the extensive FineLeanCorpus dataset.
CriticLean is a critic-guided reinforcement learning framework designed to advance the field of automated mathematical formalization by elevating the role of the “critic” from a mere validator to an active, gradient-providing component. Specifically targeted at the translation of natural language (NL) mathematical statements into formal Lean 4 code, CriticLean systematically addresses the limitations of previous work, which has predominantly focused on syntactic correctness (e.g., compilation) rather than ensuring true semantic fidelity to the original mathematical intent. The framework consists of three primary technical contributions: the CriticLeanGPT family of critic models, the CriticLeanBench benchmark for high-precision critic evaluation, and the FineLeanCorpus, the largest open-source Lean 4 formalization dataset to date (Peng et al., 8 Jul 2025).
1. Motivation and Problem Statement
Previous approaches to autoformalization have largely prioritized the generation and compilation (“actor phase”) of Lean 4 code from NL mathematical statements, generally treating the “critic phase” as a passive compilation check. This paradigm is insufficient, as syntactic correctness does not guarantee semantic equivalence; Lean code may compile yet misrepresent essential aspects of the original statement, such as hypotheses, quantifier structure, or conceptual goals. The lack of a semantically rigorous critic leads to “syntactic overfitting,” wherein models optimize for passing compiler checks but systematically fail more substantive conceptual evaluations.
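To make the gap between compiling and being faithful concrete, here is a small illustrative pair of Lean 4 statements. The example is not taken from the CriticLean paper; the natural-language statement and the theorem names `faithful` and `unfaithful` are assumptions chosen for illustration, and the sketch assumes Mathlib is available. Both declarations are accepted by the compiler (with `sorry` warnings), yet only the first matches the intended statement.

```lean
import Mathlib

-- NL statement: "For every positive real number x, x + 1/x ≥ 2."

-- Faithful formalization: the positivity hypothesis is preserved.
theorem faithful (x : ℝ) (hx : 0 < x) : x + 1 / x ≥ 2 := by
  sorry

-- Also compiles, but silently drops the hypothesis `0 < x`.
-- The resulting statement is different (and false, e.g. at x = -1).
theorem unfaithful (x : ℝ) : x + 1 / x ≥ 2 := by
  sorry
```

A compilation-only check cannot tell these two apart, which is the kind of premise-level error a semantic critic is meant to catch.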
CriticLean proposes that by elevating the critic from a filter to an active teacher—providing gradient-based feedback within a reinforcement learning (RL) loop—the process can be “closed” so that only formalizations with high semantic fidelity are reinforced. This enables systematic correction of subtle translation errors and materially improves the reliability of autogenerated formal statements.
2. Architecture: CriticLeanGPT
The CriticLeanGPT family consists of lightweight, domain-specialized LLMs built on Qwen2.5 and Qwen3 backbones. Its training pipeline comprises two distinct phases:
Supervised Fine-Tuning (SFT):
- The CriticLeanInstruct dataset (48,000 examples) forms the core of SFT. It integrates:
  - 16,000 seed examples: 2,000 human-annotated correct and 2,000 incorrect critic judgments, each accompanied by chain-of-thought feedback, plus 12,000 further examples mixed from code and math corpora at a 1:3 ratio.
  - 32,000 additional examples augmented with data from FormalMATH and over 30,000 synthetically injected errors.
- The objective is to minimize the standard cross-entropy loss on the binary task of assessing whether a given (NL statement, Lean code) pair is correct or incorrect.
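For reference, a binary cross-entropy objective of this kind is commonly written as follows. This is a sketch in notation introduced here, not the paper's own formula: $x$ is the NL statement, $c$ the Lean code, $y \in \{0, 1\}$ the gold label (1 = correct), $\mathcal{D}$ the training set, and $p_\theta$ the critic's predicted probability of "correct".

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,c,\,y)\sim\mathcal{D}}
    \Big[\, y \,\log p_\theta(\text{correct} \mid x, c)
          + (1 - y)\,\log\big(1 - p_\theta(\text{correct} \mid x, c)\big) \Big]
```

In practice an LLM critic that also emits chain-of-thought feedback would typically be trained with token-level cross-entropy over its whole response; the binary form above is a simplification that isolates the final judgment.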
Reinforcement Learning (RL):
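- RL is performed via GRPO within the VeRL framework, using 4,000 seed examples with gold “correct/incorrect” judgments. A clipped PPO-style objective is used, which in its standard form reads:

$$J(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right]$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t)\,/\,\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the probability ratio between the current and previous policies, $\epsilon$ is the clipping range, and $\hat{A}_t$ denotes the advantage.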
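- Reward signals are a binary accuracy reward, which checks whether the critic's judgment matches the gold label, and a format reward, which checks whether the output follows the required response format.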
This approach enables the model not only to filter incorrect formalizations but also to actively shape generation policies towards objectives of semantic accuracy.
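The RL step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the clipping range of 0.2, and the choice to multiply the accuracy and format rewards are all hypothetical. The group-relative advantage follows the usual GRPO recipe of standardizing rewards within a group of responses sampled for the same prompt.

```python
import math
from statistics import mean, pstdev
from typing import List


def combined_reward(judgment_matches_gold: bool, well_formatted: bool) -> float:
    """Binary reward. ASSUMPTION: accuracy and format rewards are multiplied,
    so a response scores 1.0 only if it is both correct and well-formed."""
    accuracy = 1.0 if judgment_matches_gold else 0.0
    fmt = 1.0 if well_formatted else 0.0
    return accuracy * fmt


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new: float, logp_old: float,
                      advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for a single sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Usage: four critic responses sampled for one (NL statement, Lean code) pair.
rewards = [
    combined_reward(True, True),
    combined_reward(False, True),
    combined_reward(True, True),
    combined_reward(True, False),
]
advantages = group_relative_advantages(rewards)
```

Because the rewards are binary, the advantage within a group is positive for responses that are both correct and well-formed and negative for the rest, which is what pushes the policy toward reliable judgments.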
3. Benchmarking: CriticLeanBench
CriticLeanBench is a 500-example benchmark specifically constructed to isolate and test critic reasoning—namely, the ability to distinguish semantically correct from incorrect Lean 4 code. The construction pipeline consists of four stages:
- Compilation of 500 NL/Lean code pairs drawn from diverse sources such as Omni-MATH and AIME.
- Compiler-based filtering, which yields 250 pairs that pass the Lean 4 compiler and 50 pairs automatically flagged for compilation failure.
- LLM-based filtering applied to the pairs that compile, tagging each as likely correct or incorrect.
- Human expert evaluation through stratified sampling across complexity, domain, and various error types (Premise-translation, Goal-translation, Type errors, Representation, etc.), ensuring balanced error representation.
Metrics for model evaluation on CriticLeanBench include accuracy (ACC) and the true/false positive and negative rates (TPR, FPR, TNR, FNR). The benchmark ensures granular coverage and is explicitly structured to reveal both syntactic and semantic errors overlooked by compiler-check-only approaches.
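The metrics listed above follow directly from a confusion matrix over the critic's binary judgments. The sketch below assumes that the positive class means "the formalization is correct", so TNR measures how often the critic rejects a wrong formalization; the `CriticMetrics` and `critic_metrics` names are hypothetical helpers, not part of any released evaluation code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CriticMetrics:
    acc: float
    tpr: float
    fpr: float
    tnr: float
    fnr: float


def _rate(numerator: int, denominator: int) -> float:
    """Safe division: returns 0.0 when the class is empty."""
    return numerator / denominator if denominator else 0.0


def critic_metrics(gold: Sequence[bool], pred: Sequence[bool]) -> CriticMetrics:
    """gold[i] / pred[i] are True when the formalization is (judged) correct."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    return CriticMetrics(
        acc=_rate(tp + tn, len(gold)),
        tpr=_rate(tp, tp + fn),
        fpr=_rate(fp, fp + tn),
        tnr=_rate(tn, fp + tn),
        fnr=_rate(fn, tp + fn),
    )


# Usage with a toy set of four judgments.
metrics = critic_metrics(
    gold=[True, True, True, False],
    pred=[True, True, False, False],
)
```

TNR is the figure to watch for a critic used as a filter: a low TNR means semantically wrong formalizations slip through even when overall accuracy looks high.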
4. Data: FineLeanCorpus
FineLeanCorpus consists of 285,957 NL-to-Lean 4 formalization pairs, making it the largest open-source dataset of its kind. Key characteristics include:
- Domain Diversity: Problems span high-school and undergraduate sources (AoPS, BlueMO, Omni-MATH), across Algebra, Number Theory, Geometry, Combinatorics, Calculus, Discrete and Applied Mathematics.
- Difficulty Distribution: Multimodal across difficulty levels 1–10, with 11.1% rated at “top-tier” (≥6). The “diamond” subset includes 36,033 high-difficulty (rating > 5) examples for advanced evaluation and training.
- Quality Assurance: Each example generated is subject to an autoformalization loop incorporating the CriticLean critic, with systematic spot-checks by qualified human annotators. Human-validated accuracy rates by source are reported (e.g., 84% for Omni-MATH, 96% for IneqMath).
FineLeanCorpus thus serves both as the foundation for supervised and RL training and as a high-quality resource for further research in automated formalization.
5. Experimental Results
Critic Performance
On CriticLeanBench, CriticLeanGPT (Qwen3-32B-RL) achieves ACC = 87.0% and TNR = 85.6%, outperforming baseline open-source models such as QwQ-32B (ACC = 86.4%, TNR = 79.2%) and coming close in overall accuracy to Gemini-2.5-Pro (ACC = 89.2%, TNR = 82.8%) (Peng et al., 8 Jul 2025). Incorporating RL and mixed SFT yields a 5–10 percentage point increase in TNR compared to SFT-only instruct models in the Qwen2.5 family.
End-to-end Autoformalization
Yield across different configurations on human-verified correctness:
- Baseline single-pass (Kimina-7B): 38.0%
- Compiler-loop only: 54.0%
- Full CriticLean pipeline: 84.0%
Success rate as a function of the number of sampling attempts shows the steepest gain within the first 10 attempts (34.0%) and reaches 52.8% at 200 attempts, beyond which returns diminish.
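The multi-attempt pipeline compared above can be sketched as a simple generate, compile, critique loop. This is an illustrative outline only: `generate`, `compiles`, and `critic_accepts` are hypothetical callables standing in for the autoformalizer, the Lean 4 compiler check, and the critic model, and the default budget of 200 attempts simply mirrors the figure quoted in the text.

```python
from typing import Callable, Optional, Tuple


def formalize_with_critic(
    statement: str,
    generate: Callable[[str, int], str],         # hypothetical: NL statement, attempt index -> Lean code
    compiles: Callable[[str], bool],             # hypothetical: wraps a Lean 4 compiler check
    critic_accepts: Callable[[str, str], bool],  # hypothetical: semantic critic on (NL, Lean)
    max_attempts: int = 200,
) -> Optional[Tuple[str, int]]:
    """Return the first candidate that compiles AND passes the critic,
    together with the attempt number, or None if the budget runs out."""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(statement, attempt)
        if not compiles(candidate):
            continue  # syntactic failure: resample
        if critic_accepts(statement, candidate):
            return candidate, attempt
        # compiled but judged semantically unfaithful: resample
    return None


# Usage with toy stand-ins for the three components.
result = formalize_with_critic(
    "toy statement",
    generate=lambda s, i: f"cand{i}",
    compiles=lambda code: int(code[4:]) % 2 == 0,
    critic_accepts=lambda s, code: code == "cand4",
)
```

Dropping the `critic_accepts` check reduces this to the compiler-loop-only configuration, which accepts the first candidate that compiles regardless of whether it matches the original statement.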
6. Ablation Studies and Analysis
Data modality mixing: Supervised fine-tuning that mixes critic data with code and math corpora at a 1:3 ratio (versus critic-only data) improves ACC on Qwen2.5-32B from 71.0% to 76.2%.
Dataset size and scaling: Smaller models (7B) maintain steady gains up to 48,000 examples, whereas larger models tend to plateau before the full dataset is used. Model performance on CriticLeanBench scales positively with model size (7B → 32B). Pass@32 yields an additional 3–5 percentage point gain in ACC, and the greatest marginal increase in TNR occurs after SFT+RL.
Error distributions: Lower-capacity models are dominated by type and syntax errors; semantically induced failures (Premise/Goal mistranslation) increase in prominence for higher-capacity, SFT+RL-tuned models.
7. Limitations and Prospective Directions
- Formalizable problem recall: Even after 200 attempts per statement, CriticLean’s pipeline reaches ≈53% recall for truly formalizable problems, reflecting a bottleneck in candidate generation diversity rather than critic capability.
- Computational cost: Multi-attempt RL loops and evaluation by large critic models introduce latency.
- Reward structure: Current binary accuracy and format rewards lack granularity for capturing subtle semantic differences; future work could explore graded/differentiable reward signals.
- Proof script extension: CriticLean currently judges only statements; integration of proof-step validation (e.g., using proof-assistant traces) could further improve reliability.
- Cross-prover generalization: Generalizing the critic-guided reinforcement framework to other proof assistants (e.g., Isabelle, Coq) and multi-prover setups is an open direction.
A plausible implication is that critic-guided RL frameworks analogous to CriticLean may provide robust semantically focused correction mechanisms in other formalization and automated reasoning domains.
Summary Table: CriticLean Components
| Component | Role/Content | Distinctive Features |
|---|---|---|
| CriticLeanGPT | Critic LLM trained via SFT + RL on Qwen2.5/Qwen3 backbones | Provides active, gradient-based feedback |
| CriticLeanBench | 500-example benchmark for critic reasoning | Stratified for semantic, not just syntactic, errors |
| FineLeanCorpus | 285,957 NL→Lean 4 dataset with broad domain/difficulty | Largest open-source Lean 4 formalization collection |
CriticLean establishes that critic-guided RL is essential for reliable, semantically faithful mathematical formalization, with marked improvements over previous purely generative or compiler-check-based approaches (Peng et al., 8 Jul 2025).