CriticLean Framework

Updated 11 July 2025

CriticLean Framework is a critic-guided reinforcement learning system that enhances semantic fidelity in translating natural language math problems into Lean 4 formalizations.
It combines supervised fine-tuning and reinforcement learning with a dedicated critic model to iteratively refine generated formal code.
The framework introduces benchmarks like CriticLeanBench and datasets such as FineLeanCorpus to set new standards for trustworthy automated formalization.

The CriticLean Framework is a critic-guided reinforcement learning system that advances the reliability and semantic fidelity of formal mathematical reasoning by elevating the role of critique from passive validation to active learning. Developed in the context of natural language-to-Lean 4 code translation, CriticLean integrates supervised fine-tuning, reinforcement learning, and a dedicated domain-specific critic (CriticLeanGPT) for improving autoformalization. It features a benchmark (CriticLeanBench) explicitly tailored to evaluation of semantic correctness and a large, high-quality dataset (FineLeanCorpus) for training and assessment (Peng et al., 8 Jul 2025). The framework embodies a shift towards optimizing the “critic phase” as central to the production of trustworthy mathematical formalizations.

1. System Architecture and Workflow

CriticLean operates as an iterative autoformalization pipeline, transforming a natural language mathematical statement into formal Lean 4 code while ensuring both syntactic correctness and semantic fidelity. The core workflow is as follows:

The autoformalization model generates an initial Lean 4 formalization from a given problem statement.
The candidate code is first checked for syntactic correctness by the Lean 4 compiler.
CriticLeanGPT—an LLM trained to evaluate semantic agreement—assesses the formalization’s faithfulness to the original problem’s intent.
Critique feedback is used, via reinforcement learning (RL), to refine and improve the model’s output over subsequent iterations.

This elevates critic feedback beyond mere passive validation. Critique signals directly influence the optimization and further training of the autoformalizer. The architecture can be diagrammed as follows:

[Input Problem] → [Autoformalization Model] → [Lean4 Compiler Check]
                                                    |
                                             [If Compiles, then]
                                                    ↓
                                         [CriticLeanGPT: Semantic Critique]
                                                    |
                                             [If Passes, then]
                                                    ↓
                                       [Final Formalized Lean 4 Code]
                                            ↑
                               [RL Update to Generator/Autoformalizer]
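The gating logic in the workflow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `formalize`, `Attempt`, and the three callables (`generate`, `compiles`, `critic_accepts`) are hypothetical names standing in for the autoformalization model, the Lean 4 compiler check, and CriticLeanGPT, and the `max_attempts` retry budget is an illustrative parameter.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    """One generated candidate and the outcome of both gates."""
    lean_code: str
    compiled: bool
    critic_passed: bool


def formalize(
    problem: str,
    generate: Callable[[str], str],              # hypothetical: autoformalization model
    compiles: Callable[[str], bool],             # hypothetical: Lean 4 compiler check
    critic_accepts: Callable[[str, str], bool],  # hypothetical: semantic critic
    max_attempts: int = 3,                       # illustrative retry budget
) -> tuple[Optional[str], list[Attempt]]:
    """Return the first candidate that passes both gates, plus the attempt log."""
    log: list[Attempt] = []
    for _ in range(max_attempts):
        code = generate(problem)
        compiled = compiles(code)
        # The critic is consulted only for candidates that compile.
        passed = compiled and critic_accepts(problem, code)
        log.append(Attempt(code, compiled, passed))
        if passed:
            return code, log
    return None, log
```

The attempt log records which gate each candidate failed; in a setup like the one described here, such outcomes are the kind of signal an RL update to the generator could consume.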

2. CriticLeanGPT: Training and Optimization

CriticLeanGPT is a domain-specific LLM responsible for judging semantic equivalence between natural language problems and their Lean 4 formalizations. Its training and optimization comprise:

Supervised Fine-Tuning (SFT): Initially trained on the CriticLeanInstruct dataset, in which each example combines a mathematical problem, its formalization, and an associated expert critique (either a correct/incorrect label or detailed explanatory feedback).
Reinforcement Learning (RL): Further refined via reward-based optimization—using R1-style algorithms such as Group Relative Policy Optimization (GRPO) within the VeRL framework. Rewards are based on critique accuracy against ground-truth judgments on semantic faithfulness.

The critic’s objective is to predict, for a given pair $(P, F)$ (problem $P$, formalization $F$), whether $F$ semantically matches $P$. The RL reward includes an accuracy term:

$r_\text{accuracy} = \begin{cases} 1 & \text{if the judge's output matches the ground truth} \\ 0 & \text{otherwise} \end{cases}$

and the policy optimization step follows:

$J_\text{online}(\pi_\theta) = \mathbb{E}_{x \sim \mathcal{D},\, \{y_i\}_{i=1}^{G} \sim \pi_{\theta_\text{old}}} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_\text{old}}(y_i \mid x)} A_i,\ \text{clip}\left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_\text{old}}(y_i \mid x)},\, 1 - \epsilon,\, 1 + \epsilon \right) A_i \right) - \beta \cdot \text{KL}(\pi_\theta \,\|\, \pi_\text{ref}) \right]$

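Here, $G$ is the number of outputs sampled per input $x$, $A_i$ denotes the advantage of output $y_i$, $\epsilon$ is the clipping range, $\beta$ is a regularization coefficient, and KL denotes the Kullback-Leibler divergence to a reference policy $\pi_\text{ref}$.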
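A small numerical sketch may help in reading the reward and the clipped objective. The text does not say how $A_i$ is computed; the code below assumes the usual GRPO convention of normalizing rewards within each sampled group, takes the KL term as a precomputed number, and uses illustrative defaults for `clip_eps` and `beta`. All function names are hypothetical.

```python
import math


def accuracy_reward(predicted_label: bool, true_label: bool) -> float:
    """r_accuracy: 1 if the critic's verdict matches the ground truth, else 0."""
    return 1.0 if predicted_label == true_label else 0.0


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Assumed GRPO-style advantages: rewards standardized within one group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_objective(
    logp_new: list[float],   # log pi_theta(y_i | x)
    logp_old: list[float],   # log pi_theta_old(y_i | x)
    advantages: list[float],
    kl: float,               # KL(pi_theta || pi_ref), supplied by the caller
    clip_eps: float = 0.2,   # illustrative value, not from the source
    beta: float = 0.01,      # illustrative value, not from the source
) -> float:
    """Clipped surrogate averaged over the group, minus the KL penalty."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms) - beta * kl
```

With a positive advantage, the clip caps how much credit a large probability ratio can earn; with a negative advantage, the `min` keeps the unclipped (more pessimistic) term, which is the standard behavior of this kind of surrogate.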

3. CriticLeanBench: Benchmarking Semantic Fidelity

To rigorously assess semantic evaluation, CriticLean introduces CriticLeanBench—a benchmark tailored to the nuances of formalization assessment:

Composition: 500 problem pairs (250 semantically correct, 250 semantically incorrect), thoroughly validated by a combination of compiler checks, LLM-based verifiers (e.g., DeepSeek-R1), and human domain experts.
Coverage: Spans a broad range of semantic error types, including logical omissions, translation mistakes, and subtle misformulations.
Metrics: Includes Accuracy (ACC), True Positive Rate (TPR), False Positive Rate (FPR), True Negative Rate (TNR), and False Negative Rate (FNR), providing granular insight into critique capability.
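To make the logical-omission error type concrete, here is a hypothetical pair of Lean 4 statements (not taken from CriticLeanBench) for the problem "Show that x + 1/x ≥ 2 for every positive real x." Both are expected to type-check against Mathlib, with `sorry` standing in for the proofs, but only the first keeps the positivity hypothesis, so a compiler check alone would not separate them. The theorem names are illustrative.

```lean
import Mathlib

-- Faithful: keeps the hypothesis that x is positive.
theorem faithful (x : ℝ) (hx : 0 < x) : x + 1 / x ≥ 2 := by
  sorry

-- Unfaithful: drops the positivity hypothesis. The statement still
-- type-checks, but it is false (for example at x = -1), so it does not
-- express the original problem.
theorem unfaithful (x : ℝ) : x + 1 / x ≥ 2 := by
  sorry
```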

Empirical evaluation demonstrates that CriticLeanGPT models achieve up to 87.0% ACC and 85.6% TNR, notably surpassing both open- and closed-source baselines in distinguishing semantic errors.
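The five metrics can be computed from a critic's verdicts with a simple confusion-matrix tally. The sketch below assumes that "positive" means "the formalization is semantically correct", so TNR is the share of incorrect formalizations that the critic rejects; the source does not spell out this convention, and `critic_metrics` is a hypothetical helper.

```python
def critic_metrics(predictions: list[bool], labels: list[bool]) -> dict[str, float]:
    """Confusion-matrix metrics; True means 'semantically correct' (assumed convention)."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)
    tn = sum(1 for p, l in zip(predictions, labels) if not p and not l)
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)
    fn = sum(1 for p, l in zip(predictions, labels) if not p and l)

    def rate(numerator: int, denominator: int) -> float:
        # Return 0.0 when a class is absent, to avoid division by zero.
        return numerator / denominator if denominator else 0.0

    return {
        "ACC": (tp + tn) / len(labels),
        "TPR": rate(tp, tp + fn),
        "FNR": rate(fn, tp + fn),
        "TNR": rate(tn, tn + fp),
        "FPR": rate(fp, tn + fp),
    }
```

On a balanced benchmark, ACC is the average of TPR and TNR, which is one reason to report them separately: a critic that accepts almost everything can post a high TPR while missing most semantic errors.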

4. Corpus Development and Data Diversity

FineLeanCorpus underpins CriticLean’s training and evaluation with a large, diverse dataset of autoformalized mathematical problems:

Scale and Diversity: Over 285,000 problem–formalization pairs spanning Algebra, Geometry, Number Theory, Combinatorics, and Calculus, capturing a broad spectrum of complexity and problem length (problem tokens: 9–2,980; Lean code: 14–768).
Quality Filtering: Data passes through three phases: (i) autoformalization, (ii) a Lean 4 compiler check for syntax, and (iii) semantic validation by CriticLeanGPT, with only high-confidence samples retained.
Iterative Regeneration: Problems failing semantic fidelity are regenerated, raising the pass rate from 38% in single-pass systems to as high as 84% after the full pipeline.
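The filtering and regeneration steps can be sketched as a corpus-building loop that also reports the single-pass and full-pipeline pass rates. Everything here is an assumption made for illustration: `build_corpus` and its callables are hypothetical, the critic is modeled as returning a confidence score, and the `threshold` and `max_rounds` values are not from the source. Problems are assumed to be unique strings.

```python
from typing import Callable, Iterable


def build_corpus(
    problems: Iterable[str],
    generate: Callable[[str, int], str],             # hypothetical: (problem, round) -> Lean code
    compiles: Callable[[str], bool],                 # hypothetical: Lean 4 compiler check
    critic_confidence: Callable[[str, str], float],  # hypothetical: critic score in [0, 1]
    threshold: float = 0.9,                          # illustrative confidence cutoff
    max_rounds: int = 3,                             # illustrative regeneration budget
) -> tuple[dict[str, str], float, float]:
    """Return kept pairs, the single-pass pass rate, and the full-pipeline pass rate."""
    problem_list = list(problems)
    kept: dict[str, str] = {}
    first_round_passes = 0
    for problem in problem_list:
        for round_idx in range(max_rounds):
            code = generate(problem, round_idx)
            if not compiles(code):
                continue  # syntax failure: regenerate
            if critic_confidence(problem, code) >= threshold:
                kept[problem] = code
                if round_idx == 0:
                    first_round_passes += 1
                break  # high-confidence sample retained
    n = max(len(problem_list), 1)
    return kept, first_round_passes / n, len(kept) / n
```

Comparing the two returned rates mirrors the comparison reported above between single-pass generation and the full regeneration pipeline.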

This corpus both trains CriticLeanGPT and serves as a testbed for assessing advances in autoformalization.

5. Experimental Results and Comparative Performance

Evaluation against several strong baselines (Gemini-2.5-Pro, Qwen3-series, DeepSeek-R1, and a vanilla Lean pipeline) establishes:

Accuracy Gains: The Qwen3-32B-RL variant of CriticLeanGPT achieves an ACC of 87.0% versus significantly lower figures for competitors, showing a substantial advance for critic-guided autoformalization.
Semantic Detection: Higher TNR and lower FNR indicate enhanced capability in detecting semantic mismatches rather than mere syntactic flaws.
Regeneration Success: In Pass@k (probability that at least one among k generations is correct), CriticLean-tuned pipelines outperform their non-critic counterparts, producing formalizations with higher genuine mathematical validity.
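Pass@k is commonly estimated from n sampled generations, of which c are correct, using the unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ that is standard in code-generation evaluation. The source does not say which estimator was used, so the sketch below is only one reasonable way to compute the quantity; `pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct) from n samples with c correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

In this setting, "correct" would mean that a generation both compiles and is judged semantically faithful, so the critic's verdicts determine c.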

These results highlight the enabling role of critic-guided reinforcement learning for trustworthy mathematical code generation and verification.

6. Impact, Significance, and Future Directions

The CriticLean Framework demonstrates that a dedicated, RL-optimized critic phase is essential for faithful, reliable mathematical formalization in AI systems. Its key contributions are:

Semantic Evaluation as Core: Treating semantic verification as a first-class phase (not an afterthought) leads to higher reliability in formal reasoning tasks.
Scalable Critic Training: The pipeline leverages SFT and RL, harnessing both curated and augmented examples for robust critic model training.
Resource Creation: FineLeanCorpus and CriticLeanBench establish new standards and testbeds for research.
Practical Improvements: By iteratively improving both formalization models and their critics, real-world mathematical autoformalization systems benefit from higher accuracy and robustness.
Research Directions: Possible future work includes scaling the critic to address subtler semantic distinctions, more continuous generator–critic feedback, and refining RL reward granularity to capture additional aspects of proof quality.

CriticLean sets a precedent for AI frameworks in formal mathematics by tightly integrating critic-guided learning with large-scale, semantically targeted evaluation (Peng et al., 8 Jul 2025).

References (1)

CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization (2025)
