CriticLeanGPT: Semantic Critic for Lean 4
- CriticLeanGPT is a domain-specific LLM and evaluation framework that actively refines Lean 4 formalizations by aligning natural language math with formal code.
- It employs a two-stage process combining Supervised Fine-Tuning and GRPO-based reinforcement learning to ensure both semantic fidelity and syntactic correctness.
- The framework leverages the CriticLeanBench benchmark and the extensive FineLeanCorpus to deliver robust error detection and iterative guidance in automated formal reasoning.
CriticLeanGPT is a domain-specific LLM and evaluation framework developed for the semantic assessment of Lean 4 mathematical formalizations, consolidating research trends in critic-guided learning, critique-based reinforcement learning, and benchmark-driven LLM evaluation. Within the broader context of automated theorem proving and semantic alignment for code and mathematical reasoning, CriticLeanGPT is the centerpiece of the CriticLean framework: it elevates the critic phase from a passive verifier to an active learning component that supervises, corrects, and iteratively refines formalizations, thereby enhancing both their syntactic validity and semantic fidelity (Peng et al., 8 Jul 2025).
1. Framework Structure and Training Paradigm
CriticLeanGPT is architected as an instruction-tuned model drawn from the Qwen LLM series, specifically tailored for the critical evaluation of Lean 4 formalizations derived from natural language mathematical statements. Its operation is realized within a two-stage critic-guided reinforcement learning pipeline:
- Supervised Fine-Tuning (SFT): The model is first trained on the CriticLeanInstruct dataset, mixing math, code, and critic data. SFT ensures that CriticLeanGPT attains baseline alignment between natural language statements and Lean 4 code, encompassing common mathematical transformations and domain conventions.
- Reinforcement Learning with GRPO (VeRL framework): After SFT, the model undergoes RL optimization using the Group Relative Policy Optimization (GRPO) algorithm. Each output receives a discrete reward for semantic judgment (agreement with the human-labeled ground truth) and a discrete reward for syntactic correctness (proper Lean 4 code format). The final reward is the logical conjunction

$$R = R_{\text{semantic}} \wedge R_{\text{format}},$$

ensuring that both semantic and syntactic targets are met. The RL objective combines clipped, group-relative policy improvement with Kullback–Leibler (KL) regularization toward the reference policy, in the standard GRPO form

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}\,\hat{A}_i,\ \operatorname{clip}\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\,1-\epsilon,\,1+\epsilon\right)\hat{A}_i\right)\right] - \beta\,\mathbb{D}_{\text{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right],$$

where $\hat{A}_i$ is the advantage of the $i$-th sampled response, obtained by normalizing its reward against the group mean and standard deviation.
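A minimal sketch of this conjunction reward and of the group-relative advantage normalization that GRPO applies on top of it (illustrative helpers under assumed boolean inputs, not the released training code):

```python
import statistics

def conjunction_reward(predicted_verdict: bool,
                       gold_verdict: bool,
                       output_well_formatted: bool) -> float:
    """Reward 1.0 only when the critic's aligned/misaligned verdict matches the
    human label AND the response follows the required Lean 4 output format."""
    semantic_ok = predicted_verdict == gold_verdict
    return 1.0 if (semantic_ok and output_well_formatted) else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: standardize rewards within a sampled group of
    rollouts so each response is scored relative to its siblings."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid division by zero on uniform groups
    return [(r - mean) / std for r in rewards]
```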
Semantic fidelity feedback from CriticLeanGPT is integrated into an iterative process, vetting and guiding successive autoformalization attempts.
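A sketch of that iterative loop, with `formalize` and `critique` standing in as placeholder callables for the autoformalizer and CriticLeanGPT respectively (the dict-shaped verdict is an assumption for illustration, not the released interface):

```python
def critic_guided_autoformalization(statement, formalize, critique, max_rounds=5):
    """Iterative refinement (sketch): propose a Lean 4 formalization, have the
    critic vet it, and feed the critique back into the next attempt."""
    feedback = None
    for _ in range(max_rounds):
        lean_code = formalize(statement, feedback)   # next candidate, conditioned on prior critique
        verdict = critique(statement, lean_code)     # assumed to return {"verdict": ..., "commentary": ...}
        if verdict["verdict"] == "aligned":
            return lean_code                         # accepted as semantically faithful
        feedback = verdict.get("commentary")         # revision signal for the next round
    return None                                      # no faithful candidate within the budget
```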
2. Semantic Fidelity Assessment: Methodology and Implementation
CriticLeanGPT is designed to rigorously assess the semantic alignment between the original natural language mathematical statement and its Lean 4 translation. The evaluation proceeds in the following steps:
- Syntactic Filtering: The candidate code is first passed through the Lean 4 compiler to ensure basic correctness.
- Template-Guided Decomposition: CriticLeanGPT’s assessment template (specified in the supplementary material) guides a structured breakdown of the mathematical content, extracting and matching conditions (e.g., quantifiers, goals, logical connectives) across natural and formal languages.
- Judgment Generation: The model issues a binary verdict (semantically aligned / misaligned) and, optionally, a structured commentary on error type (e.g., goal misalignment, omitted conditions).
- Reward Assignment: In RL, reward is only given if both the semantic and format judgments are correct, reinforcing careful, precise critique.

This approach ensures that the model does not merely detect superficial or syntactic errors but meaningfully engages with the mathematical semantics, identifying subtle translation errors that could elude traditional compilation-based validation.
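A schematic of this critic pass, assuming a generic Lean toolchain check and a chat-style generation interface (the template text, `lean_compiles`, and `critic_generate` are illustrative placeholders rather than the released prompts or APIs):

```python
import re

ASSESSMENT_TEMPLATE = """You are a semantic critic for Lean 4 formalizations.
Natural language statement:
{statement}

Lean 4 formalization:
{lean_code}

Compare quantifiers, hypotheses, goals, and logical connectives.
Answer with VERDICT: ALIGNED or VERDICT: MISALIGNED, followed by a brief
explanation of any error type (goal misalignment, omitted condition, ...)."""

def assess_formalization(statement, lean_code, lean_compiles, critic_generate):
    """Sketch of the critic pass: syntactic filtering, template-guided
    decomposition, and a binary semantic verdict with commentary."""
    # 1. Syntactic filtering: reject code the Lean 4 compiler cannot elaborate.
    if not lean_compiles(lean_code):
        return {"verdict": "rejected", "commentary": "compilation failure"}

    # 2. Template-guided decomposition and 3. judgment generation by the critic LLM.
    response = critic_generate(ASSESSMENT_TEMPLATE.format(
        statement=statement, lean_code=lean_code))

    # Parse the binary verdict; anything else counts as a format failure.
    match = re.search(r"VERDICT:\s*(ALIGNED|MISALIGNED)", response)
    if match is None:
        return {"verdict": "format_error", "commentary": response}
    return {"verdict": match.group(1).lower(), "commentary": response}
```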
3. Benchmarking: CriticLeanBench and Evaluation Metrics
CriticLeanBench is introduced as a specialized benchmark to evaluate LLMs’ abilities to distinguish semantically correct from incorrect Lean 4 formalizations. Key characteristics include:
- Composition: 500 hand-curated natural language and Lean 4 pairs (250 correct, 250 incorrect), selected to cover a spectrum of error types, including premise translation errors, goal misalignment, and logical operator misuse.
- Metrics: The benchmark employs accuracy (ACC), true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR) to assess discrimination performance, reflecting not only raw correctness but nuanced error detection.
- Comparative Analysis: CriticLeanGPT, in its Qwen3-32B-RL variant, achieves an overall accuracy of approximately 87.0%, with a high TNR (85.6%) and a relatively low FNR (14.4%), surpassing both open-source (e.g., QwQ-32B, DeepSeek-R1) and closed-source baselines (e.g., Gemini-2.5-Pro, the GPT-4 family).
Ablation studies show the necessity of both SFT and RL—models omitting either phase exhibit degraded ability to identify subtle semantic inaccuracies. Integration of multi-task data (code and mathematical reasoning) further enhances robustness and coverage.
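The discrimination metrics reduce to standard confusion-matrix ratios over the binary aligned/misaligned labels; a minimal sketch of their computation (not the official evaluation script):

```python
def criticleanbench_metrics(predictions, labels):
    """Confusion-matrix metrics over binary labels: True = semantically aligned."""
    tp = sum(p and l for p, l in zip(predictions, labels))               # correctly accepted
    tn = sum((not p) and (not l) for p, l in zip(predictions, labels))   # correctly rejected
    fp = sum(p and (not l) for p, l in zip(predictions, labels))         # wrongly accepted
    fn = sum((not p) and l for p, l in zip(predictions, labels))         # wrongly rejected
    total = tp + tn + fp + fn
    return {
        "ACC": (tp + tn) / total,
        "TPR": tp / (tp + fn) if (tp + fn) else 0.0,
        "TNR": tn / (tn + fp) if (tn + fp) else 0.0,
        "FPR": fp / (fp + tn) if (fp + tn) else 0.0,
        "FNR": fn / (fn + tp) if (fn + tp) else 0.0,
    }
```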
4. FineLeanCorpus: Scale, Diversity, and Role
FineLeanCorpus serves as the primary training and evaluation corpus for CriticLeanGPT:
- Scope: Over 285,000 autoformalized mathematical problems, spanning high school and university competition domains with a wide distribution of lengths and complexity.
- Qualitative Range: The dataset encompasses problems as short as 9 tokens and as long as 3,000 tokens (mean ~78 tokens), with formalizations likewise varying substantially; specialized subsets (notably “Diamond,” 36,033 difficult problems) support high-difficulty evaluation and model stress-testing.
- Purpose: By exposing CriticLeanGPT to broad domain and structural variability, FineLeanCorpus mitigates overfitting and ensures practical utility across mathematical subfields and formalisms. The dataset structure enables stratified evaluation across error classes, semantic depth, and difficulty.
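Because the corpus carries domain, difficulty, and length metadata, such stratified slices can be drawn directly from it; the sketch below assumes hypothetical record fields (`domain`, `difficulty`, `statement_tokens`) rather than the released schema:

```python
from collections import defaultdict

def stratify(corpus, key=lambda ex: (ex["domain"], ex["difficulty"])):
    """Group FineLeanCorpus-style records into evaluation strata.

    `corpus` is an iterable of dicts; the field names used here are
    placeholders for whatever metadata the released corpus exposes.
    """
    strata = defaultdict(list)
    for example in corpus:
        strata[key(example)].append(example)
    return strata

# Example: isolate long, high-difficulty problems for stress-testing.
# hard_long = [ex for ex in corpus
#              if ex["difficulty"] == "hard" and ex["statement_tokens"] > 500]
```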
5. Comparative Performance and Impact
CriticLeanGPT demonstrably outperforms contemporary open-source and closed-source LLMs as a semantic critic for formal mathematics:
- Fine-tuning with RL (GRPO) and multi-task data correlates with marked gains, notably in true negative discrimination (the capacity to reject incongruent formalizations), while reducing error rates (particularly FNR) on subtle semantic mismatches.
- Iterative critic guidance, as implemented in CriticLean, increases the reliability of automated formalization, highlighting critical errors that would be invisible to compilation-based systems.
- Wider application is suggested by the plug-and-play nature of the critic phase, as it naturally generalizes across the formalization pipeline to guide multiple candidate solutions, functioning as both a quality filter and a source of revision signals.
- A plausible implication is that similar critic-phase models could enhance the semantic robustness of code synthesis, scientific data translation, and other formal reasoning applications requiring alignment beyond syntax.
6. Implications for Formal Reasoning and Future Research
The CriticLeanGPT approach positions the critic as a central, actively optimized element of formal mathematical reasoning automation:
- Semantic-Centric Verification: By prioritizing semantic fidelity alongside syntax, CriticLeanGPT offers a pathway to more trustworthy, interpretable, and ultimately autonomous theorem proving agents.
- Generalization Potential: While currently developed for Lean 4, the framework could inform critic-centric evaluation for other formal languages and domains where semantics are paramount (e.g., Coq, Isabelle, TLA+).
- Data and Human Feedback: Expansion of training data to encompass more mathematical subdomains and the integration of human feedback may further elevate performance, especially in areas where automated metrics are insufficient to capture subtle semantic details.
- Optimization Techniques: Combining reinforcement learning with gradient-based approaches, model ensembles, or additional hierarchical critic modules offers potential for improved fidelity and correction ability.
- Broad Applicability: The modular architecture is suited to generalize to other forms of structured data and may accelerate advances in automated code review, protocol verification, and scientific text formalization.
7. Conclusion
CriticLeanGPT exemplifies a shift in automated formalization pipelines from compilation-focused correctness to critic-guided semantic verification. Through supervised and reinforcement learning on large and diverse mathematical corpora, rigorous benchmark evaluation, and critic-centric iterative refinement, it sets a new standard for semantic assessment in autoformalization and signals promising future directions for research in LLM-guided formal reasoning, code verification, and high-stakes symbolic alignment tasks (Peng et al., 8 Jul 2025).