CriticLeanBench: Semantic Verification Benchmark
- CriticLeanBench is a benchmark evaluating semantic fidelity by testing whether Lean 4 code accurately formalizes natural math statements.
- It employs a multi-stage methodology with syntactic filtering, automated LLM judgment, and human verification to distinguish correct from flawed formalizations.
- Quantitative metrics like accuracy, TPR, and TNR highlight its role in refining RL-trained critic models and advancing automated theorem proving.
CriticLeanBench is a rigorously constructed evaluation benchmark introduced to assess the ability of LLMs to verify semantic fidelity in the autoformalization of natural language mathematical statements into Lean 4 code. As part of the CriticLean framework, it plays a central role in advancing the state of formal mathematical reasoning by providing a testbed for distinguishing between semantically correct and flawed formalizations, facilitating improvements in both automated theorem proving and the design of critic models for mathematical translation tasks (Peng et al., 8 Jul 2025).
1. Purpose, Scope, and Construction
CriticLeanBench was developed to address a gap not covered by prior benchmarks: ensuring that the critiquing phase in autoformalization workflows evaluates not just syntactic correctness (e.g., Lean compiler pass) but—more critically—semantic fidelity (i.e., preservation of the original mathematical intent). The benchmark comprises 500 carefully curated problem–formalization pairs, evenly split between 250 semantically correct and 250 semantically flawed Lean formalizations.
The construction pipeline draws on multi-stage validation (a hedged code sketch follows this list):
- Syntactic Filtering: Candidate Lean code must pass the Lean 4 compiler.
- Automated LLM Judgment: Pairs are processed using DeepSeek R1 with a templated protocol to classify semantic correctness (as described in the formal verification appendix).
- Human Verification: Human annotators further validate whether each formalization accurately preserves the semantics of the original problem statement.
This layered design ensures that CriticLeanBench captures a wide range of error types, testing not only robustness to syntactic issues but, more importantly, sensitivity to semantic errors such as premise mistranslation or incorrectly formalized goals.
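As a rough illustration, the following Python sketch mirrors this three-stage curation; the helper callables (`compiles_with_lean4`, `llm_judge`, `human_review`) and the agreement rule are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

@dataclass
class CandidatePair:
    problem: str     # natural-language mathematical statement
    lean_code: str   # candidate Lean 4 formalization

def curate_benchmark(
    candidates: Iterable[CandidatePair],
    compiles_with_lean4: Callable[[str], bool],    # stage 1: Lean 4 compiler check
    llm_judge: Callable[[str, str], bool],         # stage 2: templated LLM verdict (e.g., DeepSeek R1)
    human_review: Callable[[str, str], bool],      # stage 3: annotator verdict
) -> List[Tuple[CandidatePair, bool]]:
    """Hypothetical three-stage filter in the spirit of the CriticLeanBench pipeline:
    keep only compiling code, label semantics automatically, then retain pairs whose
    label is confirmed by a human annotator (True = semantically correct)."""
    benchmark = []
    for pair in candidates:
        if not compiles_with_lean4(pair.lean_code):   # syntactic filtering
            continue
        llm_label = llm_judge(pair.problem, pair.lean_code)
        human_label = human_review(pair.problem, pair.lean_code)
        if llm_label == human_label:                  # keep only agreed labels (assumption)
            benchmark.append((pair, human_label))
    return benchmark
```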
2. Evaluation Metrics and Protocol
CriticLeanBench uses a suite of quantitative metrics to measure model performance in discerning semantic correctness:
- Accuracy (ACC): The overall fraction of correctly classified instances.
- True Positive Rate (TPR): The proportion of semantically correct formalizations correctly identified.
- True Negative Rate (TNR): The proportion of flawed formalizations accurately rejected.
- False Positive Rate (FPR)/False Negative Rate (FNR): Measures of over-accepting faulty formalizations or missing correct ones.
Additionally, Pass@k metrics (for several values of k) are reported, evaluating whether the top-ranked candidates from multiple outputs contain a semantically correct formalization. Inputs in CriticLeanBench are long, with text lengths of approximately 495 to 1,583 tokens, reflecting practical, real-world formalization challenges.
These metrics collectively test both the ability to recognize semantic matches and the reliability of error detection, ensuring an objective appraisal of critic models; a minimal sketch of how they can be computed follows.
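For reference, here is a minimal Python sketch of these computations over binary semantic labels (1 = correct, 0 = flawed); the Pass@k reading below (at least one of the first k sampled verdicts is correct) is an illustrative assumption, not the paper's exact definition.

```python
def confusion_metrics(y_true, y_pred):
    """ACC, TPR, TNR, FPR, FNR for binary semantic-correctness classification."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "ACC": (tp + tn) / len(y_true),
        "TPR": tp / (tp + fn) if tp + fn else 0.0,  # correct formalizations accepted
        "TNR": tn / (tn + fp) if tn + fp else 0.0,  # flawed formalizations rejected
        "FPR": fp / (tn + fp) if tn + fp else 0.0,  # flawed formalizations over-accepted
        "FNR": fn / (tp + fn) if tp + fn else 0.0,  # correct formalizations missed
    }

def pass_at_k(per_instance_hits, k):
    """Fraction of instances where any of the first k sampled verdicts is correct."""
    return sum(any(hits[:k]) for hits in per_instance_hits) / len(per_instance_hits)

# Toy usage with made-up labels and verdicts.
print(confusion_metrics([1, 1, 0, 0], [1, 0, 0, 1]))
print(pass_at_k([[False, True], [True, True], [False, False]], k=2))
```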
3. Comparative Performance and Model Variants
Empirical results demonstrate that CriticLeanBench differentiates sharply between critic models of varying training regimens and capacities:
- CriticLeanGPT (trained with RL or supervised fine-tuning): Consistently achieves superior accuracy and error detection rates compared to strong open-source and closed-source models (e.g., Gemini-2.5-Pro, QwQ-32B, Qwen3-32B-RL).
- Supervised vs. RL-trained Critics: RL-refined critic models exhibit higher reliability, particularly on negative cases requiring subtle semantic discernment.
- Baselines: Even leading baseline models—though often strong in overall accuracy—are outperformed in both TNR and Pass@k when compared against models specialized in semantic critique through targeted critic data training.
A representative result table:
| Model | Accuracy (%) | TPR | TNR |
|---|---|---|---|
| Qwen3-32B-RL | ~88 | High | Best |
| Gemini-2.5-Pro | ~86–87 | High | High |
| QwQ-32B | ~86 | High | High |
(Exact percentages and rankings are reported in the main text; see the original table in (Peng et al., 8 Jul 2025).)
The superior performance of CriticLean-trained critics is attributed to direct optimization not merely for syntax/compilation, but for deep semantic fidelity, leveraging both the CriticLeanInstruct dataset and RL-based objectives.
4. Integration within the CriticLean Framework
Within the CriticLean framework for critic-guided reinforcement learning of autoformalization, CriticLeanBench serves as the authoritative evaluation standard:
- Feedback Loop: During autoformalization, candidate Lean code is iteratively refined based on feedback from both the Lean compiler and CriticLeanGPT.
- Critic Role: The critic operates not just as a binary filter but as an RL-trained signal provider guiding the model to maximize semantic agreement, as measured by CriticLeanBench outcomes.
- RL Training Objective: Training uses an objective that combines a clipped reward term (scoring agreement with the CriticLeanBench ground-truth label and output-format conformance) with KL-regularization; a hedged sketch of such an objective is given after this list.
- Benchmark Role: CriticLeanBench constitutes the "ground truth" bedrock for validating the effectiveness and generalization of the trained critic, tightly connecting critic improvements to downstream gains in formalization quality.
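The paper's exact objective is not reproduced here. As a hedged sketch, a standard PPO-style clipped surrogate with KL regularization toward a reference policy, with the advantage derived from a scalar reward combining label agreement and format conformance, could take the following form (notation is ours, not necessarily the paper's):

```latex
% Hedged sketch of a generic clipped RL objective with KL regularization.
\mathcal{J}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)}
    \left[ \min\!\left( \rho_\theta \hat{A},\;
      \operatorname{clip}\!\left(\rho_\theta,\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A} \right) \right]
  - \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right),
\qquad
\rho_\theta = \frac{\pi_\theta(y \mid x)}{\pi_{\theta_{\mathrm{old}}}(y \mid x)},
```

where the advantage $\hat{A}$ is computed from a reward $r = r_{\text{label}} + r_{\text{format}}$ that scores agreement with the ground-truth semantic label and conformance to the required output format.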
5. Dataset Characteristics and Error Taxonomy
CriticLeanBench is built in conjunction with FineLeanCorpus, an extensive dataset of over 285,000 problems derived from various sources (AoPS, Olympiad benchmarks, undergraduate challenges), ensuring diversity in domain and difficulty. Crucially, all CriticLeanBench examples are further curated:
- Syntactic Errors: Type and syntax errors are filtered pre-benchmark.
- Semantic Errors: The predominant focus is on capturing premise translation errors, goal mismatches, and logical inconsistencies, comprehensively categorized in the paper's error taxonomy; a toy Lean illustration follows below.
- Error Diversity: Error type distribution is reported, enabling researchers to analyze model robustness to specific classes of semantic failures.
This careful curation ensures that CriticLeanBench is not only representative but also sufficiently challenging for model evaluation.
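To make the semantic-error category concrete, the following illustrative Lean 4 pair (our own toy example, not drawn from the benchmark) shows how a formalization can compile yet fail semantic verification; proofs are elided with `sorry` because only the statement is judged for fidelity.

```lean
import Mathlib

-- Informal problem: "Show that every positive integer n satisfies n^2 ≥ n."

-- Semantically faithful statement: the positivity premise and the ≥ goal are preserved.
theorem sq_ge_self_correct (n : ℕ) (hn : 0 < n) : n ^ 2 ≥ n := by
  sorry

-- Compiles as a statement, but is semantically flawed: the premise 0 < n is dropped
-- and the goal is strengthened to a strict inequality, so the formalization no longer
-- expresses the original problem (indeed it is false for n = 0 and n = 1).
theorem sq_ge_self_flawed (n : ℕ) : n ^ 2 > n := by
  sorry
```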
6. Significance and Future Impact
The rigorous design of CriticLeanBench highlights several important directions for formal reasoning and the automated theorem proving community:
- Semantic Emphasis: Elevates the critic from a passive validator to an active, semantically sensitive assessor, promoting a new standard beyond pure syntactic checks.
- Transferable Blueprint: The multi-level validation and strict semantic focus may serve as a blueprint for benchmarks in other formal languages and domains (such as Coq, Isabelle, or TPTP).
- Advancing Critic Models: By profiling detailed error types and performance breakdowns, CriticLeanBench provides actionable signals for the iterative refinement of critic models—particularly those trained via reinforcement learning or dedicated instruction data.
- Broad Applicability: The benchmark is expected to influence the development of education tools, verification pipelines, and more reliable autoformalization systems.
- Self-Improvement Paradigm: By directly linking critic reliability to improved generative outputs, the CriticLeanBench methodology may foster further research into autonomous self-improvement techniques within LLM architectures.
In summary, CriticLeanBench constitutes a pivotal evaluation resource for semantic verification in mathematical formalization, providing the methodological rigor and diagnostic specificity required to advance critic-based approaches in automated reasoning (Peng et al., 8 Jul 2025).