- The paper introduces CovCal, a method that enforces selective risk control by calibrating formal coverage thresholds in Lean-based mathematical reasoning.
- The paper demonstrates that when formal coverage or formal margin is high, the selected Lean-proved answer is correct 96-98% of the time, whereas low coverage or a negative margin drops accuracy to 10-20%.
- The study highlights the need for improved autoformalization fidelity and proposes calibrated risk strategies to ensure safe verification of natural-language math answers.
Risk-Controlled Lean-as-Judge for Selective Natural-Language Mathematical Reasoning
Problem Statement and Motivation
The paper "Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning" (2605.28365) addresses the reliability of using formal proof assistants—specifically Lean—as partial judges in pipelines designed to solve mathematical reasoning problems stated in natural language. The dominant paradigm leverages autoformalization models to convert candidate answers into Lean theorems, asking the proof assistant to validate or reject each. While a Lean proof provides strong evidence that the associated formalized answer is correct for the formalized statement, failure to prove is fundamentally ambiguous: it may indicate an incorrect answer, a poorly stated formalization, missing background knowledge, or merely a search failure. This ambiguity motivates rigorous calibration procedures to control the risk of erroneous acceptance based on incomplete formal evidence.
A primary empirical finding established by the authors is the pronounced "coverage cliff": the reliability of accepting Lean-verified answers is highly sensitive to the proportion of candidate answer mass that (a) is successfully formalized, and (b) receives a proof from Lean. At high formal coverage, selecting the most-supported, Lean-proved answer is overwhelmingly reliable. At low formal coverage, however, the same procedure yields much lower correctness.
To enforce rigorous, finite-sample control over the selective risk for accepting answers based on partial formalization, the authors introduce CovCal—a selector that incorporates metrics over the formalization coverage log and abstains unless pre-specified coverage criteria are met. The candidate answer pipeline decomposes the candidates into equivalence classes (e.g., numerically or algebraically equivalent answers), weighs these by their support (such as self-consistency frequencies), and attempts to formalize and verify each using Lean. The formalization process produces a status for each answer class: proved, typechecked, ill-typed, timeout, or unformalized.
CovCal computes diagnostics including:
- Typed coverage: total probability mass of answer classes that successfully produce a Lean statement,
- Proved coverage: total mass of classes for which a proof is found and kernel-checked,
- Formal margin: difference in support between the highest-weight proved class and the strongest unresolved class.
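The three diagnostics above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `AnswerClass` record, the `coverage_diagnostics` function, the choice to count only `proved` and `typechecked` classes as "typed", and the convention of a negative margin when nothing is proved are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class AnswerClass:
    """Hypothetical record for one equivalence class of candidate answers."""
    answer: str
    weight: float  # support, e.g. self-consistency frequency
    status: str    # proved | typechecked | ill-typed | timeout | unformalized


# Assumption: only these statuses count as "produced a Lean statement".
TYPED_STATUSES = {"proved", "typechecked"}


def coverage_diagnostics(classes: Sequence[AnswerClass]) -> Dict[str, float]:
    """Compute typed coverage, proved coverage, and formal margin."""
    total = sum(c.weight for c in classes)
    if total <= 0:
        raise ValueError("total support must be positive")

    typed = sum(c.weight for c in classes if c.status in TYPED_STATUSES)
    proved = sum(c.weight for c in classes if c.status == "proved")

    # Margin: strongest proved class minus strongest class that is not proved.
    # With no proved class the margin is negative (assumed convention).
    top_proved = max((c.weight for c in classes if c.status == "proved"), default=0.0)
    top_unresolved = max((c.weight for c in classes if c.status != "proved"), default=0.0)

    return {
        "typed_coverage": typed / total,
        "proved_coverage": proved / total,
        "formal_margin": (top_proved - top_unresolved) / total,
    }


example = [
    AnswerClass("42", 0.60, "proved"),
    AnswerClass("41", 0.25, "typechecked"),
    AnswerClass("40", 0.15, "unformalized"),
]
print(coverage_diagnostics(example))
```

Dividing by the total support keeps the diagnostics comparable across problems even when the raw weights are vote counts rather than normalized probabilities.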
CovCal returns an answer if and only if:
- The proportion of candidate mass reaching Lean is above $\tau_{\mathrm{typ}}$,
- The proved mass is above $\tau_{\mathrm{prf}}$,
- The margin is above $\tau_{M}$,
- No conflict exists among proved classes.
Otherwise, CovCal abstains, optionally falling back to a weaker non-formal selection method (e.g., self-consistency).
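The accept-or-abstain rule can be illustrated with a short Python sketch. It is written under stated assumptions rather than taken from the paper: the function name `covcal_select`, the `(answer, weight, status)` tuple layout, the reading of "conflict" as more than one distinct proved class, the strict inequalities, and the majority-vote fallback are all hypothetical choices for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

Candidate = Tuple[str, float, str]  # (answer, weight, status), hypothetical layout


def self_consistency(classes: Sequence[Candidate]) -> str:
    """Fallback: pick the class with the highest support, ignoring Lean."""
    return max(classes, key=lambda c: c[1])[0]


def covcal_select(
    classes: Sequence[Candidate],
    tau_typ: float,
    tau_prf: float,
    tau_m: float,
    fallback: Optional[Callable[[Sequence[Candidate]], str]] = None,
) -> Tuple[Optional[str], bool]:
    """Return (answer, certified). certified is True only for formal accepts."""
    total = sum(w for _, w, _ in classes)
    proved = [(a, w) for a, w, s in classes if s == "proved"]
    # Assumption: only proved and typechecked classes count as typed.
    typed_mass = sum(w for _, w, s in classes if s in ("proved", "typechecked"))
    proved_mass = sum(w for _, w in proved)
    top_unresolved = max((w for _, w, s in classes if s != "proved"), default=0.0)

    accept = False
    if total > 0 and proved:
        best_answer, best_weight = max(proved, key=lambda p: p[1])
        margin = (best_weight - top_unresolved) / total
        # Strict ">" mirrors "above"; boundary handling is an assumption.
        accept = (
            typed_mass / total > tau_typ
            and proved_mass / total > tau_prf
            and margin > tau_m
            and len({a for a, _ in proved}) == 1  # no conflicting proved classes
        )
        if accept:
            return best_answer, True

    # Abstain, optionally deferring to the weaker non-formal selector.
    if fallback is not None and classes:
        return fallback(classes), False
    return None, False


high = [("42", 0.60, "proved"), ("41", 0.25, "typechecked"), ("40", 0.15, "unformalized")]
low = [("7", 0.20, "proved"), ("8", 0.50, "unformalized"), ("9", 0.30, "timeout")]
print(covcal_select(high, tau_typ=0.8, tau_prf=0.5, tau_m=0.2))
print(covcal_select(low, tau_typ=0.8, tau_prf=0.5, tau_m=0.2, fallback=self_consistency))
```

Returning the `certified` flag alongside the answer keeps fallback answers distinguishable from formally accepted ones, since only the latter fall under the selective-risk certificate.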
A central numerical finding is the empirical "coverage cliff": the accuracy of selecting the most-supported proved class is 96% when proved coverage is high ($C_{\mathrm{prf}} \ge 0.75$), but only 20% when coverage is low ($C_{\mathrm{prf}} < 0.25$). Similarly, partitioning by formal margin shows 98% accuracy for a high margin, but just 10% for negative-margin scenarios. Thus, the reliability of Lean proofs in this context arises only when substantial candidate mass is successfully formalized and resolved.
(Figure 1)
Figure 1: Accepted fraction versus held-out selective-risk upper bound for CovCal and baseline selectors on the main run (dev-then-cal regime). The ε=0.15 risk contour delineates regions of valid acceptance; coverage-dependent selectors show sharply lower accuracy just below the certified threshold.
The formal verification signal is notably sparse and unfaithful: with a 7B autoformalizer, only 28% of problems exhibit at least one proved answer, and a manual audit finds that only a minority of those proofs faithfully correspond to the intended problem (many are vacuously true, e.g., "x = x", or true-but-irrelevant side facts).
Finite-Sample Risk Calibration: Bonferroni and Dev-Then-Cal Paradigms
CovCal employs two selective risk calibration regimes:
- Bonferroni: For a fixed grid over the threshold parameters, select the cell maximizing coverage whose calibration-split Clopper–Pearson upper bound on accepted risk is below the target risk after a union-bound correction over the grid. This regime is conservative and may yield vacuous accept regions if the formal signal is too sparse.
- Dev-Then-Cal: Select the threshold on an independent development split, evaluating the risk only once on the calibration split, enabling tighter certificates when coverage is limited.
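The two regimes can be contrasted with a small standard-library Python sketch. It is an illustration under assumptions, not the paper's code: the function names, the per-cell `(n_accepted, n_errors)` summaries, the confidence parameter `delta`, and the dev-split selection rule (largest accept set whose empirical dev risk is within the target) are hypothetical. The one-sided Clopper–Pearson bound is obtained by bisection on the exact binomial CDF.

```python
import math
from typing import Dict, Hashable, Optional, Tuple

Stats = Dict[Hashable, Tuple[int, int]]  # cell -> (n_accepted, n_errors)


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(errors: int, n: int, alpha: float) -> float:
    """One-sided (1 - alpha) upper confidence bound on the error rate."""
    if n == 0 or errors >= n:
        return 1.0
    lo, hi = errors / n, 1.0
    for _ in range(100):  # the CDF is decreasing in p, so bisection works
        mid = (lo + hi) / 2
        if binom_cdf(errors, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


def bonferroni_select(cal: Stats, epsilon: float, delta: float) -> Optional[Hashable]:
    """Largest-coverage cell whose union-bound-corrected bound is <= epsilon."""
    alpha = delta / len(cal)  # union bound over every cell in the grid
    certified = [
        (n, cell)
        for cell, (n, e) in cal.items()
        if n > 0 and clopper_pearson_upper(e, n, alpha) <= epsilon
    ]
    return max(certified, key=lambda t: t[0])[1] if certified else None


def dev_then_cal(dev: Stats, cal: Stats, epsilon: float, delta: float) -> Optional[Hashable]:
    """Pick one cell on the dev split, then test it once on calibration."""
    feasible = {c: n for c, (n, e) in dev.items() if n > 0 and e / n <= epsilon}
    if not feasible:
        return None
    cell = max(feasible, key=feasible.get)  # assumed dev-split selection rule
    n, e = cal[cell]
    ok = n > 0 and clopper_pearson_upper(e, n, delta) <= epsilon  # no correction
    return cell if ok else None


# Sparse signal: 20 grid cells, each accepting 20 calibration items with 0 errors.
grid = {("cell", i): (20, 0) for i in range(20)}
print(bonferroni_select(grid, epsilon=0.15, delta=0.1))
print(dev_then_cal(grid, grid, epsilon=0.15, delta=0.1))
```

In this toy grid the per-cell level under Bonferroni is `delta / 20`, which inflates the bound past the target even with zero observed errors, while the single calibration test in dev-then-cal spends the whole `delta` on one cell. That is the mechanism behind the tighter certificates described above.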
The feasibility of risk-controlled selection depends fundamentally on autoformalization coverage. With sparse formalization (e.g., 7B Qwen2.5-Coder, […] coverage), the Bonferroni regime certifies no thresholds ([…] partitions in bootstrapped evaluation), while dev-then-cal is feasible only […] times, covering approximately […] of items at […] accuracy. With a prover-specialized formalizer (Goedel-Prover-V2-8B), coverage rises to […] and Bonferroni certifies […] partitions, accepting close to […] of examples at […] accuracy.
Contrast with Baselines and Practical Implications
For context, a pure self-consistency selector, without any Lean verification, already achieves […] accuracy with full coverage. The contribution of risk-controlled formal selection, therefore, is not raw accuracy but the ability to provide a population-level certificate: accepted formal predictions are guaranteed (in the calibration sense) to have error rates below the designated threshold, conditional on the accept criteria.
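One standard way to write such a population-level certificate is given below. The notation is an assumption introduced for illustration and is not taken from the paper: $A_{\tau}(X)$ is the accept indicator under thresholds $\tau$, $\hat{y}(X)$ the selected answer, $\varepsilon$ the target risk (0.15 in Figure 1), and $\delta$ a confidence level that this summary does not name.

```latex
% Selective risk: error rate among accepted items only.
R(\tau) \;=\; \Pr\bigl(\hat{y}(X) \neq Y \,\bigm|\, A_{\tau}(X) = 1\bigr)

% Certificate: with probability at least 1 - \delta over the calibration draw,
% the calibrated thresholds \hat{\tau} keep the selective risk within \varepsilon.
\Pr_{\text{cal}}\bigl( R(\hat{\tau}) \le \varepsilon \bigr) \;\ge\; 1 - \delta
```

The guarantee concerns the accepted subset as a whole; it does not say that any individual accepted answer is correct.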
The fallback mechanism restores full coverage by deferring unresolved cases to the self-consistency baseline, but only the formally accepted subset remains covered by the selective-risk certificate. The diagnostic risk upper bounds for confidence-only abstention and Lean-based selectors demonstrate a sharp tradeoff: higher acceptance fractions correlate with looser (i.e., less predictive) bounds unless formal coverage thresholds are enforced.
CovCal's strict abstention when coverage is inadequate aligns with the mathematical result that unresolved answer classes are indistinguishable given only partial formal logs, precluding non-trivial correctness guarantees except via abstention or further external validation.
Automated rational-evaluation and manual audits of Lean-proved statements reveal significant faithfulness deficits: only a minority of autoformalized, proved answers genuinely correspond to the problem as stated in natural language. Many high-confidence, formally proved answers do not capture the intended semantics—some assert irrelevant or vacuously true statements. This highlights a practical limitation: even with calibrated certificates over the selective subset, true end-to-end reliability depends on the faithfulness of autoformalization procedures, not just the success of Lean verification.
Theoretical Implications and Limitations
The work delivers a rigorous, distribution-free selective prediction certificate for partial-coverage formal reasoning pipelines, advancing the state of the art in safe formal answer selection. The theoretical claim that unresolved answer mass precludes non-vacuous certificates sharply delimits the power of black-box verification: formalization and proving must achieve sufficient coverage and faithfulness before any downstream calibration scheme can guarantee error rates.
Limitations center on autoformalization fidelity and coverage: the certificate is effective only under sufficiently dense and faithful formalization. The empirical evaluation is also limited to MATH-500-style datasets, so transfer to other domains remains unproven. Lastly, the coverage cliff, the sparsity of genuine proofs, and the non-trivial rate of semantically vacuous proofs all caution against over-reliance on coverage-based certificates in contemporary math LLM systems.
Future Directions
Advancement in autoformalization—either via larger, more specialized models or more robust semantic alignment—is pivotal for expanding the fraction of problems certifiable under risk-controlled selection. Enhanced diagnostics for faithfulness checking (possibly leveraging richer semantics or relational checks among proofs) might further increase the reliability and coverage of formal-selective methods. Integrating CovCal with hybrid agentic pipelines or employing retrieval-augmented autoformalizers could also push the coverage frontier.
Conclusion
This study demonstrates that the reliability of using Lean as a downstream judge for natural-language mathematical reasoning is sharply dependent on formal coverage. CovCal provides rigorous, distribution-free selective risk certificates for the subset of examples where sufficient formal evidence supports answer selection. The approach is informative and safe precisely when the autoformalizer is specialized enough to deliver high coverage; otherwise, abstention is not only prudent, but theoretically necessary. The work advances the calibration paradigm for formalized math reasoning with LLMs, making explicit the interplay between prover coverage, diagnostic faithfulness, and the attainable risk guarantees in modern selective prediction.