Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning

Published 27 May 2026 in cs.AI, cs.CL, and cs.LO | (2605.28365v1)

Abstract: Lean is increasingly used to judge natural-language mathematical answers, but its signal is partial: many answers never formalize, and a failed proof may reflect an ill-typed statement or a missing library fact, not a wrong answer. On MATH-500 we show this signal is (i) sharply coverage-dependent, that is, the proof-winning answer is correct 96% of the time at high proved coverage but 20% at low, and (ii) sparse and often unfaithful: a 7B autoformalizer proves a class for only 28% of problems, and a manual audit finds only approximately 43% of those proofs faithful. We propose CovCal, a selector over Lean-trace diagnostics that certifies a finite-sample selective-risk bound on accepted answers or abstains, under two regimes (a conservative Bonferroni bound and a tighter dev-then-cal rule). Feasibility depends on autoformalization coverage: with the 7B formalizer the signal is too sparse and Bonferroni abstains on all 20 bootstrap partitions, whereas a prover-specialized formalizer reaches 79% coverage and flips it to feasible on 17 of 20, accepting approximately 48% of problems at 0.98 accepted accuracy. Since self-consistency alone is already 91% accurate, our contribution is a precise account of when, and with which formalizer, a partial formal signal can be trusted under risk control.


Authors (5)

Summary

The paper introduces CovCal, a method that enforces selective risk control by calibrating formal coverage thresholds in Lean-based mathematical reasoning.
The paper demonstrates that high formal coverage yields accuracy of 96-98%, while accuracy drops to 20% at low proved coverage and to 10% at negative formal margin.
The study highlights the need for improved autoformalization fidelity and proposes calibrated risk strategies to ensure safe verification of natural-language math answers.

Risk-Controlled Lean-as-Judge for Selective Natural-Language Mathematical Reasoning

Problem Statement and Motivation

The paper "Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning" (2605.28365) addresses the reliability of using formal proof assistants—specifically Lean—as partial judges in pipelines designed to solve mathematical reasoning problems stated in natural language. The dominant paradigm leverages autoformalization models to convert candidate answers into Lean theorems, asking the proof assistant to validate or reject each. While a Lean proof provides strong evidence that the associated formalized answer is correct for the formalized statement, failure to prove is fundamentally ambiguous: it may indicate an incorrect answer, a poorly stated formalization, missing background knowledge, or merely a search failure. This ambiguity motivates rigorous calibration procedures to control the risk of erroneous acceptance based on incomplete formal evidence.

A primary empirical finding established by the authors is the pronounced "coverage cliff": the reliability of accepting Lean-verified answers is highly sensitive to the proportion of candidate answer mass that (a) is successfully formalized, and (b) receives a proof from Lean. At high formal coverage, selecting the most-supported, Lean-proved answer is overwhelmingly reliable. At low formal coverage, however, the same procedure yields much lower correctness.

CovCal: Selective Risk-Control over Partial Formalization

To enforce rigorous, finite-sample control over the selective risk for accepting answers based on partial formalization, the authors introduce CovCal—a selector that incorporates metrics over the formalization coverage log and abstains unless pre-specified coverage criteria are met. The candidate answer pipeline decomposes the candidates into equivalence classes (e.g., numerically or algebraically equivalent answers), weighs these by their support (such as self-consistency frequencies), and attempts to formalize and verify each using Lean. The formalization process produces a status for each answer class: proved, typechecked, ill-typed, timeout, or unformalized.

CovCal computes diagnostics including:

Typed coverage: total probability mass of answer classes that successfully produce a Lean statement,
Proved coverage: total mass of classes for which a proof is found and kernel-checked,
Formal margin: difference in support between the highest-weight proved class and the strongest unresolved class.
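The three diagnostics above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `AnswerClass` layout, the `TYPED` set, and the reading of "unresolved" as "any class without a proof" are assumptions made for the example. In particular, the source does not say whether a `timeout` class counts toward typed coverage; here it does not.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AnswerClass:
    """Hypothetical record for one equivalence class of candidate answers."""
    answer: str
    weight: float  # support, e.g. a self-consistency vote count
    status: str    # "proved", "typechecked", "ill-typed", "timeout", "unformalized"


# Assumption: only these statuses mean a Lean statement was produced successfully.
TYPED = {"proved", "typechecked"}


def diagnostics(classes: List[AnswerClass]) -> Dict[str, float]:
    """Compute typed coverage, proved coverage, and formal margin."""
    total = sum(c.weight for c in classes)
    typed = sum(c.weight for c in classes if c.status in TYPED) / total
    proved_w = [c.weight for c in classes if c.status == "proved"]
    # Assumption: every class without a proof counts as unresolved.
    unresolved_w = [c.weight for c in classes if c.status != "proved"]
    proved = sum(proved_w) / total
    margin = (max(proved_w, default=0.0) - max(unresolved_w, default=0.0)) / total
    return {"typed": typed, "proved": proved, "margin": margin}


example = [
    AnswerClass("42", 6, "proved"),
    AnswerClass("41", 3, "typechecked"),
    AnswerClass("7", 1, "unformalized"),
]
result = diagnostics(example)
```

With these hypothetical vote counts, the proved class holds 6 of 10 votes and the strongest unresolved class holds 3, so the margin is positive.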

CovCal returns an answer if and only if:

The proportion of candidate mass reaching Lean is above $\tau_{typ}$,
The proved mass is above $\tau_{prf}$,
The margin is above $\tau_M$,
No conflict exists among proved classes.

Otherwise, CovCal abstains, optionally falling back to a weaker non-formal selection method (e.g., self-consistency).
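The accept-or-abstain rule and the optional fallback can be sketched as follows. The function names are hypothetical, and two details are assumptions rather than statements from the paper: thresholds are compared with `>=`, and a "conflict" is taken to mean that more than one distinct answer class was proved.

```python
from typing import Optional, Sequence, Tuple


def covcal_decide(
    typed_cov: float,
    proved_cov: float,
    margin: float,
    proved_classes: Sequence[Tuple[str, float]],  # (answer, weight) per proved class
    tau_typ: float,
    tau_prf: float,
    tau_m: float,
) -> Optional[str]:
    """Return the accepted answer, or None to abstain."""
    # Assumption: zero proved classes gives nothing to accept,
    # and more than one proved class is treated as a conflict.
    if len(proved_classes) != 1:
        return None
    if typed_cov < tau_typ or proved_cov < tau_prf or margin < tau_m:
        return None
    return proved_classes[0][0]


def answer_with_fallback(decision: Optional[str], sc_answer: str) -> Tuple[str, bool]:
    """Fall back to the self-consistency answer on abstention.

    The boolean records whether the answer is covered by the formal certificate.
    """
    if decision is not None:
        return decision, True
    return sc_answer, False


accepted = covcal_decide(0.9, 0.8, 0.5, [("42", 0.8)], 0.5, 0.5, 0.1)
abstained = covcal_decide(0.9, 0.2, 0.5, [("42", 0.2)], 0.5, 0.5, 0.1)
```

Keeping the certified flag separate from the answer makes explicit that fallback answers restore coverage but carry no selective-risk guarantee.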

Coverage Cliff and Sparsity in Formal Verification

A central numerical finding is the empirical "coverage cliff": the accuracy of selecting the most-supported proved class is $96\%$ when proved coverage is high ($C_{prf} \ge 0.75$), but only $20\%$ when coverage is low ($C_{prf} < 0.25$). Similarly, partitioning by formal margin shows $98\%$ accuracy for a high margin, but just $10\%$ for negative margin scenarios. Thus, the reliability of Lean proofs in this context arises only when substantial candidate mass is successfully formalized and resolved.
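A bucketed accuracy measurement of this kind could be computed roughly as below. The cutoffs 0.25 and 0.75 are the ones quoted above; the function name, the record layout, and the sample records are hypothetical and do not reproduce the paper's data.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def accuracy_by_coverage(
    records: Iterable[Tuple[float, bool]],  # (proved coverage, selected answer correct?)
    low: float = 0.25,
    high: float = 0.75,
) -> Dict[str, Optional[float]]:
    """Accuracy of the selected proved answer within each coverage bucket."""
    buckets: Dict[str, List[bool]] = {"low": [], "mid": [], "high": []}
    for coverage, correct in records:
        if coverage >= high:
            key = "high"
        elif coverage < low:
            key = "low"
        else:
            key = "mid"
        buckets[key].append(bool(correct))
    # An empty bucket has no defined accuracy, so report None rather than 0.
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}


# Made-up records, for illustration only.
sample = [(0.9, True), (0.8, True), (0.1, False), (0.2, True), (0.5, True)]
by_bucket = accuracy_by_coverage(sample)
```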

(Figure 1)

Figure 1: Accepted fraction versus held-out selective-risk upper bound for CovCal and baseline selectors on the main run (dev-then-cal regime). The ε=0.15 risk contour delineates regions of valid acceptance; coverage-dependent selectors show sharply lower accuracy just below the certified threshold.

The formal verification signal is notably sparse and unfaithful: with a 7B autoformalizer, only $28\%$ of problems exhibit at least one proved answer, and a manual audit finds that just approximately $43\%$ of those proofs faithfully correspond to the intended problem (many are vacuously true, e.g., "x = x", or true-but-irrelevant side facts).
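To make the faithfulness gap concrete, the Lean sketch below contrasts a vacuous formalization with a faithful one. The problem ("solve 2x + 3 = 7", candidate answer x = 2) and both theorem names are hypothetical, chosen only for illustration; they are not taken from the paper's audit. The snippet assumes Lean 4 with Mathlib available.

```lean
import Mathlib

-- Vacuous: Lean accepts this proof, but the statement never mentions
-- the equation 2x + 3 = 7, so it says nothing about the candidate answer.
theorem vacuous_formalization (x : ℝ) : x = x := rfl

-- Faithful: the problem's equation appears as a hypothesis and the
-- candidate answer appears as the conclusion.
theorem faithful_formalization (x : ℝ) (h : 2 * x + 3 = 7) : x = 2 := by
  linarith
```

Both theorems would be logged as "proved", which is why a proof status alone cannot distinguish them.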

Finite-Sample Risk Calibration: Bonferroni and Dev-Then-Cal Paradigms

CovCal employs two selective risk calibration regimes:

Bonferroni: For a fixed grid over the threshold parameters, select the cell maximizing coverage whose calibration-split Clopper–Pearson upper bound on accepted risk is below the target risk after a union-bound correction over the grid. This regime is conservative and may yield vacuous accept regions if the formal signal is too sparse.
Dev-Then-Cal: Select the threshold on an independent development split, evaluating the risk only once on the calibration split, enabling tighter certificates when coverage is limited.
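The two regimes can be sketched with a one-sided Clopper-Pearson upper bound computed by bisection on the exact binomial CDF. This is an illustration under stated assumptions, not the paper's code: the function names, the per-cell `(errors, accepted)` input format, the confidence parameter `delta`, and the dev-split selection criterion (largest accepted count with empirical dev risk at most `epsilon`) are all hypothetical.

```python
from math import comb
from typing import Dict, Optional, Tuple

Stats = Dict[str, Tuple[int, int]]  # grid cell -> (errors, accepted count)


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def cp_upper(errors: int, n: int, delta: float) -> float:
    """One-sided Clopper-Pearson upper bound on the error rate at level 1 - delta."""
    if n == 0 or errors >= n:
        return 1.0  # no accepted items, or all wrong: the bound is vacuous
    lo, hi = 0.0, 1.0
    for _ in range(60):  # the CDF is decreasing in p, so bisection applies
        mid = (lo + hi) / 2
        if binom_cdf(errors, n, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi


def bonferroni_select(cal: Stats, epsilon: float, delta: float) -> Optional[str]:
    """Largest-coverage cell whose bound at level delta / |grid| is at most epsilon."""
    ok = [c for c, (e, a) in cal.items()
          if a > 0 and cp_upper(e, a, delta / len(cal)) <= epsilon]
    return max(ok, key=lambda c: cal[c][1]) if ok else None


def dev_then_cal(dev: Stats, cal: Stats, epsilon: float, delta: float) -> Optional[str]:
    """Pick one cell on the dev split, then test it once on the calibration split."""
    # Assumption: the dev criterion is empirical risk <= epsilon, maximizing accepted count.
    ok = [c for c, (e, a) in dev.items() if a > 0 and e / a <= epsilon]
    if not ok:
        return None
    cell = max(ok, key=lambda c: dev[c][1])
    errors, accepted = cal[cell]
    return cell if accepted > 0 and cp_upper(errors, accepted, delta) <= epsilon else None


cal_stats = {"loose": (12, 100), "strict": (0, 40)}
chosen = bonferroni_select(cal_stats, epsilon=0.15, delta=0.1)
```

Because dev-then-cal tests a single pre-chosen cell, it spends the full `delta` on one bound instead of dividing it across the grid. Returning `None` in either function corresponds to abstaining everywhere.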

The feasibility of risk-controlled selection depends fundamentally on autoformalization coverage. With sparse formalization (e.g., 7B Qwen2.5-Coder, $28\%$ coverage), the Bonferroni regime certifies no thresholds ($0$ of $20$ partitions in bootstrapped evaluation), while dev-then-cal is feasible on only a subset of the partitions. With a prover-specialized formalizer (Goedel-Prover-V2-8B), coverage rises to $79\%$ and Bonferroni certifies $17$ of $20$ partitions, accepting close to $48\%$ of examples at $0.98$ accuracy.

Contrast with Baselines and Practical Implications

For context, a pure self-consistency selector, without any Lean verification, already achieves $91\%$ accuracy with full coverage. The contribution of risk-controlled formal selection, therefore, is not raw accuracy but the ability to provide a population-level certificate: accepted formal predictions are guaranteed (in the calibration sense) to have error rates below the designated threshold, conditional on the accept criteria.

The fallback mechanism restores full coverage by deferring unresolved cases to the self-consistency baseline, but only the formally accepted subset remains covered by the selective-risk certificate. The diagnostic risk upper bounds for confidence-only abstention and Lean-based selectors demonstrate a sharp tradeoff: higher acceptance fractions correlate with looser (i.e., less predictive) bounds unless formal coverage thresholds are enforced.

CovCal's strict abstention when coverage is inadequate aligns with the mathematical result that unresolved answer classes are indistinguishable given only partial formal logs, precluding non-trivial correctness guarantees except via abstention or further external validation.

Faithfulness Audits and the Limits of Autoformalization

Automated rational-evaluation and manual audits of Lean-proved statements reveal significant faithfulness deficits: only a minority of autoformalized, proved answers (approximately 43% in the manual audit of the 7B autoformalizer) genuinely correspond to the problem as stated in natural language. Many high-confidence, formally proved answers do not capture the intended semantics; some assert irrelevant or vacuously true statements. This highlights a practical limitation: even with calibrated certificates over the selective subset, true end-to-end reliability depends on the faithfulness of autoformalization procedures, not just the success of Lean verification.

Theoretical Implications and Limitations

The work delivers a rigorous, distribution-free selective prediction certificate for partial-coverage formal reasoning pipelines, advancing the state of the art in safe formal answer selection. The theoretical claim that unresolved answer mass precludes non-vacuous certificates sharply delimits the power of black-box verification: formalization and proving must achieve sufficient coverage and faithfulness before any downstream calibration scheme can guarantee error rates.

Limitations center on autoformalization fidelity and coverage: the certificate is effective only under sufficiently dense and faithful formalization. The empirical evaluation is also limited to MATH-500-style datasets, so transfer to other domains remains unproven. Lastly, the coverage cliff, the sparsity of genuine proofs, and the non-trivial rate of semantically vacuous proofs all caution against over-reliance on coverage-based certificates in contemporary math LLM systems.

Future Directions

Advancement in autoformalization—either via larger, more specialized models or more robust semantic alignment—is pivotal for expanding the fraction of problems certifiable under risk-controlled selection. Enhanced diagnostics for faithfulness checking (possibly leveraging richer semantics or relational checks among proofs) might further increase the reliability and coverage of formal-selective methods. Integrating CovCal with hybrid agentic pipelines or employing retrieval-augmented autoformalizers could also push the coverage frontier.

Conclusion

This study demonstrates that the reliability of using Lean as a downstream judge for natural-language mathematical reasoning is sharply dependent on formal coverage. CovCal provides rigorous, distribution-free selective risk certificates for the subset of examples where sufficient formal evidence supports answer selection. The approach is informative and safe precisely when the autoformalizer is specialized enough to deliver high coverage; otherwise, abstention is not only prudent, but theoretically necessary. The work advances the calibration paradigm for formalized math reasoning with LLMs, making explicit the interplay between prover coverage, diagnostic faithfulness, and the attainable risk guarantees in modern selective prediction.
