Pseudo-Formalization for Automatic Proof Verification

Published 19 May 2026 in cs.LO and cs.LG | (2605.20531v1)

Abstract: Reliable verification of proofs remains a bottleneck for training and evaluating AI systems on hard mathematical reasoning. Fully formal proofs, in languages like Lean, are easy to verify because they are unambiguous and modular. Most proofs, particularly those written by AI systems, have neither property, and translating them into formal languages remains challenging in many frontier math settings. We propose Pseudo-Formalization (PF), a proof format that captures the modularity and precision of formal proofs while retaining the flexibility of natural language. A Pseudo-Formal proof is decomposed into self-contained modules, each stating its premises, conclusion, and proof in natural language. To verify the correctness of a regular natural language proof, an LLM translates it to Pseudo-Formal and then verifies each module independently, an algorithm we call Block Verification (BV). We evaluate PF+BV on two benchmarks spanning olympiad and research-level mathematics, where it Pareto-dominates LLM-as-judge baselines on error-finding precision and recall. To support future work, we release our research-level proof verification benchmark ArxivMathGradingBench.


Authors (5)

Summary

The paper presents a novel PF+BV framework that transforms natural language proofs into modular, verifiable components using block verification.
It employs LLMs to independently check each proof segment, reducing context requirements and minimizing false positives.
Empirical evaluations demonstrate that PF+BV achieves near 90% recall and superior error localization on both competition-level and research-level proofs.

Pseudo-Formalization for Automatic Proof Verification: Formalization Framework and Empirical Evaluation

Introduction and Motivation

The paper "Pseudo-Formalization for Automatic Proof Verification" (2605.20531) addresses a persistent challenge in AI-driven mathematics: automatic, reliable verification of mathematical proofs written in natural language, especially those produced by AI systems. Fully formal proofs in languages such as Lean or Isabelle admit mechanical verification because of their semantic precision and modular structure, but most proofs written by humans or AI have neither property, and translating them into fully formal languages remains infeasible in many advanced mathematical settings. The authors introduce Pseudo-Formalization (PF), a structured proof format that retains the modularity and precision needed for verification while keeping the flexibility of natural language. PF decomposes a proof into self-contained modules, each declaring its premises, conclusion, and proof. The Block Verification (BV) algorithm then uses LLMs to verify each module independently. The authors position this as a form of scalable oversight for mathematical argumentation that extends proof verification beyond the domains accessible to current formal proof systems.

Figure 1: PF verification pipeline: translation from a natural language proof to its PF representation, followed by block-wise verification, with the resulting precision-recall gains over LLM-as-judge baselines.

Pseudo-Formal Proofs: Structure and Theoretical Properties

PF proofs are defined as sequences of modules $(P, c, \pi)$ , corresponding to premises, conclusion, and proof respectively, all in natural language. Proof structure is encoded via:

A directed acyclic graph (DAG) $G$ capturing invocation dependencies, i.e., which modules invoke others.
A scope-inheritance forest $T$ for hierarchical organization of premises and context.
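The module triple, the invocation DAG, and the scope forest can be pictured as a small data structure. The following is a minimal sketch only: the class name `Module`, its field names, and both helper functions are hypothetical choices made for illustration and are not taken from the paper. It assumes each module records which modules it invokes (edges of $G$) and at most one enclosing scope (its parent in $T$).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Optional


@dataclass
class Module:
    """One PF module (P, c, pi); every field holds natural-language text."""
    name: str
    premises: list[str]
    conclusion: str
    proof: str
    invokes: list[str] = field(default_factory=list)  # outgoing edges in the DAG G
    parent: Optional[str] = None                       # enclosing scope in the forest T


def verification_order(modules: dict[str, Module]) -> list[str]:
    """List module names so that every invoked module precedes its caller.

    graphlib raises CycleError if the invocation graph is not acyclic.
    """
    graph = {name: set(m.invokes) for name, m in modules.items()}
    return list(TopologicalSorter(graph).static_order())


def inherited_premises(modules: dict[str, Module], name: str) -> list[str]:
    """Collect the premises visible to a module by walking up its scope chain."""
    visible: list[str] = []
    current: Optional[str] = name
    while current is not None:
        visible = modules[current].premises + visible  # outer scopes come first
        current = modules[current].parent
    return visible


# Toy proof (invented for illustration, not from the paper).
modules = {
    "lemma_1": Module("lemma_1", ["n is an even integer"], "n^2 is even",
                      "Write n = 2k; then n^2 = 4k^2 = 2(2k^2)."),
    "main": Module("main", ["n is an even integer"], "n^2 + 2 is even",
                   "By lemma_1, n^2 is even; adding 2 preserves parity.",
                   invokes=["lemma_1"]),
    "case_a": Module("case_a", ["n = 2k with k odd"], "n^2 is divisible by 4",
                     "n^2 = 4k^2.", parent="main"),
}

order = verification_order(modules)
print(inherited_premises(modules, "case_a"))
# ['n is an even integer', 'n = 2k with k odd']
```

Keeping invocation edges and scope parents as separate fields mirrors the separation between $G$ and $T$: the first determines which conclusions a block may rely on, the second determines which premises it inherits.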

This modular structure allows each block to be verified locally, leveraging minimal context and explicit premises. The theoretical analysis formalizes modularity benefits:

The context required for block-wise verification of PF proofs is independent of proof length, i.e., $O(L)$ where $L$ bounds local block length, as opposed to $O(n)$ context for whole-proof verification, where $n$ is total proof length.
The modularization enables verification in the short-context regime, mitigating context rot observed in LLMs when context exceeds their effective capacity.

The authors define "Good" PF proofs as those with bounded scope depth and concise blocks; formally, $\operatorname{depth}(T) < D$ and $|P|,|c|,|\pi| < L$ for constants $D,L$ independent of the total proof length $n$. The context-cost theorem states that a fixed-size transformer can simulate block-wise verification by aggregating a sequence of block-level calls, each with context of size $O(L)$.
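To make the bounded-context claim concrete, the sketch below checks the two "Good" conditions and compares the largest single-block context with the whole-proof context. It is an illustration under stated assumptions, not the paper's theorem: length is approximated by whitespace-separated word count, each block is a `(premises, conclusion, proof)` tuple of strings, and the function names and the toy block are hypothetical.

```python
def size(text: str) -> int:
    """Crude length proxy (an assumption): number of whitespace-separated words."""
    return len(text.split())


def is_good(blocks, depth: int, D: int, L: int) -> bool:
    """Check depth(T) < D and |P|, |c|, |pi| < L for every block."""
    return depth < D and all(
        size(p) < L and size(c) < L and size(pi) < L for p, c, pi in blocks
    )


def context_costs(blocks):
    """Return (largest single-block context, whole-proof context)."""
    per_block = [size(p) + size(c) + size(pi) for p, c, pi in blocks]
    return max(per_block), sum(per_block)


# One toy block repeated to simulate a short and a long proof.
block = ("x is a real number", "x^2 is nonnegative", "a square of a real is nonnegative")
short_proof = [block] * 3
long_proof = [block] * 300

print(context_costs(short_proof))  # (15, 45)
print(context_costs(long_proof))   # (15, 4500)
print(is_good(long_proof, depth=1, D=4, L=10))  # True
```

The whole-proof cost grows with the number of blocks while the per-block cost stays fixed, which is the behaviour the short-context argument relies on.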

Figure 1: Diagram of PF block decomposition for proof verification.

PF+BV Pipeline and Verification Procedure

The pipeline for PF+BV proceeds in four stages:

Translation: Natural language proofs are rewritten into PF format via LLMs, optionally employing a self-repair loop to patch discrepancies between the original proof and its pseudo-formalized version.
Block Verification: Each module is checked independently by a verifier LLM, which either accepts it or flags errors, under the assumption that the modules it invokes are correct.
Calibration: Error reports are aggregated by a calibrator LLM, parameterized by natural language strictness, to issue a final verdict—binary, numerical, or error-located.
Parallel Aggregation: All prior steps are run in multiple parallel rollouts; pessimistic aggregation is used to maximize error recall (the proof passes only if every rollout accepts).

This modularity, together with pessimistic parallel scaling, allows performance tuning along the precision-recall axis, adapting to stricter or more permissive downstream requirements.
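The control flow of block verification with pessimistic aggregation can be sketched as follows. This is a simplified illustration, not the authors' implementation: the translation stage is omitted, the verifier is any callable returning a list of error strings (here toy lambdas stand in for an LLM), and the calibrator is replaced by a numeric threshold `max_errors`, whereas the paper describes an LLM calibrator steered by natural-language strictness. All names are hypothetical.

```python
from typing import Callable

# Hypothetical block layout: {"name": ..., "premises": ..., "conclusion": ..., "proof": ...}
Block = dict
Verifier = Callable[[Block], list]  # returns a list of error descriptions


def block_verify(blocks: list, verify: Verifier) -> list:
    """Check each block on its own and collect (block name, error) pairs."""
    return [(b["name"], err) for b in blocks for err in verify(b)]


def calibrate(reports: list, max_errors: int = 0) -> bool:
    """Stand-in calibrator: accept iff at most max_errors errors were reported."""
    return len(reports) <= max_errors


def pessimistic_verdict(blocks: list, verifiers: list) -> bool:
    """One rollout per verifier; the proof passes only if every rollout accepts."""
    return all(calibrate(block_verify(blocks, v)) for v in verifiers)


# Toy data and toy verifiers, invented for illustration.
blocks = [
    {"name": "lemma_1", "premises": "n is even", "conclusion": "n^2 is even",
     "proof": "Write n = 2k, so n^2 = 4k^2."},
    {"name": "main", "premises": "n is even", "conclusion": "n^2 + 1 is even",
     "proof": "Clearly the claim follows from lemma_1."},
]
lenient = lambda b: []
strict = lambda b: ["unjustified step"] if "clearly" in b["proof"].lower() else []

print(block_verify(blocks, strict))                    # [('main', 'unjustified step')]
print(pessimistic_verdict(blocks, [lenient, lenient]))  # True
print(pessimistic_verdict(blocks, [lenient, strict]))   # False
```

Because a single rejecting rollout fails the proof, adding rollouts can only raise recall on errors, at the cost of more false alarms; the calibrator threshold is the knob that trades some of that recall back for precision.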

Experimental Evaluation

The authors conduct empirical evaluation on two benchmarks:

Hard2Verify: Olympiad/Putnam-level proofs generated by frontier AIs, with step-level correctness labels.
ArxivMathGradingBench: A new dataset of 35 research-level arXiv papers with known author-corrected errors, capturing realistic error localization scenarios.

Strong numerical results include:

PF+BV achieves near-90% recall at step level on Hard2Verify, closely matching the baseline but demonstrating superior performance on longer, harder proofs in ArxivMathGradingBench.
On ArxivMathGradingBench, PF+BV significantly outperforms the baseline on recall and error localization, raising fewer false alarms and covering a larger fraction of annotated errors per paper.
PF+BV provides a Pareto-dominant precision-recall tradeoff relative to LLM-as-judge at all recall levels, and reduces mean number of false errors per proof by over 20% at $G$ 4.

Figure 2: Left: Step-level recall as a function of the number of parallel verification attempts. PF+BV achieves higher recall and coverage, especially on longer proofs. Middle: Mean number of false errors per proof/paper. PF+BV flags fewer false errors. Right: Coverage of true errors per proof/paper.
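For readers less familiar with the metrics named above, the sketch below shows one common way to compute error-finding precision, recall, false-error count, and a pointwise Pareto-dominance check. The definitions are standard, but whether the paper uses exactly these conventions is an assumption, and every number in the example is invented for illustration rather than taken from the reported results. Note also that the paper's claim concerns whole precision-recall curves, while this check compares only two operating points.

```python
def precision_recall(flagged: set, true_errors: set):
    """Precision and recall of flagged steps against annotated error steps."""
    hits = len(flagged & true_errors)
    precision = hits / len(flagged) if flagged else 1.0
    recall = hits / len(true_errors) if true_errors else 1.0
    return precision, recall


def false_errors(flagged: set, true_errors: set) -> int:
    """Number of flagged steps that are not annotated as errors."""
    return len(flagged - true_errors)


def pareto_dominates(a, b) -> bool:
    """a dominates b if it is at least as good on every metric and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


# Invented step indices for a single proof (not data from the paper).
true_steps = {2, 5}
method_a = {2, 5, 7}      # finds both errors, one false alarm
method_b = {2, 3, 4, 9}   # finds one error, three false alarms

pr_a = precision_recall(method_a, true_steps)
pr_b = precision_recall(method_b, true_steps)
print(pr_b)                                # (0.25, 0.5)
print(false_errors(method_a, true_steps))  # 1
print(pareto_dominates(pr_a, pr_b))        # True
```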

The qualitative analysis demonstrates PF+BV's ability to localize errors. In the IMO 2024 P6 example, modularization isolates a logical misreading—strengthening "cannot contain both $G$ 6 and $G$ 7" to "all nonzero elements have same sign"—allowing the verifier to flag a non-trivial error missed by holistic approaches.

PF+BV connects to several lines of work in automatic proof verification:

Legibility: training models to produce outputs that are easy to verify, as in prover-verifier games and scalable oversight, which PF+BV pursues through explicit modularization.
Autoformalization: pipelines that translate natural-language proofs into fully formal languages for mechanized verification, but that require complete formal libraries and incur high translation cost.
Chunk-wise and comparative checking: modular checking complements prior strategies such as fixed-size chunk verification by LLMs and tournament-style proof comparison.

The PF+BV framework may also apply in domains with modular argumentation beyond mathematics (for instance, empirical sciences or law), where arguments can be decomposed into explicit premises and conclusions.

Limitations and Future Work

Several limitations are acknowledged:

Production of faithful PF rewrites is non-trivial; translation fidelity is empirically improved via self-repair, but does not guarantee correctness.
Benchmark error labels are proxies for true error identification, potentially undercounting errors detected by PF+BV.
PF+BV is currently restricted to mathematical proofs; extension to other domains requires adaptation of modularization and explicit premise-conclusion decomposition.

Future directions include training models to directly produce PF-format proofs, deeper integration with legibility-target architectures, and extension to additional scientific domains.

Conclusion

Pseudo-Formalization (PF) and Block Verification (BV) provide a rigorous, scalable framework for automatic verification of mathematical proofs in natural language, bridging the gap between formal and informal proof formats. The PF+BV pipeline enables modular, short-context verification that demonstrably outperforms strong baseline LLM-as-judge systems on both competition-level and research-level benchmarks, with reduced false positives and improved error localization. Its theoretical guarantees and empirical results establish PF+BV as a viable foundation for reliable proof verification in advanced mathematical reasoning, with practical implications for AI training, referee processes, and scalable evaluation of machine mathematics.
