Benchmarking Testing in Automated Theorem Proving

Published 26 Apr 2026 in cs.CL and cs.FL | (2604.23698v1)

Abstract: Recent advances in LLMs have shown promise in formal theorem proving, yet evaluating semantic correctness remains challenging. Existing evaluations rely on indirect proxies such as lexical overlap with human-annotated proofs, or expensive manual inspection. Inspired by the shift from lexical comparison to test-based evaluation in code generation, we propose T2, a framework that evaluates the semantic correctness of formal theorems: a generated theorem is considered correct only if all dependent successor theorems compile successfully, analogous to integration testing. We construct a benchmark from 5 real-world Lean 4 repositories, comprising 2,206 problems paired with 41 successor theorems on average, automatically extracted without human effort. Experiments demonstrate that while state-of-the-art models achieve high compilation success, they perform significantly worse under our semantic metric. The best model, Claude-Sonnet-4.5, achieves only 38.9% Testing Accuracy on the full set, given both natural language proof and successor theorems as context, revealing a critical gap in current theorem generation capabilities.

Authors (3)

Summary

The paper introduces T2, a successor-theorem-driven testing framework, and the Testing Accuracy (TA) metric to evaluate semantic correctness in automated theorem proving systems.
It demonstrates that conventional compilation metrics yield many false positives: the best model reaches 80.3% compilation accuracy but only 38.9% TA on the full set.
The study highlights the need for dependency-aware training and enhanced context integration to bridge the gap between syntactic fluency and true semantic correctness.

Benchmarking Semantic Correctness in Automated Theorem Proving: The T2 Framework and Testing Accuracy Metric

Introduction

The application of LLMs to Automated Theorem Proving (ATP) has advanced rapidly, and models can now generate formally verifiable proofs in proof assistants such as Lean. However, prevailing evaluation methodologies largely measure syntactic and logical correctness, commonly reducing to whether generated statements compile in isolation. Such compilation-based metrics are insufficiently stringent: they fail to assess whether the generated theorem semantically realizes the intended mathematical proposition. The paper "Benchmarking Testing in Automated Theorem Proving" (2604.23698) addresses this shortcoming by proposing a successor-theorem-driven testing framework called T2 and introducing Testing Accuracy (TA), a fully automatic metric of semantic correctness for theorem generation.

Compilation-Based Evaluation vs. Semantic Correctness

Existing formal theorem proving benchmarks either depend on code compilation or rely on proxy metrics, such as surface-level lexical similarity (e.g., BLEU), entailment checks, or costly human inspection. The consensus has been that these approaches cannot guarantee that a generated theorem faithfully expresses the desired mathematics. The compilation metric, by construction, only ensures that a statement is well-typed and that its proof checks within the formal system, which leads to numerous false positives: semantically incorrect responses that still pass.

A paradigmatic example is illustrated below. Here, both a commutativity theorem and its tautological incorrect variant are accepted by compilation, yet their distinction is critical for downstream correctness.

Figure 1: Compilation alone cannot distinguish between a semantically correct commutativity theorem and a trivial tautology when tested in isolation; only integration testing with successor theorems exposes the tautology as incorrect.
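The situation described in Figure 1 can be illustrated with a small Lean 4 sketch. This is not the paper's figure; the theorem names `my_add_comm`, `my_add_comm_bad`, and `swap_sum` are hypothetical and chosen only for illustration. Both candidate statements compile in isolation, but only the first supports the successor theorem.

```lean
-- Intended statement: addition on natural numbers is commutative.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Tautological variant: it compiles, but says nothing about commutativity.
theorem my_add_comm_bad (a b : Nat) : a + b = a + b :=
  rfl

-- Successor theorem that depends on the intended statement.
theorem swap_sum (x y : Nat) : x + y = y + x :=
  my_add_comm x y

-- If the statement of `my_add_comm` were replaced by the tautological one,
-- the term `my_add_comm x y` would have type `x + y = x + y`, and
-- `swap_sum` would be rejected with a type mismatch.
```

The successor acts as a test: it type-checks only when the theorem it depends on states what the downstream proof actually needs.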

The T2 Benchmark: Integration Testing and Semantic Verification

Motivated by integration testing in software engineering and the Curry-Howard correspondence linking proofs and programs, T2 operationalizes semantic evaluation by exploiting the dependency structure inherent in real-world mathematical developments. In T2, each generated theorem is tested against all of its successor theorems, that is, the theorems in the proof repository that depend on it directly or transitively. TA (Testing Accuracy) is defined as the fraction of problems for which replacing the ground-truth theorem with the model's output does not break any successor.

This approach is instantiated over a large, automatically extracted Lean 4 benchmark (2,206 problems; ~41 successors each), providing rich, realistic semantic constraints. This structure enables T2 to distinguish between logically correct but semantically irrelevant outputs and genuinely correct theorem generations.

Quantitative Evaluation and Model Performance

Evaluation of 18 open- and closed-source LLMs on T2 reveals substantial divergence between compilation-based and semantically aware metrics:

High false positive rates: For some models, over 93% of compilable theorems are semantically incorrect as determined by T2.
Lexical metrics are unreliable: BLEU and similar surface comparison methods cannot distinguish correct from incorrect responses, as shown in cross-bin histograms.
Figure 2: Proportion of semantically correct (green) and incorrect (red) samples as a function of BLEU score—high BLEU does not guarantee semantic correctness.
Specialized provers overfit to syntax: Many specialized provers achieve higher compilation accuracy without improving semantic correctness.
Dependency coverage increases discrimination: As the number of enforced successor dependencies grows, TA becomes increasingly strict and meaningful, with most T2 benchmarks reaching depth 7 (over 1,600 successors).

Figure 3: (a) Distribution of successor theorem depth in the T2 Hard set; (b) Testing Accuracy sharply decreases as the number of successor constraints increases, indicating evaluation stringency.

Successor theorem context matters: Supplying both the successor theorem and the NL proof context achieves the strongest TA scores, indicating the practical utility of dependency information in guiding LLM formalization.

Numerically, the top-performing model, Claude-Sonnet-4.5, obtains only 38.9% TA on the full test set despite 80.3% compilation accuracy, and only 4.5% TA on the most challenging subset (T2 Hard). This 2× to 50× drop demonstrates the fundamental overestimation embedded in conventional evaluation practices.
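The full-set figures can be checked with a short calculation. The sketch below uses only the two reported numbers (80.3% compilation accuracy, 38.9% TA); the helper names are hypothetical, and the false-positive share rests on the assumption that every output passing the successor tests also compiles, so that the passing outputs are a subset of the compilable ones.

```python
def compile_to_ta_ratio(compile_acc, testing_acc):
    """How many times larger compilation accuracy is than Testing Accuracy."""
    return compile_acc / testing_acc


def false_positive_share(compile_acc, testing_acc):
    """Share of compilable outputs that fail the successor tests.

    Assumption: any output that passes testing also compiles.
    """
    return 1.0 - testing_acc / compile_acc


# Reported full-set figures for Claude-Sonnet-4.5.
full_ratio = compile_to_ta_ratio(80.3, 38.9)
full_fp = false_positive_share(80.3, 38.9)
print(f"{full_ratio:.2f}x, {full_fp:.1%}")  # 2.06x, 51.6%
```

Under that assumption, roughly half of the compilable outputs from the best model on the full set would be false positives of a compilation-only metric, which matches the lower end of the reported drop.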

Figure 4: Compilation gains (x-axis) for specialized formalization models do not translate to gains in Testing Accuracy (y-axis), underscoring the gap between syntactic fluency and semantic correctness.

Figure 5: Compilation and testing accuracy for a range of LLMs on T2 Hard, stratified by open/closed source and parameter scale. Fine-tuned and larger models improve compilation, but semantic correctness remains uniformly low.

Theoretical and Practical Implications

Theoretical

T2 promotes a formal view of semantic evaluation as observational equivalence in context, rather than mere syntactic validation. This aligns with foundational results in proof theory (e.g., Cut Elimination), operationalizing correctness through compositional behavior within the global mathematical system. The approach shifts the burden from unattainable full logical equivalence tests to comprehensive contextual behavioral testing, a move supported by deep analogies with software test coverage.

Practical

Tools and benchmarks rooted in T2 and TA provide greater alignment with real-world mathematical workflows, where theorems exist as interconnected nodes rather than isolated solutions. The framework is fully automatic and scalable, requiring no ground-truth reference proofs or annotation. Implementing better evaluation and trust metrics will be essential as LLMs are integrated into research mathematics, education, and formal verification pipelines.

Importantly, the severe semantic gap uncovered indicates that simply scaling models or further fine-tuning on domain-specific corpora is insufficient. Advances in LLM architectures, enhanced context integration, and dependency-aware training paradigms appear necessary to close this gap.

Limitations and Future Directions

The strictness of TA is contingent upon the depth and breadth of successor theorem coverage; standalone theorems or those with few dependents cannot benefit from this method. As with software testing, coverage cannot confer guarantees of completeness in error detection. The present implementation is restricted to Lean 4 but is designed to be extensible to other formal proof assistants with suitable dependency-tracking infrastructure.

Future work includes development of graded TA metrics (partial credit for subsets of satisfied dependencies), expansion to non-Lean proof systems, enrichment of repository coverage, and more robust natural language to formal statement mapping validated by domain experts.
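One possible reading of a graded TA metric is sketched below, purely as an illustration of the idea of partial credit. The scoring rule (the fraction of successors that still compile) and the function names are assumptions, not a definition from the paper.

```python
def strict_score(successor_results):
    """Current all-or-nothing rule: 1.0 only if every successor compiles."""
    return 1.0 if all(successor_results) else 0.0


def graded_score(successor_results):
    """Hypothetical partial credit: fraction of successors that compile.

    Assumption: a theorem with no successors scores 1.0, mirroring the
    vacuous pass it would receive under the strict rule.
    """
    if not successor_results:
        return 1.0
    return sum(successor_results) / len(successor_results)


# Toy example: three of four successor theorems still compile.
results = [True, True, False, True]
print(strict_score(results), graded_score(results))  # 0.0 0.75
```

A graded score of this kind would separate near-misses from candidates that break most of their dependents, at the cost of no longer certifying that the whole downstream development still builds.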

Conclusion

T2 and the Testing Accuracy metric represent a substantive step forward in rigorously evaluating ATP systems for mathematical correctness, revealing a critical and persistent semantic gap in contemporary LLMs. The results underscore the inadequacy of syntactic and lexical evaluation metrics and the necessity of dependency-structured integration testing for meaningful progress. As theorem-proving models move towards deployment in serious mathematical and verification applications, adopting frameworks such as T2 will be essential for safe and trustworthy AI-assisted formalization.
