
Neural Theorem Proving: Generating and Structuring Proofs for Formal Verification (2504.17017v1)

Published 23 Apr 2025 in cs.AI, cs.FL, cs.LG, and cs.LO

Abstract: Formally verifying properties of software code has been a highly desirable task, especially with the emergence of LLM-generated code. In the same vein, LLMs provide an interesting avenue for the exploration of formal verification and mechanistic interpretability. Since the introduction of code-specific models, despite their successes in generating code in Lean4 and Isabelle, the task of generalized theorem proving remains far from being fully solved and will be a benchmark for reasoning capability in LLMs. In this work, we introduce a framework that generates whole proofs in a formal language to be used within systems that utilize the power of built-in tactics and off-the-shelf automated theorem provers. Our framework includes 3 components: generating natural language statements of the code to be verified, an LLM that generates formal proofs for the given statement, and a module employing heuristics for building the final proof. To train the LLM, we employ a 2-stage fine-tuning process, where we first use SFT-based training to enable the model to generate syntactically correct Isabelle code and then RL-based training that encourages the model to generate proofs verified by a theorem prover. We validate our framework using the miniF2F-test benchmark and the Isabelle proof assistant and design a use case to verify the correctness of the AWS S3 bucket access policy code. We also curate a dataset based on the FVEL_ER dataset for future training tasks.

Summary

Neural Theorem Proving: Generating and Structuring Proofs for Formal Verification

The accelerating emergence of LLM-generated code presents new challenges and opportunities for formal verification and mechanistic interpretability. While LLMs have achieved remarkable success in generating code in languages such as Lean4 and Isabelle, successfully addressing formal verification of complex software systems requires a robust approach to theorem proving that transcends code generation alone. This paper introduces a comprehensive framework aimed at generating entire proofs in formal languages that can be seamlessly integrated with built-in tactics and off-the-shelf automated theorem provers. It leverages the capabilities of LLMs for formal verification tasks, forming a bridge between linguistic interpretation and symbolic reasoning.

Framework and Methodology

The proposed framework comprises three principal components: generation of natural language statements describing the code to be verified, an LLM that generates formal proofs for those statements, and a heuristic-based module that assembles the final proof. The LLM is trained with a two-stage fine-tuning process: SFT-based training first teaches the model to produce syntactically correct Isabelle code, and RL-based training then rewards proofs that the theorem prover actually verifies, improving semantic validity beyond mere syntactic correctness.
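The paper does not publish its implementation, but a minimal sketch helps fix ideas. In the Python sketch below, the helper names (generate_statement, generate_proof, run_isabelle, rl_reward) and the reward shaping are our assumptions, not the paper's API; the structure mirrors the three components and the stage-two training signal described above:

```python
# Minimal sketch of the three-component pipeline and the RL-stage reward.
# All helper names are hypothetical; the paper does not publish this interface.
from dataclasses import dataclass

@dataclass
class ProofAttempt:
    statement: str   # natural-language statement of the property
    proof: str       # candidate Isabelle proof text
    verified: bool   # did the theorem prover accept it?

def generate_statement(source_code: str) -> str:
    """Component 1: describe the property to verify in natural language.
    Stubbed here; the paper prompts an LLM for this step."""
    return f"The function in the following code is correct: {source_code!r}"

def generate_proof(statement: str) -> str:
    """Component 2: an LLM (ProofSeek in the paper) drafts a whole
    Isabelle proof for the statement. Stubbed with a fixed tactic."""
    return "by auto"

def run_isabelle(statement: str, proof: str) -> bool:
    """Placeholder: a real system would invoke the Isabelle prover
    (e.g. via a subprocess) and parse its verdict."""
    return False

def assemble_and_check(statement: str, draft: str) -> ProofAttempt:
    """Component 3: heuristics assemble the final proof and submit it
    to the prover for verification."""
    return ProofAttempt(statement, draft, run_isabelle(statement, draft))

def rl_reward(attempt: ProofAttempt) -> float:
    """Stage-2 reward in the spirit of the paper: prover-verified proofs
    score highest. The partial credit for unverified-but-nonempty output
    is our assumption, not the paper's exact shaping."""
    if attempt.verified:
        return 1.0
    return 0.1 if attempt.proof.strip() else 0.0

if __name__ == "__main__":
    stmt = generate_statement("def f(n): return n + 1")
    attempt = assemble_and_check(stmt, generate_proof(stmt))
    print(rl_reward(attempt))
```

Gating the full reward on prover verification is what distinguishes the RL stage from SFT: the training signal comes from the proof assistant's verdict rather than from reference proofs.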

Numerical Results and Use Case

On the miniF2F-test benchmark, the fine-tuned model, ProofSeek, demonstrated a 3% improvement in proof success rate over its DeepSeek base model, along with a 20% reduction in execution time. A practical application verifies the correctness of AWS S3 bucket access policy code, demonstrating the framework's utility beyond traditional mathematical proofs.
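The paper does not detail how policies are encoded for the prover. As an illustration only, the following sketch shows one way a bucket policy could be rendered as a natural-language statement for the pipeline's first component; the policy JSON follows AWS's real schema, but policy_to_statement and its wording are our assumptions:

```python
import json

# Illustrative only: turning an S3 bucket policy into a natural-language
# verification statement. The paper's actual policy encoding is not published.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-bucket/*",
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}

def policy_to_statement(policy: dict) -> str:
    """Render each policy clause as a sentence the LLM can then formalize."""
    clauses = []
    for s in policy["Statement"]:
        clauses.append(
            f"{s['Effect']} {s['Action']} on {s['Resource']} "
            f"for principal {s['Principal']}"
            + (f" when {json.dumps(s['Condition'])}" if "Condition" in s else "")
        )
    return "The policy enforces: " + "; ".join(clauses) + "."

print(policy_to_statement(policy))
```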

Implications and Future Directions

The implications of this research are both practical and theoretical. Practically, it provides a novel mechanism for verifying and validating code, policies, and system statements with potential applications in cybersecurity, regulatory compliance, and automated quality assurance. Theoretically, it advances the capabilities of neural theorem proving in environments requiring intricate interaction between natural language and formal mathematical reasoning.

Looking ahead, improving the reliability of LLM-generated proofs across different proof systems could further strengthen automated theorem proving. Future work could integrate additional symbolic systems, such as knowledge graphs, into proof construction, fostering a more consistent and reliable proving process. Enhanced training and feedback mechanisms, potentially including critique-based and reinforcement strategies, may yield further gains in LLM reasoning for formal verification.

This research positions LLMs as viable agents in formal verification workflows, offering a promising avenue for deeper integration of machine learning within critical software development processes and broadening the scope of automated reasoning technologies.