- The paper presents a 25.3% autoformalization success rate when translating math competition problems into Isabelle/HOL using large language models.
- It shows that autoformalized theorems improve a neural theorem prover’s performance from 29.6% to 35.2% on the MiniF2F benchmark.
- The study uses LLM-generated formal statements as additional training data for neural theorem provers, with broader implications for formal verification and program synthesis.
Autoformalization with Large Language Models
The paper "Autoformalization with Large Language Models" addresses the challenging task of translating natural-language mathematics into formal statements and proofs using LLMs. The authors highlight the potential impact of successful autoformalization on fields such as formal verification, program synthesis, and artificial intelligence.
Key Contributions
- Translation Success: The paper presents evidence that LLMs attain a noteworthy success rate of 25.3% when autoformalizing mathematical competition problems into Isabelle/HOL, an interactive theorem prover.
- Utility in Theorem Proving: The authors show that autoformalized theorems enhance the performance of a neural theorem prover, improving proof rates on the MiniF2F benchmark from 29.6% to a state-of-the-art 35.2%.
- Data Generation Methodology: The approach uses LLMs to generate formal statements from natural language, which then serve as additional training data for theorem provers. This method leverages the capabilities of LLMs trained on vast corpora to perform the translation in a few-shot setting.
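The few-shot setup above can be sketched as simple prompt construction: pair a handful of (informal, formal) examples with the new problem and let the model complete the formal statement. Everything below is illustrative; the example pairs, prompt wording, and function names are hypothetical, not taken from the paper.

```python
# Hypothetical few-shot prompt builder for autoformalization.
# The example pair and prompt format are illustrative assumptions,
# not the prompts used in the paper.
FEW_SHOT_EXAMPLES = [
    (
        "Prove that the sum of two even integers is even.",
        'theorem "even a ==> even b ==> even (a + b)"',
    ),
]

def build_autoformalization_prompt(problem: str) -> str:
    """Assemble a few-shot prompt pairing informal statements with Isabelle/HOL."""
    parts = []
    for informal, formal in FEW_SHOT_EXAMPLES:
        parts.append(f"Natural language: {informal}\nIsabelle: {formal}\n")
    # The trailing "Isabelle:" cue asks the model to complete the formal statement.
    parts.append(f"Natural language: {problem}\nIsabelle:")
    return "\n".join(parts)
```

The prompt string would then be sent to an LLM such as Codex, whose completion is taken as the candidate formal statement.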
Methodology
The research focuses on case studies in which LLMs formalize mathematical competition problems. For example, Codex, a GPT-based model, successfully translates problems into syntactically correct Isabelle code. The paper details a rigorous manual evaluation process to categorize failure cases and uses BLEU scores to quantify translation quality across different models and scales, with Codex outperforming other LLMs.
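For intuition, BLEU compares candidate and reference token n-grams with a brevity penalty. A minimal sentence-level implementation (with simple add-one smoothing, which is an assumption; the paper does not specify its smoothing scheme) might look like:

```python
from collections import Counter
import math

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Minimal sentence-level BLEU on whitespace tokens (illustrative sketch)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipped n-gram matches: Counter intersection takes per-ngram minima.
        overlap = sum((cand_ngrams & ref_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        # Add-one smoothing so one empty n-gram order does not zero the score.
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # Brevity penalty discourages overly short candidates.
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(sum(log_precisions) / max_n)
```

In practice a standard implementation (e.g. sacreBLEU) would be used rather than this sketch, but the score shape is the same: 1.0 for an exact match, decreasing as candidate and reference diverge.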
The authors employ expert iteration as a self-improvement paradigm, using successful proofs from autoformalizations to iteratively refine the theorem prover, Thor.
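The expert-iteration loop can be sketched as: attempt proofs of the autoformalized statements, retrain the prover on the proofs it finds, and repeat. The stubs below are toy stand-ins under stated assumptions; the actual pipeline (Thor driving Isabelle, with real finetuning) is far more involved.

```python
import random

# Toy stand-ins for the prover and the finetuning step; "skill" is a
# hypothetical scalar standing in for model quality, not anything in the paper.
def attempt_proof(prover, statement):
    """Stand-in for running the neural prover on one formal statement."""
    return random.random() < prover["skill"]

def finetune(prover, proofs):
    """Stand-in for finetuning: each batch of found proofs nudges skill up."""
    return {"skill": min(1.0, prover["skill"] + 0.05 * len(proofs))}

def expert_iteration(statements, rounds=3):
    """Alternate proof search and retraining on newly found proofs."""
    prover = {"skill": 0.3}
    for _ in range(rounds):
        found = [s for s in statements if attempt_proof(prover, s)]
        prover = finetune(prover, found)
    return prover
```

The key property the loop illustrates is the feedback effect: a slightly better prover finds more proofs, which yields more training data for the next round.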
Implications and Future Work
The research opens up promising directions for integrating LLMs into formal systems, suggesting that continuous training and retrieval-augmented models could handle larger theories. The potential for cycle-consistency-based training and back-translation to improve translations is noted.
Challenges include aligning informal and formal definitions, handling logical gaps in proofs, and coping with the wide variation in natural-language context.
Conclusion
This paper demonstrates the feasibility and utility of LLMs for autoformalization, establishing them as valuable tools not only for translating natural language into formal language but also for enhancing automated theorem-proving systems. The authors propose promising avenues of future work to address current limitations, aiming to create a feedback loop that could eventually rival human-level mathematical reasoning. This work lays a strong foundation for broader applications of LLMs in formal mathematics.