InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems (2410.15700v1)

Published 21 Oct 2024 in cs.AI and cs.CL

Abstract: LLMs have emerged as powerful tools in mathematical theorem proving, particularly when utilizing formal languages such as LEAN. The major learning paradigm is expert iteration, which necessitates a pre-defined dataset comprising numerous mathematical problems. In this process, LLMs attempt to prove problems within the dataset and iteratively refine their capabilities through self-training on the proofs they discover. We propose to use large scale LEAN problem datasets Lean-workbook for expert iteration with more than 20,000 CPU days. During expert iteration, we found log-linear trends between solved problem amount with proof length and CPU usage. We train a critic model to select relatively easy problems for policy models to make trials and guide the model to search for deeper proofs. InternLM2.5-StepProver achieves open-source state-of-the-art on MiniF2F, Lean-Workbook-Plus, ProofNet, and Putnam benchmarks. Specifically, it achieves a pass of 65.9% on the MiniF2F-test and proves (or disproves) 17.0% of problems in Lean-Workbook-Plus which shows a significant improvement compared to only 9.5% of problems proved when Lean-Workbook-Plus was released. We open-source our models and searched proofs at https://github.com/InternLM/InternLM-Math and https://huggingface.co/datasets/internlm/Lean-Workbook.

Citations (1)

Summary

  • The paper introduces an expert iteration framework that leverages LLMs to iteratively refine proofs using the expansive Lean-Workbook-Plus dataset.
  • The paper demonstrates state-of-the-art performance with a 65.9% pass rate on MiniF2F and effective resource allocation via a critic model.
  • The paper highlights a log-linear relationship between CPU usage and solved proofs, offering new avenues for optimizing automated theorem proving techniques.

Analyzing InternLM2.5-StepProver: Advancing Automated Theorem Proving

The paper "InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems" explores improvements in automated theorem proving (ATP) by employing LLMs in the context of formal languages such as LEAN. The paper primarily emphasizes the application of an expert iteration framework using a large dataset, identified as Lean-Workbook-Plus, for enhancing the proving capabilities of these models.

Context and Methodology

Automated theorem proving is a significant challenge in artificial intelligence, demanding sophisticated reasoning and mathematical understanding. This research stands on the foundation of previous efforts, such as AlphaProof, which reached advanced levels in solving complex mathematical problems using LEAN.

The core methodological approach is expert iteration, in which LLMs iteratively refine their proving ability by retraining on the proofs they discover during search. The researchers ran this process over the large-scale Lean-Workbook-Plus dataset, spending more than 20,000 CPU days of proof search to derive insights about effective proving strategies.
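At the level of control flow, one round of expert iteration amounts to: sample candidate proofs with the current policy, keep only those the LEAN checker accepts, and fine-tune the policy on the surviving proofs. The Python sketch below illustrates this loop under stated assumptions; `sample_proof`, `lean_verify`, and `finetune` are hypothetical placeholders for the policy model's proof sampler, the LEAN verifier, and the self-training step, not the authors' actual interfaces.

```python
def expert_iteration(problems, policy, sample_proof, lean_verify, finetune,
                     rounds=3, attempts_per_problem=32):
    """One possible shape of the expert-iteration loop (a sketch, not the
    paper's implementation).

    sample_proof(policy, problem), lean_verify(problem, proof), and
    finetune(policy, new_proofs) are assumed interfaces standing in for the
    policy model's proof sampler, the LEAN proof checker, and the
    self-training step.
    """
    solved = {}  # problem -> first machine-checked proof found
    for _ in range(rounds):
        new_proofs = []
        for problem in problems:
            if problem in solved:
                continue  # already proved in an earlier round
            for _ in range(attempts_per_problem):
                proof = sample_proof(policy, problem)
                if lean_verify(problem, proof):  # keep only verified proofs
                    solved[problem] = proof
                    new_proofs.append((problem, proof))
                    break
        policy = finetune(policy, new_proofs)  # self-train on new discoveries
    return policy, solved
```

In practice the inner proof search is the part that consumed the 20,000+ CPU days reported in the paper and is distributed across many workers; the sketch only shows the control flow.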

Key Findings and Contributions

  1. Performance Metrics: The work achieves state-of-the-art performance among open-source systems across several benchmarks, including MiniF2F, Lean-Workbook-Plus, ProofNet, and Putnam. Notably, it achieves a 65.9% pass rate on MiniF2F-test and proves (or disproves) 17.0% of problems in Lean-Workbook-Plus, up from the 9.5% proved when that dataset was released.
  2. Search Strategy and Resource Allocation: The integration of a critic model, which prioritizes problems by estimated ease and expected proof depth, improves the efficiency of resource use. The researchers also report a log-linear relationship between the number of solved problems and both proof length and CPU usage, pointing to strategic avenues for future research and resource allocation (a minimal curve-fitting sketch follows this list).
  3. Critic and Policy Model Interaction: Using a critic model to guide policy models has been shown to be effective at discovering deeper proofs than naive best-first search. This interaction marks a substantial advance in searching long proof paths within complex problem spaces.
  4. Data Distribution Insights: A detailed analysis of CPU consumption and proof length gives insight into the distribution of problem difficulty, indicating that the majority of compute is spent on problems that ultimately remain unsolved.
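The log-linear trend in item 2 says, roughly, that the number of solved problems grows linearly in the logarithm of compute spent. A minimal sketch of how one could fit and extrapolate such a trend from observed (CPU-days, problems-solved) pairs is given below; the observations themselves would come from the paper's search logs, which are not reproduced here, and the function names are illustrative.

```python
import numpy as np

def fit_log_linear(cpu_days, solved_counts):
    """Fit solved ~= a + b * log(cpu_days) by least squares; return (a, b)."""
    x = np.log(np.asarray(cpu_days, dtype=float))
    y = np.asarray(solved_counts, dtype=float)
    b, a = np.polyfit(x, y, deg=1)  # np.polyfit returns slope first for deg=1
    return a, b

def predict_solved(a, b, cpu_days):
    """Extrapolate the fitted trend to a new compute budget."""
    return a + b * np.log(cpu_days)
```

If the trend holds, each doubling of compute buys roughly b * log(2) additional solved problems, which is what makes the relationship useful for resource-allocation decisions.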

Practical and Theoretical Implications

The successful application of expert iteration on a large-scale dataset not only pushes the boundaries of what current ATP systems can achieve but also offers a strategic framework for future developments. By integrating critic models, the research provides a pathway to optimize search strategies further, which could lead to models capable of handling even larger and more complex problem sets.
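As a rough illustration of how a critic can steer proof search, the sketch below runs a best-first search over proof states whose frontier is ordered by a critic's value estimate rather than by the policy's log-probabilities alone. `policy_expand`, `critic_score`, and `is_proved` are assumed interfaces, not the released InternLM2.5-StepProver API.

```python
import heapq
import itertools

def critic_guided_search(initial_state, policy_expand, critic_score,
                         is_proved, max_expansions=600):
    """Best-first proof search ordered by a learned value estimate (a sketch).

    policy_expand(state) yields successor proof states (e.g. states reached by
    tactics the policy proposes), critic_score(state) returns a higher-is-better
    estimate of provability, and is_proved(state) checks that no goals remain.
    """
    tie_break = itertools.count()  # prevents heapq from comparing states directly
    frontier = [(-critic_score(initial_state), next(tie_break), initial_state)]
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, state = heapq.heappop(frontier)
        if is_proved(state):
            return state  # closed proof state; its tactic trace is the proof
        expansions += 1
        for child in policy_expand(state):
            # Scores are negated because heapq pops the smallest element first.
            heapq.heappush(frontier, (-critic_score(child), next(tie_break), child))
    return None  # budget exhausted without closing the goal
```

Ordering the frontier by a learned value lets the search commit to promising deep branches, which is consistent with the paper's observation that critic guidance finds longer proofs than naive best-first search.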

Future Directions

The research opens multiple avenues for future advancements:

  • Scaling Up: As indicated by the log-linear scaling behavior, increasing compute, enlarging the dataset, or refining search algorithms promises further improvements.
  • Generalization and Versatility: Exploring applications beyond the current benchmarks to other theorem-proving domains could extend the model's utility.
  • Critic Model Evaluation: Developing reliable evaluation metrics for critic models could enhance their integration and optimization within ATP systems.

In conclusion, InternLM2.5-StepProver marks a significant step in leveraging advanced computational models within the domain of automated theorem proving, offering a robust methodology and compelling results that resonate with the ongoing evolution in AI-driven mathematical reasoning.
