- The paper introduces PutnamBench, a benchmark with 640 formalized Putnam theorems that rigorously tests neural theorem-provers.
- It provides formalizations in three proof frameworks (Lean 4, Isabelle, and Coq), spanning diverse undergraduate mathematical topics.
- Experimental results reveal that current systems solve only a few problems, highlighting the need for advanced neurosymbolic methods.
PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition
Overview
This paper presents PutnamBench, a novel multilingual benchmark designed to evaluate the capabilities of neural theorem-provers on complex mathematical problems sourced from the William Lowell Putnam Mathematical Competition. The benchmark comprises 1697 formalizations of 640 Putnam theorems, implemented in three prominent theorem-proving frameworks: Lean 4, Isabelle, and Coq. By requiring proficiency in a broad spectrum of undergraduate-level mathematics, including algebra, analysis, and number theory among other fields, PutnamBench establishes a rigorous challenge for current and future research in automated theorem proving.
Motivation and Background
Automating mathematical reasoning has long been a significant objective in artificial intelligence research, and progress in this domain is typically gauged through benchmarks that test the problem-solving abilities of theorem-proving systems. Existing benchmarks such as MiniF2F and FIMO focus on high-school-level mathematics and target single formal languages. They also suffer from known weaknesses, such as problems that are easily dispatched by SMT solvers and narrow language support (e.g., FIMO's exclusive focus on Lean 3).
Given the rising prominence of LLMs in formal theorem-proving, there is also a critical need for benchmarks that preclude leakage between training and evaluation sets, so that measured performance reflects genuine reasoning ability. The paper responds to this need by introducing PutnamBench, a comprehensive and diversified benchmark.
Structure and Features of PutnamBench
Diversity and Breadth:
PutnamBench's formalizations span a wide range of mathematical topics including:
- Analysis: Limits, integrals, and derivatives.
- Linear Algebra: Matrices and determinants.
- Abstract Algebra: Rings, groups, and fields.
- Geometry: Euclidean problems.
- Number Theory: Primes and divisors.
- Combinatorics: Counting arguments and discrete structures.
This diversity ensures the benchmark comprehensively covers undergraduate-level mathematics, going beyond the high-school-level difficulty of prior benchmarks.
Multilinguality:
Unlike existing benchmarks, PutnamBench includes formalizations in three formal proof languages:
- Lean 4
- Isabelle
- Coq
This multilingual approach makes PutnamBench useful across the different communities within automated theorem proving, promoting cross-pollination of ideas and methodologies.
Factored Solutions:
Approximately 60% of Putnam problems require not just a formal proof but also an explicit numerical or functional solution. PutnamBench handles this by factoring the solution out of the formal theorem statement, so a theorem-prover must synthesize the solution before proving it correct (see the sketch below).
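As an illustration, here is a minimal sketch of this convention in Lean 4; the problem and all names are hypothetical, not taken from the benchmark.

```lean
import Mathlib

-- Hypothetical example of the factored-solution style: the answer lives in
-- its own definition instead of being inlined in the theorem statement.
-- In the released benchmark such a definition holds the ground-truth answer;
-- a prover under evaluation must first synthesize a value for it.
abbrev putnam_example_solution : ℕ := sorry

-- The theorem asserts that the factored-out value is correct; here, that it
-- is the least natural number that is both at least 2 and even.
theorem putnam_example :
    IsLeast {n : ℕ | 2 ≤ n ∧ Even n} putnam_example_solution := by
  sorry
```

Stating the theorem against the definition rather than a literal value means a proof can only close when the synthesized answer is actually correct, so answer-finding and proving are evaluated together.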
Manual Construction and Verification:
The formalizations were manually constructed and verified for correctness. This meticulous approach ensures high-quality and reliable benchmarks, maintaining the integrity of evaluations conducted using PutnamBench.
Experimental Evaluation
The research evaluates several state-of-the-art theorem-proving approaches on PutnamBench, across all three languages. The tested systems include:
- Lean 4: GPT-4, COPRA, ReProver.
- Isabelle: GPT-4, Draft-Sketch-Prove (DSP) pipeline, Sledgehammer.
- Coq: GPT-4, COPRA, Tactician, CoqHammer.
Results and Analysis
The empirical evaluations reveal that current methods, whether neural, symbolic, or neurosymbolic, struggle significantly with PutnamBench:
- Lean 4: GPT-4 and COPRA each solved only a single problem.
- Isabelle: DSP and Sledgehammer collectively solved a handful of problems, showcasing the importance of symbolic automation.
- Coq: Similar to Lean 4, few problems were solved using neural approaches, and none using symbolic tools alone.
The implications are clear: while state-of-the-art methods can handle straightforward problems and certain structured problem types (e.g., binary operations on sets; one such problem is sketched below), they fall short in the face of complex, competition-level mathematics. This highlights a crucial research gap and an opportunity for further breakthroughs in automated reasoning systems.
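To make the "structured problem type" concrete, below is a Lean 4 sketch of a classic Putnam problem about an abstract binary operation (in the spirit of Putnam 2001 A1: if (a*b)*a = b for all a and b, then a*(b*a) = b). This formalization and proof are our own illustrative rendering, not one drawn from the benchmark.

```lean
-- Illustrative rendering of a Putnam-style binary-operation problem.
-- The whole proof is a short equational argument, which is one reason
-- current provers fare better on this problem shape.
theorem binop_example (S : Type) (op : S → S → S)
    (h : ∀ a b, op (op a b) a = b) :
    ∀ a b, op a (op b a) = b := by
  intro a b
  -- Instantiating the hypothesis at (b, a) gives a = (b*a)*b.
  have ha : a = op (op b a) b := (h b a).symm
  -- Hence a*(b*a) = ((b*a)*b)*(b*a), which the hypothesis collapses to b.
  calc op a (op b a) = op (op (op b a) b) (op b a) := by rw [← ha]
    _ = b := h (op b a) b
```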
Conclusion and Future Directions
PutnamBench constitutes an invaluable resource for the theorem-proving community, offering a robust, diverse, and challenging benchmark. The poor performance of current methods underscores the need for innovative algorithms capable of synthesizing auxiliary lemmas and effectively leveraging vast mathematical repositories. Future research could focus on hybrid systems that integrate symbolic reasoning with deep learning, on improving the understanding and generalization capabilities of LLMs, and on exploiting the multilingual nature of formal proofs.
PutnamBench sets a high bar, pushing the boundaries of what is achievable in the automation of mathematical insight, and driving forward the continual evolution of AI in formal theorem proving.