
PutnamBench: Theorem Proving Benchmark

Updated 4 July 2025
  • PutnamBench is a multilingual benchmarking suite that meticulously formalizes 640 challenging Putnam Competition problems using Lean, Isabelle, and Coq.
  • It employs factored formalizations to separately address answer synthesis and proof justification, enabling precise evaluation of theorem proving systems.
  • Benchmark results show very low pass@n rates across both neural and symbolic methods, underlining its role as a rigorous challenge for advanced research.

PutnamBench is a multilingual, large-scale benchmarking suite developed for the rigorous assessment of automated and neural theorem provers on formalized competition mathematics, specifically problems from the William Lowell Putnam Mathematical Competition. Recognized as one of the premier undergraduate mathematics contests in North America, the Putnam Competition features some of the most challenging problems accessible at the collegiate level. PutnamBench provides hand-verified, cross-system formalizations of these problems and defines new standards for the evaluation of AI-driven mathematical reasoning systems.

1. Definition and Construction

PutnamBench comprises 640 unique Putnam contest problems, each meticulously formalized in both Lean 4 and Isabelle, with a substantial subset (417 problems) additionally formalized in Coq. In total, 1697 formalizations have been completed, each constructed and validated by multiple contributors to ensure correctness and alignment with the original informal text.

Every problem entry in PutnamBench contains:

  • The original English-language statement,
  • Its formalization in Lean 4 and Isabelle (and where available, Coq),
  • Custom, factored solutions: for problems requiring both a direct answer and proof, the solution is defined explicitly (e.g., as a set or function) so that answer-synthesis and proof tasks can be separated,
  • Consistent translation and hypothesis framing across proof assistant platforms, maintaining coherent mathematical meaning within the differing foundational libraries.

The choice of Lean 4, Isabelle, and Coq enables comparison across the major families of interactive theorem provers and increases relevance for diverse formal methods communities. Multilingual formalization supports portability and cross-benchmarking.
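
To make the factored structure concrete, the following is a minimal, hypothetical Lean 4 sketch in the same style; it is a toy statement written for illustration, not an actual PutnamBench entry, and the name toy_solution is ours:

-- Toy illustration of a factored entry (hypothetical, not from the benchmark).
-- The answer is defined separately from the theorem that justifies it, so a
-- system can be asked either to synthesize `toy_solution` itself or, given the
-- reference answer, to prove the accompanying statement.
abbrev toy_solution : Nat := 1024

theorem toy_problem : 2 ^ 10 = toy_solution := by
  decide

In the benchmark's real entries the statements are far more involved and the proof obligation is left as sorry for the solver to discharge, as in the example shown in Section 4.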

2. Mathematical and Topical Scope

PutnamBench is unique in its breadth and collegiate depth, offering topical coverage far beyond prior benchmarks such as MiniF2F and FIMO, which consist predominantly of high-school-level problems:

Area                Number of Problems
Algebra             253
Analysis            226
Number Theory       107
Geometry            68
Linear Algebra      51
Abstract Algebra    28
Combinatorics       26
Probability         9
Set Theory          8

(Problems may be tagged with more than one area, so the counts exceed 640 in total.)

Problems often require cross-domain connections and sophisticated argument synthesis, incorporating advanced undergraduate concepts such as limits, matrix groups, field theory, and combinatorial constructions. Many challenge solvers to bridge gaps in standard libraries, operationalize definitions, or construct strategies that are not simple compositions of built-in lemmas.

3. Evaluation Methodology

PutnamBench is designed to be an open challenge for both symbolic and neural theorem proving systems. Benchmarking is conducted using the "pass@n" metric—the number of problems successfully solved in n proof attempts per problem.
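
Concretely (notation ours, not taken from the benchmark's documentation): writing P for the set of benchmark problems and A_1(p), …, A_n(p) for the n independent proof attempts generated for a problem p,

pass@n = | { p ∈ P : A_i(p) is accepted by the proof assistant for some i ≤ n } |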

Among the reference approaches tested are:

  • LLMs via in-context prompting (e.g., GPT-4);
  • Agentic proof-search methods such as COPRA and ReProver;
  • Hybrid strategies, notably the Draft-Sketch-Prove (DSP) approach (LLM-generated plan, symbolic proof search in Isabelle);
  • Native symbolic backends (e.g., Sledgehammer for Isabelle, and CoqHammer and Tactician for Coq).

Reported outcomes indicate a very low state-of-the-art baseline: no system solves more than a handful of problems (e.g., DSP achieves 4 of 640 in Isabelle, and LLM-based approaches typically solve 0–1 of 640), and every method fails on the overwhelming majority of the suite. Successes to date are limited to the structurally simplest problems (e.g., basic algebraic manipulations or set operations). The following table summarizes selected results from the inaugural release:

Method          Lean 4 (Solved/640)   Isabelle (Solved/640)   Coq (Solved/417)
GPT-4           1                     1                       1
COPRA           1                     n/a                     1
DSP             n/a                   4                       n/a
Sledgehammer    n/a                   3                       n/a
Tactician       n/a                   n/a                     0
CoqHammer       n/a                   n/a                     0

(n/a: the method was not evaluated with that proof assistant.)

These results demonstrate that PutnamBench is an exceptionally difficult open challenge and that it is not currently susceptible to being solved through memorization or pretraining contamination.

4. Design Rationale and Technical Structure

PutnamBench's formalizations are intentionally factored to separate the answer-finding and the proof justification tasks, enabling researchers to evaluate both abilities independently or together. For example:

abbrev solution : Set (ℝ → ℝ) :=
  { fun (x : ℝ) => x + n | n : ℤ } ∪
  { fun (x : ℝ) => -x + n | n : ℤ }
theorem putnam_2008_b5
  (fqsat : (ℝ → ℝ) → ℚ → Prop := ...)
  (fsat : (ℝ → ℝ) → Prop := ...) :
  ∀ f : (ℝ → ℝ), fsat f ↔ f ∈ solution := sorry
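
This factoring yields two machine-checkable tasks from one entry: a proof task, in which the reference solution is supplied and only the sorry must be discharged, and an answer-synthesis task, in which the system must also produce the set itself. The following self-contained Lean 4 sketch (our illustration, with a toy predicate sat standing in for the real problem constraints; not PutnamBench's actual harness) records the logical relationship between the two readings:

import Mathlib

-- Hypothetical sketch: `sat` stands in for the real constraints and `refAnswer`
-- for the reference solution set. A solution to the proof task (the supplied
-- iff) immediately witnesses the synthesis task (the existential), but finding
-- the witness without the reference answer is the harder, factored-out problem.
example (sat : (ℝ → ℝ) → Prop) (refAnswer : Set (ℝ → ℝ))
    (h : ∀ f, sat f ↔ f ∈ refAnswer) :
    ∃ S : Set (ℝ → ℝ), ∀ f, sat f ↔ f ∈ S :=
  ⟨refAnswer, h⟩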

For real analysis, formalizations correctly leverage integral operators and higher-level mathematical constructs. Example (Coq):

Theorem putnam_1980_a5
  (n : nat) (npos : gt n 0) (coeff : nat -> R) (hcoeff : coeff n <> 0)
  (p : R -> R := fun x => sum_n (fun i => coeff i * x ^ i) (S n))
  (h1 : nat -> Prop := fun a => RInt (fun x => p x * sin x) 0 (INR a) = 0)
  (h2 : nat -> Prop := fun a => RInt (fun x => p x * cos x) 0 (INR a) = 0) :
    exists (m: nat), forall (b: nat), h1 b /\ h2 b -> lt b m.

Formalization style is harmonized across languages to the extent libraries permit, with consistent naming conventions, modular definition of hypotheses and goals, and use of contemporary proof assistant idioms.

Each formalization is verified by at least two contributors and may require up to 25 minutes per problem per language—a reflection of both mathematical difficulty and the rigor of the formalization process.

5. Research Significance and Opportunities

PutnamBench marks a substantial advancement over prior formal mathematics benchmarks. Its characteristics include:

  • Significantly increased problem difficulty, stymying both symbolic and neural approaches currently in the literature. Pass@n rates are typically below 1% across the benchmark.
  • Absence of problems easily "leaked" or memorized due to their inclusion in training sets, unlike smaller or older benchmarks.
  • Natural cross-compatibility for multi-system analysis through its trilingual foundation, aiding research in cross-lingual theorem proving and system-agnostic evaluation.
  • Explicit encouragement for future research to advance in areas such as synthesis-driven provers (that create new intermediate statements), retrieval and abstraction learning, and robust translation across proof assistant libraries.

By including answers for factored problems, PutnamBench supports precise measurement not only of proof-search ability but also of answer synthesis and discovery, a crucial distinction for advanced mathematical reasoning systems.

6. Accessibility, Community Involvement, and Leaderboard

PutnamBench and all associated resources are open-sourced at https://github.com/TrishulLab/PutnamBench, with a public leaderboard available at https://trishullab.github.io/PutnamBench/. The repository contains all problem formalizations, standard licensing, and clear documentation for contribution.

The benchmark’s community-oriented approach invites expansion: researchers are encouraged to develop, evaluate, and submit results with new systems, reinforcing the resource's role as a focal point where formal methods and AI-for-mathematics research converge.

7. Implications for Automated Theorem Proving

PutnamBench establishes a durable, high-difficulty target for the field of AI formal mathematics, offering:

  • A defensible standard for substantive progress in neural-symbolic theorem proving;
  • An environment that strictly requires creativity, abstraction, and the synthesis of intermediate mathematical results, not just incremental search or pattern matching;
  • A bridge for work spanning LLMs, symbolic reasoning, and formal verification.

As existing datasets become saturated or rendered less informative by model pretraining, PutnamBench stands as a challenging and methodologically sound benchmark, guiding research strategies towards robust, genuinely capable AI for mathematics. Its adoption is expected to inform both algorithmic directions and the evolution of evaluation strategies for years to come.