IMProofBench: Math Proof Benchmark

Updated 1 October 2025
  • IMProofBench is a benchmark designed to assess LLMs' ability to generate detailed, research-level mathematical proofs across diverse domains.
  • It uses a hybrid evaluation framework that combines automated final-answer scoring with expert human grading of full proofs for rigorous analysis.
  • The setup features interactive, multi-turn reasoning with advanced computational tools, simulating authentic research workflows.

IMProofBench is a research-level benchmark designed to evaluate the capabilities of LLMs in generating and reasoning about mathematical proofs. Unlike existing evaluations focused on final answers or competition-level problems, IMProofBench tests deep mathematical reasoning, mirroring authentic research workflows. The benchmark comprises rigorously peer-reviewed problems, diverse mathematical domains, and a hybrid evaluation setup incorporating both automated grading and expert human assessment.

1. Benchmark Architecture and Composition

IMProofBench consists of 39 peer-reviewed mathematical problems developed by expert mathematicians across algebraic geometry, combinatorics, graph theory, stochastic analysis, and bioinformatics, with continual expansion via a problem creation pipeline. Each problem is designed for rigorous proof generation, distinguishing IMProofBench from prior datasets centered on final-answer questions or high-school competition formats (Schmitt et al., 30 Sep 2025).

Problems in IMProofBench are paired with subproblems requesting unique, automatically gradable final answers (e.g., a numerical quantity such as N_3). This hybrid design facilitates large-scale quantitative analysis while enabling the assessment of detailed proof-writing abilities through complementary human evaluation.
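Because each subproblem has a unique expected value, automated grading can be as simple as comparing the model's reported final answer against the ground truth. The sketch below is purely illustrative; the function name, answer format, and regex are assumptions for the example, not the benchmark's actual grading script.

```python
import re

def grade_final_answer(model_output: str, expected: str) -> bool:
    """Hypothetical final-answer check: extract the last integer or fraction
    in the model's output and compare it with the expected value."""
    candidates = re.findall(r"-?\d+(?:/\d+)?", model_output)
    return bool(candidates) and candidates[-1] == expected.strip()

# The value "42" is a placeholder, not an actual IMProofBench answer.
print(grade_final_answer("Hence the count is 42.", "42"))   # True
print(grade_final_answer("The count is likely 41.", "42"))  # False
```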

2. Evaluation Framework and Computational Environment

The IMProofBench evaluation framework simulates a realistic mathematical research environment. LLMs operate in a multi-turn, agentic manner under the Inspect framework, emulating workflows of practicing mathematicians. Models are granted access to essential computational and research tools:

  • Python for scientific computing (NumPy, SciPy, SymPy).
  • Bash shell with access to computer algebra systems (GAP, Maxima, PARI/GP, Singular).
  • SageMath for advanced mathematical computations.
  • Web search for literature review and discovery.
  • Multi-turn interaction supporting iterative reasoning and tool use.

This setup supports extended, context-rich reasoning, including intermediate computations and literature consultation, akin to drafting a mathematical manuscript.
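As a rough illustration of what such an agentic setup can look like, the sketch below defines a proof task with the open-source Inspect (inspect_ai) framework, granting Python and bash tool access. The dataset file, system prompt, timeouts, and scorer are assumptions made for the example, not IMProofBench's actual configuration.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import generate, system_message, use_tools
from inspect_ai.tool import bash, python

@task
def research_proof_task():
    """Illustrative agentic proof task: multi-turn generation with tool use."""
    return Task(
        # Hypothetical dataset; each sample carries a problem statement
        # ("input") and the expected final answer ("target").
        dataset=json_dataset("problems.jsonl"),
        solver=[
            system_message(
                "You are a research mathematician. Produce a complete, rigorous "
                "proof, using the available tools for intermediate computations."
            ),
            # Tool calls (e.g., SymPy via python(), GAP or SageMath via bash())
            # are interleaved with reasoning over multiple turns.
            use_tools(python(timeout=300), bash(timeout=300)),
            generate(),
        ],
        # Automated scoring covers only the final-answer subproblem; full
        # proofs still require expert human grading.
        scorer=match(),
    )
```

A task defined this way is typically run against a chosen model from the Inspect command line, with further tools such as web search exposed in the same manner.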

3. Scoring Criteria, Metrics, and Quantitative Outcomes

The benchmark distinguishes between two primary modes of assessment: automated scoring on final-answer subproblems and human evaluation of proof generation.

  • Proof-Based Evaluation: Human experts evaluate the correctness and mathematical rigor of full proofs. For example, GPT-5 achieved a correct and complete solution in 22% of benchmark problems. Models are assessed for logical soundness, conceptual accuracy, and proper structuring of arguments.
  • Final-Answer Evaluation: Automated scripts assess subproblem answers; Grok-4 attained the highest quantitative accuracy at 52%, outperforming GPT-5 (42%). Machine-scored subquestion results correlate with human-graded progress scores at a coefficient of 0.45, indicating that final-answer performance is a useful but imperfect proxy for proof-writing ability (illustrated below).
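To make the reported relationship concrete, the snippet below shows how such a correlation can be computed; the per-problem scores are invented for illustration and do not come from the benchmark's grading data.

```python
import numpy as np

# Hypothetical per-problem scores: automated final-answer correctness (0/1)
# and human-graded proof progress (fraction of a complete argument).
auto_scores = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0], dtype=float)
human_progress = np.array([0.9, 0.3, 0.6, 1.0, 0.1, 0.4, 0.5, 0.2, 0.8, 0.3])

# IMProofBench reports a coefficient of roughly 0.45 on its real grading data.
r = np.corrcoef(auto_scores, human_progress)[0, 1]
print(f"Pearson correlation: {r:.2f}")
```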

A representative problem asks for a closed formula for N_g, the number of isomorphism classes of stable graphs of genus g with three edges, where the answer is expressed as a piecewise function of g:

N_g = \begin{cases}
\frac{1}{9}g^3 + \frac{7}{8}g^2 + \frac{5}{12}g - 2 & \text{if } g \equiv 0 \pmod{6} \\
\frac{1}{9}g^3 + \frac{7}{8}g^2 + \frac{1}{6}g - \frac{155}{72} & \text{if } g \equiv 1 \pmod{6} \\
\frac{1}{9}g^3 + \frac{7}{8}g^2 + \frac{5}{12}g - \frac{20}{9} & \text{if } g \equiv 2 \pmod{6} \\
\frac{1}{9}g^3 + \frac{7}{8}g^2 + \frac{1}{6}g - \frac{19}{8} & \text{if } g \equiv 3 \pmod{6} \\
\frac{1}{9}g^3 + \frac{7}{8}g^2 + \frac{5}{12}g - \frac{16}{9} & \text{if } g \equiv 4 \pmod{6} \\
\frac{1}{9}g^3 + \frac{7}{8}g^2 + \frac{1}{6}g - \frac{187}{72} & \text{if } g \equiv 5 \pmod{6}
\end{cases}
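To make the case analysis concrete, the following sketch simply transcribes the formula above into Python with exact rational arithmetic; it is an illustration, not material from the benchmark itself.

```python
from fractions import Fraction

def n_g(g: int) -> Fraction:
    """Evaluate the piecewise closed formula for N_g exactly."""
    cubic = Fraction(1, 9) * g**3 + Fraction(7, 8) * g**2
    # The linear and constant terms depend on the residue of g modulo 6.
    tail = {
        0: Fraction(5, 12) * g - 2,
        1: Fraction(1, 6) * g - Fraction(155, 72),
        2: Fraction(5, 12) * g - Fraction(20, 9),
        3: Fraction(1, 6) * g - Fraction(19, 8),
        4: Fraction(5, 12) * g - Fraction(16, 9),
        5: Fraction(1, 6) * g - Fraction(187, 72),
    }[g % 6]
    return cubic + tail

print(n_g(6))  # Fraction(56, 1) under the formula as transcribed above
```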

4. Interactive and Agentic Model Evaluation

IMProofBench leverages an agentic evaluation paradigm, where models interactively use scientific tools and search engines to support their reasoning. This design aims to approximate genuine research processes, including literature review, iterative computation, and multi-step deduction.

The framework separates intermediate reasoning from final answers, mirroring the methodology of human mathematicians drafting and revising research output. This multi-stage process promotes the assessment of reasoning as a dynamic activity rather than a static prediction.

5. Peer Review, Community Production, and Benchmark Dynamics

All problems in IMProofBench are authored and peer-reviewed by professional mathematicians, with over 23 contributors involved. Workshops, personal outreach, and academic networking fuel the collaborative production process. Contributors benefit from privileged access to frontier LLMs and opportunities for co-authorship.

The benchmark is explicitly designed for continuous expansion and active curation:

  • The problem set is targeted to increase to 150–300 problems, maintaining difficulty and relevance.
  • Problems may be retired or adjusted in response to new research, ensuring that IMProofBench remains both challenging and current.
  • Institutions and companies may utilize the benchmark for transparent internal model evaluation.

Each problem’s grading combines automatic assessment of subquestions and detailed human feedback, fostering broad consistency, transparency, and reproducibility.

6. Modeling Challenges, Insights, and Error Analysis

IMProofBench surfaces recurrent error modes among LLMs:

  • Logical errors: Flawed deductive steps or failed invocation of relevant theorems.
  • Conceptual errors: Misinterpretation of problem statements or incorrect application of mathematical constructs.
  • Hallucination: Fabrication of unsupported claims or invalid reasoning paths.

The observed gap between final-answer accuracy and full-proof quality underscores the inadequacy of answer-only benchmarks for research-level mathematics. A plausible implication is that next-generation LLMs must integrate deep mathematical reasoning with structured tool-use and literature synthesis capabilities.

7. Future Directions and Impact

IMProofBench is intended as a dynamic benchmark, regularly updated via workshops and direct collaborations with the mathematical community. As AI systems progress, the benchmark will support the development of models better suited for mathematical research partnership, providing granular diagnostic feedback and exposing persistent limitations in mathematical reasoning. Its continued evolution will ensure sustained relevance for evaluating the mathematical intelligence of LLMs (Schmitt et al., 30 Sep 2025).


More detailed documentation and problem examples are available via the IMProofBench website.
