Open Proof Corpus Benchmark

Updated 29 August 2025
  • Open Proof Corpus is a comprehensive dataset featuring over 5,000 human-validated proofs for 1,000+ Olympiad-level math problems.
  • It utilizes rigorous human annotation protocols with step-by-step reasoning and binary correctness labels by expert graders.
  • The corpus enables detailed analysis of LLM performance through best-of-n selection strategies, comparative metrics, and systematic error analysis.

The Open Proof Corpus (OPC) is a large-scale, human-validated dataset aimed at advancing the empirical study and training of mathematical proof generation by LLMs. Designed to overcome the scarcity of broad, high-quality datasets of rigorously evaluated proofs, the OPC provides a standardized platform for benchmarking, analyzing, and fine-tuning automated proof generation systems on highly challenging mathematics problems, especially those from international and national olympiads.

1. Composition and Annotation Protocol

The Open Proof Corpus contains over 5,000 natural language proofs, each corresponding to one of more than 1,000 distinct mathematics problems sourced from competitions such as the International Mathematical Olympiad (IMO) and the United States of America Mathematical Olympiad (USAMO). Each entry in the OPC includes (an illustrative record layout is sketched after this list):

  • The original problem statement;
  • A full, LLM-generated natural language proof;
  • A binary human evaluation label denoting proof correctness, accompanied by annotations and justifications by expert graders.
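
For concreteness, here is a minimal sketch of how a single OPC entry could be represented in code. The field names are illustrative assumptions for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class OPCRecord:
    """Illustrative shape of one OPC entry; field names are assumptions,
    not the dataset's published schema."""
    problem_id: str            # identifier of the competition problem
    problem_statement: str     # original problem text
    model_name: str            # LLM that produced the proof, e.g. "Gemini-2.5-Pro"
    proof: str                 # full natural-language proof generated by the model
    is_correct: bool           # binary human correctness label
    grader_comments: list[str] = field(default_factory=list)  # expert annotations and justifications
```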

All proof evaluations are performed by mathematically qualified human assessors, including former IMO medalists. Grading guidelines are formulated collaboratively to ensure uniform criteria, step-by-step scrutiny of reasoning, and the identification of minor errors such as skipped derivations or unjustified inferential leaps. This protocol produces high-granularity correctness judgments that go beyond mere final-answer validity.

2. Research Questions and Coverage

The OPC was created to address fundamental questions in automated proof generation research:

  • Performance Differential Between Informal and Formal Proof Generation: The dataset supports direct comparisons by including LLM-generated informal proofs and human-checked results for problems that are common benchmarks for formal proof systems.
  • Final-Answer Accuracy vs. Proof Validity: The OPC makes it possible to quantify the discrepancy between superficial answer correctness (the computed solution) and the logical completeness and validity of the full proof; it reveals that answer-only correctness rates overestimate true proof reliability (see the sketch after this list).
  • Impact of Best-of-n Selection: A major focus is the quantitative effect of various n-sample selection and ranking strategies (discrete, continuous, pairwise) on overall proof acceptance rates.
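
The following sketch illustrates how the answer-versus-proof discrepancy could be measured on a set of graded attempts; the record keys are hypothetical stand-ins, not the OPC's actual field names.

```python
def answer_vs_proof_gap(records):
    """Contrast final-answer accuracy with full-proof validity on the same attempts.
    `records` is a list of dicts with assumed keys 'answer_matches_reference'
    and 'proof_is_correct' (the human label)."""
    n = len(records)
    answer_acc = sum(r["answer_matches_reference"] for r in records) / n
    proof_acc = sum(r["proof_is_correct"] for r in records) / n
    return answer_acc, proof_acc, answer_acc - proof_acc

# Toy example: answer accuracy (~0.67) overstates proof validity (~0.33).
records = [
    {"answer_matches_reference": True,  "proof_is_correct": True},
    {"answer_matches_reference": True,  "proof_is_correct": False},  # right answer, flawed proof
    {"answer_matches_reference": False, "proof_is_correct": False},
]
print(answer_vs_proof_gap(records))
```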

This structure enables analysis that is not possible with prior datasets constrained either by scale, by lack of human evaluation, or by a focus solely on formal proof scripts.

3. Model Evaluation and Selection Strategies

OPC’s structure allows detailed evaluation of LLMs involved in both proof generation and proof judging:

  • Model Diversity and Outputs: The corpus aggregates outputs from a spectrum of LLMs, including o4-mini, Gemini-2.5-Pro, o3, Grok-3-Mini, Qwen3-235B-A22B, and DeepSeek-R1, ensuring a representative survey of state-of-the-art performance on olympiad-level problems.
  • Evaluation Metrics: Majority voting and single-pass accuracy are both utilized, with Gemini-2.5-Pro achieving approximately 88.1% accuracy under majority judgment, matching or surpassing other closed- and open-source baselines (a voting sketch follows this list).
  • Judging Capabilities: The dataset supports evaluation of models’ abilities to serve as independent proof judges. Results indicate that, while current leading LLMs are fairly robust as critics, there remains a meaningful performance gap relative to human collective agreement, especially when critiquing self-generated solutions.
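
Below is a minimal sketch of aggregating repeated judge verdicts by majority vote and scoring them against human labels; the data layout and tie-breaking rule are assumptions for illustration, not the paper's exact protocol.

```python
from collections import Counter

def majority_verdict(judge_verdicts):
    """judge_verdicts: booleans from repeated judge passes on one proof.
    Returns the majority verdict, breaking ties toward 'incorrect'."""
    counts = Counter(judge_verdicts)
    return counts[True] > counts[False]

def judge_accuracy(proofs):
    """proofs: list of (judge_verdicts, human_label) pairs.
    Fraction of proofs where the aggregated verdict matches the human label."""
    hits = sum(majority_verdict(verdicts) == label for verdicts, label in proofs)
    return hits / len(proofs)

# Hypothetical example: three judge passes per proof, compared to human labels.
proofs = [([True, True, False], True), ([False, False, True], False)]
print(judge_accuracy(proofs))  # 1.0
```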

The methodology also reveals that best-of-n selection benefits greatly from pairwise, ranking-based strategies, with the Bradley-Terry model for paired comparisons:

P(i \text{ beats } j) = \frac{1}{1 + \exp(r_j - r_i)}

where r_i and r_j are latent quality ratings fitted from pairwise judgments, delivering a relative gain of 10–17% in proof correctness over simpler scoring heuristics.
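
A minimal sketch of Bradley-Terry-based best-of-n selection is shown below; the judge outcomes and the fitting procedure (regularized gradient ascent) are illustrative assumptions, not the paper's exact pipeline.

```python
import math

def p_beats(r_i, r_j):
    """Bradley-Terry probability that candidate i beats candidate j."""
    return 1.0 / (1.0 + math.exp(r_j - r_i))

def fit_ratings(n, comparisons, lr=0.1, steps=500, reg=0.01):
    """Fit Bradley-Terry ratings by regularized gradient ascent on the pairwise
    log-likelihood. `comparisons` is a list of (winner, loser) index pairs,
    e.g. produced by an LLM judge doing pairwise comparisons."""
    r = [0.0] * n
    for _ in range(steps):
        grad = [-reg * ri for ri in r]           # small L2 penalty keeps ratings finite
        for w, l in comparisons:
            p = p_beats(r[w], r[l])
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        r = [ri + lr * g for ri, g in zip(r, grad)]
    return r

# Best-of-n selection: keep the candidate proof with the highest fitted rating.
comparisons = [(0, 1), (0, 2), (2, 1), (0, 2)]   # hypothetical pairwise judge outcomes
ratings = fit_ratings(3, comparisons)
print(max(range(3), key=lambda i: ratings[i]), [round(x, 2) for x in ratings])
```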

4. Applications, Impact, and Broader Context

The OPC serves a dual purpose:

  • Benchmark Resource: By supplying a diverse set of challenging mathematics problems with corresponding LLM proofs and rigorous human feedback, the OPC establishes a standardized benchmark for evaluating future proof generation systems.
  • Training Set for Fine-Tuning and Alignment: The project demonstrates that fine-tuning an 8B-parameter model on the OPC (e.g., with reinforcement learning methods such as GRPO) yields a verifier/judge model that matches or exceeds the best closed-source baselines. This enables open-source development of high-quality, verifiable mathematical LLMs (a schematic of the group-relative reward signal follows this list).
  • Implications: The OPC provides a foundation for systematizing research on the improvement of LLMs’ mathematical reasoning, supports the move from answer-based to proof-based evaluation paradigms, and enables nuanced error analysis across a wide range of problem types and model architectures.
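
As a rough illustration of the group-relative signal at the heart of GRPO-style fine-tuning, the sketch below normalizes per-sample rewards within a group of samples for the same prompt; the reward definition is a hypothetical stand-in, not the paper's training recipe.

```python
def grpo_advantages(rewards):
    """Group-relative advantages: each sampled response's reward is normalized
    against the mean and standard deviation of its own group of samples for the
    same prompt. Simplified sketch only."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0   # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Hypothetical verifier rewards for 4 sampled judgments of one proof:
# 1.0 if the model's verdict matches the human label, else 0.0.
print(grpo_advantages([1.0, 0.0, 1.0, 1.0]))
```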

5. Technical Infrastructure and Methodological Innovations

The OPC workflow is underpinned by several technical and organizational protocols:

  • Proof Generation Pipeline: LLMs produce proposed proofs, which are then systematically graded by human experts using a web-based annotation interface.
  • Grading Protocols: Detailed, collaboratively maintained rubrics guide judgments about acceptable levels of minor errors, oversights, or unwarranted “trivial” steps, ensuring evaluator consistency and completeness.
  • Best-of-n Selection: Discrete scoring, continuous rating, and pairwise ranking tournaments (in both bracket and Swiss formats) are all tested for selecting the best output proof from multiple LLM candidates (a bracket-style sketch follows this list).
  • LaTeX and Diagrammatic Support: Mathematical notation and TikZ diagramming are employed for unambiguous presentation and to aid both human and machine understanding.
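
For illustration, here is a minimal single-elimination bracket over candidate proofs driven by a pairwise judge; the judge function is a toy stand-in for an LLM comparison prompt, not the paper's tournament implementation.

```python
import random

def bracket_select(candidates, judge):
    """Single-elimination bracket over candidate proofs.
    judge(a, b) returns the preferred proof of the pair (e.g. an LLM judge
    prompted for a pairwise comparison)."""
    pool = list(candidates)
    random.shuffle(pool)                         # random seeding of the bracket
    while len(pool) > 1:
        next_round = []
        # Pair off candidates; an odd one out gets a bye to the next round.
        for i in range(0, len(pool) - 1, 2):
            next_round.append(judge(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]

# Toy judge preferring the longer proof, standing in for an LLM pairwise judge.
print(bracket_select(["proof A", "proof BB", "proof CCC"], lambda a, b: max(a, b, key=len)))
```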

6. Comparative Findings on Benchmarking and Error Analysis

The OPC enables direct, data-driven comparison of LLM proof capabilities with formal prover pipelines. Empirically, general-purpose LLMs producing natural-language proofs, such as Gemini-2.5-Pro, currently far outperform formal-prover models (e.g., DeepSeek-Prover-V2), which exhibit notably lower rates of correct solutions on olympiad benchmarks. The dataset allows researchers to examine error types, frequencies, and sources both in generated proofs and in model evaluations, clarifying that final-answer benchmarks systematically overreport model performance relative to full-proof validity.

A plausible implication is that future progress in automated mathematics will critically depend on datasets and evaluation frameworks—such as the OPC—that foreground stepwise reasoning and rigorous validation over superficial result matching.

7. Prospects and Research Directions

Developers and researchers leveraging the OPC are positioned to explore:

  • Optimization of best-of-n proof selection strategies using ranking-driven evaluation;
  • Improvements in LLM self-critique and verification systems;
  • The development of hybrid informal-formal proof systems using OPC data to cross-calibrate generation and validation modules.

The dataset is anticipated to both accelerate progress toward reliable automated theorem proving and refine the metrics by which such systems are evaluated, with broader implications for LLM-based mathematics education, research, and applications requiring high-assurance reasoning.


In summary, the Open Proof Corpus provides a comprehensive, rigorously evaluated resource for benchmarking, training, and analysis in the rapidly advancing field of LLM-driven mathematical proof generation. Its size, annotation protocol, and methodological emphasis on full-proof correctness distinguish it as a critical infrastructure for research into verifiable, trustworthy automated mathematics (Dekoninck et al., 23 Jun 2025).
