
ConjectureBench: Evaluating LLM Conjecturing

Updated 18 October 2025
  • ConjectureBench is a dataset and evaluation framework designed to isolate and assess the conjecturing phase for autoformalisation in mathematics.
  • It decouples the generation of mathematical conjectures from formal translation and proof synthesis using seen and unseen evaluation settings.
  • It introduces the evaluation metrics ConJudge and equiv_rfl and pairs them with the Lean-FIRe inference-time method, which substantially improves conjecturing performance.

ConjectureBench is a specialized dataset and evaluation framework designed to measure and advance the conjecturing capabilities of LLMs within the broader task of autoformalisation of mathematics. It specifically addresses the often-overlooked but essential step of generating a mathematical conjecture—such as an explicit solution, bound, or proposition—prior to formalization and proof synthesis. By isolating conjecturing from subsequent formal translation and proof procedures, ConjectureBench establishes new metrics and benchmarks for accurately assessing and improving AI systems’ ability to formulate original mathematical insights.

1. Motivation and Design of ConjectureBench

ConjectureBench was constructed to fill a significant gap in the autoformalisation pipeline for mathematics. While prior benchmarks (e.g., PutnamBench, CombiBench) primarily target the translation of informal statements into formal mathematics, or focus on proof synthesis given a statement, they typically assume that the central conjecture—the explicit proposition to be proved—either appears in the input or is trivially extractable. Such assumptions confound the assessment of true “conjecturing” capability, conflating it with translation or proof. ConjectureBench intervenes by systematically removing the embedded conjectures from natural language problems, forcing LLMs to synthesize the candidate statement before any formal translation or proof attempt is possible.

ConjectureBench thus provides two settings for evaluation:

  • Seen: The conjecture (i.e., solution or claim) is given to the model.
  • Unseen: The model must generate both the conjecture and its formalization, starting only from the original informal statement.

This clear separation isolates the conjecturing phase and enables precise evaluation of a model’s ability to “guess” the required claim prior to formalization.
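
As a concrete illustration, the following minimal Lean sketch (using Mathlib notation and hypothetical declaration names; this is not the benchmark's actual answer format) shows what the model must produce in each setting for the quadratic example used in Section 2 below:

import Mathlib

-- Seen setting: the conjecture {0, 4} is supplied with the problem; the model
-- only has to formalise the statement around it.
abbrev given_conjecture : Set ℝ := {0, 4}

theorem seen_target : {x : ℝ | x ^ 2 - 4 * x = 0} = given_conjecture := by
  sorry

-- Unseen setting: the model must first synthesise the conjecture from the
-- informal statement, then formalise the same kind of proof target.
abbrev generated_conjecture : Set ℝ := {0, 4}  -- produced by the model, not given

theorem unseen_target : {x : ℝ | x ^ 2 - 4 * x = 0} = generated_conjecture := by
  sorry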

2. Conjecturing in Formal Mathematical Reasoning

Conjecturing is the process of hypothesizing a plausible explicit answer, proposition, bound, or statement that may serve as a proof target. In human mathematical practice, formalization usually requires the prior identification and articulation of an appropriate candidate: one cannot directly formalise an informal problem such as “Find all roots of x^2 - 4x = 0” without first conjecturing x = 0 and x = 4 as the solution set.

Within automated reasoning, treating conjecturing as a distinct preliminary step is critical. Models may possess the mathematical knowledge required to verify or prove a claim, but without the ability to discover a correct or complete conjecture, the autoformalisation pipeline cannot progress. The absence or incompleteness of a conjecture (e.g., listing only x = 0) leads to incorrect or partial formalizations, even if the proof engines themselves are correct.
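
To make this concrete, here is a minimal Lean sketch (illustrative only, with hypothetical names, not drawn from the benchmark) of how an incomplete conjecture propagates into a false proof target:

import Mathlib

-- Complete conjecture: the resulting proof target is true.
abbrev conjecture_complete : Set ℝ := {0, 4}

-- Incomplete conjecture: 4 also satisfies x^2 - 4x = 0, so the resulting
-- proof target is false no matter how capable the downstream prover is.
abbrev conjecture_incomplete : Set ℝ := {0}

-- The downstream formalisation simply inherits whichever conjecture was produced.
def proofTarget (c : Set ℝ) : Prop :=
  {x : ℝ | x ^ 2 - 4 * x = 0} = c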

3. Evaluation Framework and Metrics

ConjectureBench introduces rigorously designed evaluation protocols that decouple conjecturing from the full formal translation process:

  • ConJudge: An LLM-as-a-Judge framework in which the model's proposed formalization is evaluated for correctness, with respect to both the gold conjecture and the formalized statement. Scores are reported as pass@1 and pass@10 for both seen and unseen settings, allowing direct comparison between scenarios where the conjecture is given or must be generated.
  • equiv_rfl: For standalone conjecture evaluation, this metric employs the Lean tactic rfl (reflexivity of definitional equality) to check whether the generated conjecture is definitionally identical to the gold answer. For example, for the roots of a quadratic, the conjecture

    abbrev conjecture : Set ℝ := {0, 4}

    is correct only if the gold answer also lists both roots.

These metrics permit fine-grained analysis of conjecturing versus proof translation, quantifying the isolated challenge of conjecture synthesis.
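
The summary above does not spell out how pass@k is computed. Assuming ConJudge follows the standard unbiased estimator for pass@k metrics (an assumption, not confirmed here), with n sampled generations per problem of which c are judged correct:

\[
\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]
\]

Under this convention, pass@1 is simply the expected fraction of single samples whose formalization (and, in the unseen setting, conjecture) is judged correct.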

4. Impact on Model Evaluation and Downstream Tasks

Empirical evaluations on ConjectureBench reveal that foundational LLMs such as GPT-4.1 and DeepSeek-V3.1 show marked discrepancies between the seen and unseen settings. In the seen setting (conjecture provided), pass@1 rates can be more than double those in the unseen setting; GPT-4.1’s pass@1 drops from approximately 78.77% to 26.70% under baseline conditions. This sharp contrast demonstrates that previous assessments based on benchmarks that embed the answer overestimate model capability by conflating translation and proof with conjecturing.

Failures are frequently attributable not to a lack of mathematical knowledge, but to an inability to formulate the correct candidate claim. In standalone conjecturing, accuracy is nearly an order of magnitude lower than in scenarios where the conjecture is assumed. These findings call for revised interpretations of model “performance” in formalization pipelines and motivate new training and prompting strategies.

5. Advances via the Lean-FIRe Inference-Time Method

Lean-FIRe is an inference-time methodology that significantly improves model performance on conjecturing and subsequent autoformalisation. It interleaves:

  1. Chain-of-Thought (CoT): The model produces a natural language decomposition of the informal problem into key intermediate steps and reasoning processes.
  2. Lean-of-Thought (LoT): Each stage of the CoT is mapped into precise Lean primitives, incrementally constructing the required formal specification and, crucially, the conjecture itself.

Seed example pairs (natural language and Lean translation) enable the LLM to align its reasoning with the requirements of formal mathematics, bridging the gap between intuition and executable syntax. Empirical results show Lean-FIRe increases conjecturing pass@1 by over 29% in the “unseen” setting and achieves the first successful end-to-end autoformalisation of 13 “no-answer” PutnamBench problems with GPT-4.1 (7 with DeepSeek-V3.1), validating the effectiveness of staged conjecture synthesis.
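
A minimal sketch of what such interleaving might look like for the running quadratic example appears below; the comment lines stand in for the natural-language chain of thought, and the declaration names are illustrative rather than taken from Lean-FIRe's actual prompts or seed pairs:

import Mathlib

-- CoT step 1: x^2 - 4x factors as x * (x - 4), so the equation x^2 - 4x = 0
--             holds exactly when x = 0 or x = 4.
-- LoT step 1: commit to the explicit conjecture as a Lean term.
abbrev conjecture : Set ℝ := {0, 4}

-- CoT step 2: state the problem as an equality between the solution set and
--             the conjectured set.
-- LoT step 2: the formal statement; the proof itself is downstream of
--             conjecturing and is left out of scope here.
theorem quadratic_target : {x : ℝ | x ^ 2 - 4 * x = 0} = conjecture := by
  sorry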

6. Implications for Future Research

The introduction of ConjectureBench and the associated findings highlight the centrality of conjecturing as a bottleneck in current mathematical reasoning systems. The work advocates for:

  • Creation of richer datasets targeting conjecture synthesis as a separable skill.
  • Explicit separation of conjecture generation from proof and translation in both training and evaluation.
  • Further development of inference-time reasoning techniques (such as Lean-FIRe) and prompt engineering that encourage multi-step, decompositional thinking.
  • Enhanced metrics (such as ConJudge and equiv_rfl) for diagnosing and tracking progress on conjecturing, independent of downstream formal proof capability.

Improving model performance on conjecturing is necessary for end-to-end automated mathematical reasoning, and future efforts must focus on robustly integrating hypothesis generation within the broader autoformalisation pipeline.

7. Example Tasks and Technical Details

Below is a representative schematic for the evaluation of conjecturing within ConjectureBench:

Input problem: Find all x such that x^2 - 4x = 0

  • Model output (unseen): {0}; gold conjecture: {0, 4}; equiv_rfl fails (incomplete conjecture).
  • Model output (unseen): {0, 4}; gold conjecture: {0, 4}; equiv_rfl passes (definitionally equal).

The corresponding Lean tactic check is:

import Mathlib

abbrev conjecture_gold : Set ℝ := {0, 4}
abbrev conjecture_generated : Set ℝ := {0, 4}

-- rfl closes the goal because both abbreviations unfold to the same term.
theorem thm : conjecture_gold = conjecture_generated := by rfl

This illustrates that the completeness and precision of the conjecture are strictly enforced at the formal level, and nuances such as missing solutions are recognized by the evaluation metric.
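
Continuing the snippet above, a brief sketch of the failure case (not part of the benchmark excerpt) shows how equiv_rfl flags a missing root:

abbrev conjecture_partial : Set ℝ := {0}

-- The following line would fail to elaborate: {0} and {0, 4} are not
-- definitionally equal, so equiv_rfl rejects the partial conjecture.
-- theorem thm_bad : conjecture_gold = conjecture_partial := by rfl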

Summary

ConjectureBench marks a transition in the evaluation and training of mathematical AI systems: from monolithic assessment conflating conjecturing and proof, toward a modular paradigm that rigorously isolates and measures the generative mathematical creativity at the core of formal reasoning. Through focused metrics, new inference methods, and dataset curation, it exposes the fundamental limitations of current systems and charts a program for their advancement in conjecture-centric mathematics.
