
NaturalProver: Knowledge-Grounded Math Proofs

Updated 4 November 2025
  • NaturalProver is a knowledge-grounded neural language model that produces textbook-style mathematical proofs by conditioning on explicit theorems and definitions.
  • It employs a dual-objective training approach and a Stepwise++ decoding strategy to ensure that all provided references are utilized in generating logically coherent proofs.
  • Empirical evaluations demonstrate improved success rates in proof correctness and step usefulness, highlighting its potential for educational tools and research applications.

NaturalProver is a knowledge-grounded language model system for generating mathematical proofs in natural mathematical language (the mixture of formal symbols and prose used by mathematicians) by conditioning on explicit background knowledge in the form of named theorems and definitions. Developed to advance theorem proving in the natural style of mathematical discourse, NaturalProver is benchmarked on the NaturalProofs-Gen corpus and represents a distinct step beyond generic LLMs and formal proof generators, treating the production of textbook-style proofs as a reference-utilizing, constraint-guided sequence generation task (Welleck et al., 2022).

1. Problem Definition and Task Structure

NaturalProver addresses two core generative tasks aligned with mathematical practice:

  1. Next-step suggestion: Given a theorem statement, a partial proof (context), and a list of reference titles (theorems/definitions)—either retrieved or provided—generate the next plausible proof step as continuation.
  2. Full proof generation: Given a theorem and a set of references, generate the entire stepwise proof, structuring arguments that blend formal symbolic calculations with prose justifications.

These tasks are grounded in the mathematical workflow, where constructing sensible, human-like proofs requires not only stepwise symbolic manipulation but also the context-sensitive deployment of established mathematical facts.
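Concretely, the two tasks share a single conditioning signature (theorem, references, partial proof). The following is a minimal sketch of the task interfaces; the ProofContext type and the model.generate call are hypothetical illustrations, not the authors' API:

# Hypothetical sketch of the two NaturalProver tasks; all names here are
# illustrative and not taken from the NaturalProver codebase.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProofContext:
    theorem_title: str
    theorem_statement: str
    ref_titles: List[str]                                    # retrieved or human-provided
    partial_proof: List[str] = field(default_factory=list)   # proof steps so far

def next_step_suggestion(model, ctx: ProofContext) -> str:
    # Task 1: one plausible continuation of the partial proof.
    return model.generate(ctx, max_steps=1)

def full_proof_generation(model, ctx: ProofContext) -> List[str]:
    # Task 2: the entire stepwise proof, starting from an empty context.
    return model.generate(ctx, max_steps=None)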

2. Model Architecture and Knowledge Conditioning

NaturalProver fine-tunes a GPT-3 "Curie" model (or a related decoder-only LM) on natural proof generation, augmented with explicit reference grounding:

  • Input encoding includes the theorem (title and content) and a (retrieved or gold) set of reference titles. The model receives the structured input:

<theorem> <title> ... </title> <content> ... </content> </theorem>
<ref> <ref-title-1> </ref> ... <ref> <ref-title-R> </ref>
<proof>
[proof tokens]

  • Proof targets are stepwise natural mathematical language tokens that cite the required theorems and definitions by their reference titles.
  • Auxiliary training objective: The model is also trained to reconstruct the full content of each reference title provided, thereby internalizing a mapping from symbolic reference to plausible mathematical content; a schematic sketch of this setup appears at the end of this section. The overall training loss is:

$$\mathcal{L}(\theta) = \frac{1}{|\mathcal{D}^\text{train}| + |\mathcal{R}^\text{train}|} \left[\sum_{(x, y) \in \mathcal{D}^\text{train}} -\log p_\theta(y \mid x, R_\text{title}) + \sum_{r \in \mathcal{R}^\text{train}} -\log p_\theta(r_\text{content} \mid r_\text{title}) \right]$$

This dual-objective setup ensures the model is both grounded in, and capable of invoking, existing knowledge in proofs.
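The following is a minimal sketch of the input serialization and combined loss, under stated assumptions: the lm_nll callable, standing in for the negative log-likelihood a causal LM assigns to a target given a context, and all helper names are hypothetical.

# Sketch of input serialization and the dual-objective loss; `lm_nll` is a
# hypothetical stand-in for -log p_theta(target | context) under the LM.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class ProofExample:
    theorem_title: str
    theorem_content: str
    ref_titles: List[str]
    proof: str                       # target proof text

def serialize_input(ex: ProofExample) -> str:
    # Matches the tagged input format shown above.
    refs = " ".join(f"<ref> {t} </ref>" for t in ex.ref_titles)
    return (f"<theorem> <title> {ex.theorem_title} </title> "
            f"<content> {ex.theorem_content} </content> </theorem> "
            f"{refs} <proof>")

def dual_objective_loss(proofs: List[ProofExample],
                        refs: List[Tuple[str, str]],          # (title, content)
                        lm_nll: Callable[[str, str], float]) -> float:
    proof_term = sum(lm_nll(serialize_input(ex), ex.proof) for ex in proofs)
    ref_term = sum(lm_nll(title, content) for title, content in refs)
    # Average of proof NLL and reference-reconstruction NLL, as in the loss above.
    return (proof_term + ref_term) / (len(proofs) + len(refs))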

3. Constrained Stepwise Generation (“Stepwise++” Decoding)

To encourage the use of all specified references and to avoid hallucinated or circular argumentation, NaturalProver augments generation with hard and soft constraints:

  • Constraint: All references supplied must appear at least once in the generated proof.
  • Value function for search at each generation step combines log-probability with a constraint satisfaction signal:

$$v_\alpha(y_{\leq t}) = \alpha \cdot v_\text{constraint}(y_{\leq t}) + (1-\alpha) \cdot v_\text{LM}(y_{\leq t})$$

where $v_\text{constraint}(y_{\leq t})$ is the number of distinct references used up to step $t$, and $v_\text{LM}(y_{\leq t})$ is the sum of token log-likelihoods.

  • Decoding strategy: At each proof step, the model generates $k$ candidates using random sampling (at varying temperatures), followed by beam or best-candidate selection according to the value function. This process (“Stepwise++”) continues iteratively until all references are used and a proof is completed; a schematic version is sketched below.

This machinery both grounds proofs and regularizes their structure to avoid tangential or unsupported steps.
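A schematic version of this decoding loop follows, under stated assumptions: the sample_step callable, the temperature schedule, the termination signal, and the values of alpha and k are illustrative, not the paper's exact settings.

# Schematic Stepwise++-style decoding. `sample_step(prefix, temperature)` is a
# hypothetical LM call returning (step_text, step_logprob); constants are
# illustrative.
from typing import Callable, List, Tuple

def value(proof_so_far: str, ref_titles: List[str],
          logprob_sum: float, alpha: float = 0.75) -> float:
    # v_constraint: number of distinct supplied references mentioned so far.
    v_constraint = sum(1 for t in ref_titles if t in proof_so_far)
    return alpha * v_constraint + (1 - alpha) * logprob_sum

def stepwise_decode(prefix: str, ref_titles: List[str],
                    sample_step: Callable[[str, float], Tuple[str, float]],
                    k: int = 10, max_steps: int = 20) -> str:
    proof, logprob_sum = "", 0.0
    for _ in range(max_steps):
        # Sample k candidate next steps at varying temperatures.
        temps = [0.3 + 0.7 * i / max(k - 1, 1) for i in range(k)]
        candidates = [sample_step(prefix + proof, t) for t in temps]
        # Keep the candidate with the highest combined value.
        step, lp = max(candidates,
                       key=lambda c: value(proof + c[0], ref_titles,
                                           logprob_sum + c[1]))
        proof += step
        logprob_sum += lp
        if not step.strip() and all(t in proof for t in ref_titles):
            break   # empty step used here as a stand-in end-of-proof signal
    return proof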

4. Empirical Results: Human and Automatic Evaluation

NaturalProver is evaluated on the NaturalProofs-Gen test set, which consists of several thousand theorems and associated human-written proofs. Evaluations include both:

  • Human judgments by university mathematics students (100 dev/test problems): These assess step correctness, usefulness as hints, full proof correctness, and error typology (reference use, equations, general logical flaws).
  • Automatic metrics:
    • Ref-F1 and kF1: Coverage of references required by the gold proof (an illustrative computation is sketched after this list).
    • Token-level BLEU/GLEU/F1: Lexical overlap with gold proof steps.
    • Stepwise correctness: Fraction of steps judged as correct or useful.
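As an illustration, the reference-coverage score can be read as a set-level F1 over reference titles; this is a simplified reading, not the benchmark's exact implementation.

# Simplified Ref-F1: set overlap between the references mentioned in a
# generated proof and those used in the gold proof.
from typing import Set

def ref_f1(pred_refs: Set[str], gold_refs: Set[str]) -> float:
    if not pred_refs or not gold_refs:
        return 0.0
    tp = len(pred_refs & gold_refs)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_refs)
    recall = tp / len(gold_refs)
    return 2 * precision * recall / (precision + recall)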

Results summary:

| Model | Step Useful (%) | Step Correct (%) | Full Proof Useful (%) | Full Proof Correct (%) | Ref Error (%) | Eqn Error (%) | Other Error (%) |
|---|---|---|---|---|---|---|---|
| GPT-3 (fine-tuned) | 25.7 | 28.2 | 20 | 13 | 30.9 | 32.5 | 40.1 |
| NaturalProver Retrieve | 41.5 | 33.6 | 32 | 24 | 23.5 | 37.6 | 23.7 |
| NaturalProver (references) | 39.6 | 26.3 | 35 | 24 | 25.8 | 35.9 | 25.2 |
| NaturalProver Stepwise++ | 46.6 | 35.4 | 45 | 32 | 23.6 | 28.5 | 18.4 |
| Next-step suggestion | 51.4 | 42.9 | -- | -- | 19.7 | 26.3 | 19.1 |

Key findings:

  • Next-step generation is rated "correct" or "useful" in over 40% of cases, a first for open-domain natural mathematical discourse.
  • Full generated proofs are rated correct in 32% of test cases, more than doubling baseline GPT-3.
  • Both reference and symbolic reasoning errors are reduced; explicit reference grounding virtually eliminates reference hallucinations (<2% incidence).

5. Examples and Qualitative Observations

NaturalProver routinely produces stepwise proofs for undergraduate-level theorems in topology, algebra, and real analysis, including proofs requiring induction, chained application of definitions, or nontrivial reference deployment. Illustrative examples include:

  • Singleton sets in topology: Correctly deploys a theorem about isolated points, then invokes the definition of "dense-in-itself."
  • Inductive proofs: Successfully identifies and constructs the induction template, then invokes the base and inductive steps with references to previously derived results.

The system can synthesize all steps required for short (2–6 step) proofs and generate plausible next steps or hints useful to mathematics students.

6. Significance, Limitations, and Prospects

NaturalProver experimentally demonstrates that knowledge-grounded neural LMs can operate in mathematical natural language domains—outperforming generic LMs in terms of both correctness and alignment with human mathematical thinking.

Significance:

  • Educational utility: Next-step suggestions can underpin intelligent proof assistants or tutoring systems for undergraduate mathematics.
  • Bridging formal and informal reasoning: Grounding in explicit references moves generative modeling beyond mere pattern completion, toward structured, checkable argument construction.
  • Proof transparency and traceability: The forced inclusion of references, coupled with stepwise decoding, yields proofs readable and auditable by mathematicians.

Limitations:

  • Long-chain logical coherence degrades over many proof steps; multi-step symbolic manipulation remains challenging due to error accumulation.
  • The model operates primarily at the level of natural mathematical discourse, not formal code; correctness cannot be mechanically verified as in proof assistants.
  • Coverage of advanced domains depends on the scope and granularity of the underlying background knowledge (reference library).

7. Context within Broader Research and Future Directions

NaturalProver is situated at the intersection of knowledge-augmented language modeling and automated mathematical reasoning. Its methodology aligns closely with the NaturalProofs benchmark (Welleck et al., 2021), and is complementary to systems focused on formal proof generation (e.g., in Lean, Isabelle) or formal-informal translation.

Future improvements may involve:

  • Richer retrieval and grounding using fuller statement content and semantic similarity.
  • Integration with formal proof tools for combined natural + formal proof support.
  • Enhanced reference selection and reranking to better guide complex proofs.

NaturalProver's approach establishes a baseline and opens research directions for grounding mathematical LLMs in structured mathematical knowledge for improved transparency and reliability in mathematical reasoning tasks.

References (2)

  • Welleck et al. (2022). NaturalProver: Grounded Mathematical Proof Generation with Language Models. NeurIPS 2022.
  • Welleck et al. (2021). NaturalProofs: Mathematical Theorem Proving in Natural Language. NeurIPS 2021 Datasets and Benchmarks Track.
