Spark-Prover-X1-7B: 7B Formal Theorem Prover
- Spark-Prover-X1-7B is a 7-billion-parameter decoder-only transformer designed to automate formal mathematical theorem proving.
- It uses a three-stage training pipeline, including continuous pre-training, expert iteration, and GRPO-based reinforcement learning to enhance reasoning capabilities.
- The model demonstrates competitive performance on benchmarks like CombiBench and PutnamBench and integrates with interactive proof assistants, principally Lean.
Spark-Prover-X1-7B is a 7-billion-parameter, decoder-only transformer model, architected for formal automated theorem proving in mathematics. Developed as part of the Spark-Prover-X1 initiative, it leverages a rigorously structured three-stage training pipeline that integrates diverse mathematical data, expert-driven supervised fine-tuning, and reinforcement learning, aiming to enhance lightweight LLM capability on formal mathematical reasoning tasks. The model is evaluated using a suite of real-world and competition-grade formal proof benchmarks and is distributed alongside its formalizer counterpart, Spark-Formalizer-X1-7B, and the comprehensive ExamFormal-Bench dataset (Zhou et al., 17 Nov 2025).
1. Model Architecture
Spark-Prover-X1-7B uses a decoder-only transformer configuration consistent with LLaMA-style foundation models, employing standard multi-head self-attention (Zhou et al., 17 Nov 2025). Beyond the 7-billion-parameter count, architectural details such as layer counts or hidden dimensions are not reported in the source. The transformer is designed to support the step-wise state and tactic encoding required by interactive theorem-proving environments.
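One way to serialize a proof state and a candidate tactic into a single decoder-only context is a flat tagged prompt. The `[STATE]`/`[TACTIC]` markers below are illustrative assumptions, not the paper's actual prompt format:

```python
def format_step(state_before: str, tactic: str) -> str:
    """Serialize a Lean proof state and a tactic into one prompt string.

    The bracketed tags are hypothetical markers chosen for this sketch;
    the paper does not specify its prompt template.
    """
    return f"[STATE]\n{state_before}\n[TACTIC]\n{tactic}\n[STATE_AFTER]\n"

# The model would then be asked to continue the prompt with the next state.
prompt = format_step("n : Nat\n⊢ n + 0 = n", "simp")
```

A decoder-only model trained on such sequences learns the state-transition structure directly from next-token prediction, with no architectural changes.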
2. Progressive Training Methodology
A three-stage training methodology underpins Spark-Prover-X1-7B, aiming for both breadth and depth in formal mathematical competence.
2.1 Continuous Pre-training
- Mathematical Corpus: The pre-training draws from open-source formal libraries (e.g., Lean, Mathlib4, tactic tutorials, and curated GitHub repositories), as well as natural-language mathematical problems spanning middle-school to undergraduate syllabi and synthetic datasets that pair formal and informal proofs or decompose proofs into subgoals.
- Data Tasks: The pre-training addresses step-wise tactic/state extraction and introduces CoT-augmented state prediction. In the latter, given a "state before" $s$ and a tactic $a$, the model predicts the "state after" $s'$ together with an accompanying natural-language chain-of-thought $c$, optimizing a cross-entropy state-prediction loss:

$\mathcal{L}_{\mathrm{state}}(\theta) = -\log \pi_\theta(c, s' \mid s, a)$

Filtering enforces exact match between the predicted and ground-truth next states.
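The exact-match filter can be sketched as follows; the sample fields and the whitespace normalization are illustrative assumptions, not details from the paper:

```python
def normalize(state: str) -> str:
    # Collapse whitespace so purely cosmetic formatting differences
    # do not cause false rejections (an assumed, not paper-specified, step).
    return " ".join(state.split())

def filter_exact_match(samples):
    """Keep only samples whose predicted next state exactly matches
    the ground-truth next state after normalization."""
    return [s for s in samples
            if normalize(s["predicted_state"]) == normalize(s["gold_state"])]

data = [
    {"predicted_state": "⊢ 0 = 0", "gold_state": "⊢ 0 = 0"},  # kept
    {"predicted_state": "⊢ 1 = 0", "gold_state": "⊢ 0 = 0"},  # rejected
]
kept = filter_exact_match(data)  # only the first sample survives
```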
2.2 Supervised Fine-Tuning and Expert Iteration
- Expert Iteration Loop: Following pre-training, the model generates new auto-formalization and proof data, which is verified and appended to the SFT corpus. Fine-tuned models then participate in subsequent data generation. This loop is designed to expand the distribution and coverage of formal states and proofs.
- Objective: Standard cross-entropy over tokenized formalization or proof targets $y$ given input $x$:

$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})$
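The expert-iteration loop above can be sketched in Python; `generate_proofs`, `lean_verify`, and `finetune` are hypothetical stand-ins for the model sampler, the Lean checker, and the SFT step, stubbed out here so the sketch runs:

```python
def generate_proofs(model, theorems):
    # Hypothetical sampler stand-in: one trivial candidate per theorem.
    return [f"-- candidate proof of {t}" for t in theorems]

def lean_verify(proof):
    # Stand-in for invoking the Lean kernel; accepts everything here.
    return True

def finetune(model, corpus):
    # Stand-in for the supervised fine-tuning step.
    return model

def expert_iteration(model, theorems, corpus, rounds=3):
    """Alternate sampling, verifier filtering, and fine-tuning,
    growing the SFT corpus with verified proofs each round."""
    for _ in range(rounds):
        candidates = generate_proofs(model, theorems)        # sample attempts
        verified = [p for p in candidates if lean_verify(p)]  # keep valid proofs
        corpus.extend(verified)                               # grow SFT data
        model = finetune(model, corpus)                       # next iterate
    return model, corpus

model, corpus = expert_iteration("base-model", ["add_comm", "mul_one"], [], rounds=2)
```

Only verifier-approved proofs ever enter the corpus, so each iterate is trained on data whose correctness is machine-checked rather than model-asserted.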
2.3 Group Relative Policy Optimization (GRPO)
- Difficulty Filtering: The hardest subset of formal theorems is selected after SFT.
- Objective and Update: A single RL round (clip-higher variant of GRPO) proceeds, normalizing per-token advantages over maximum trajectory length and incorporating group-based trajectory rewards. The objective per query is:
$\mathcal{J}(\theta) = \mathbb{E}_{q,\,\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}}\Biggl[\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o|_{\max}} \sum_{t=1}^{|o_i|} \Bigl( \min\bigl(r_{i,t}(\theta)\,\hat A_i,\ \mathrm{clip}\bigl(r_{i,t}(\theta),\,1-\varepsilon_{\mathrm{low}},\,1+\varepsilon_{\mathrm{high}}\bigr)\,\hat A_i\bigr) - \beta\,D_{\mathrm{KL}}\bigl(\pi_{\mathrm{ref}}(\cdot\mid q,o_{i,<t})\;\|\;\pi_\theta(\cdot\mid q,o_{i,<t})\bigr)\Bigr)\Biggr]$

where $\hat A_i$ denotes the standardized group reward advantage, $r_{i,t}(\theta) = \pi_\theta(o_{i,t}\mid q,o_{i,<t})\,/\,\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,o_{i,<t})$ is the likelihood ratio, and $\pi_{\mathrm{ref}}$ is the SFT reference policy.
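The clipped, KL-regularized objective can be sketched numerically for one group of trajectories; the default clipping and KL coefficients below are illustrative, not the paper's values:

```python
def grpo_objective(ratios, advantages, kl, eps_low=0.2, eps_high=0.28, beta=0.01):
    """Clip-higher GRPO surrogate for one group of G trajectories.

    ratios:     G lists of per-token likelihood ratios pi_theta / pi_old
    advantages: G standardized group reward advantages (one per trajectory)
    kl:         G lists of per-token KL(pi_ref || pi_theta) estimates
    Every trajectory is normalized by the maximum length |o|_max,
    matching the paper's per-token normalization.
    """
    max_len = max(len(r) for r in ratios)
    total = 0.0
    for r_traj, a, kl_traj in zip(ratios, advantages, kl):
        term = 0.0
        for r, k in zip(r_traj, kl_traj):
            # Asymmetric "clip-higher" bounds on the likelihood ratio.
            clipped = min(max(r, 1 - eps_low), 1 + eps_high)
            # Pessimistic (min) clipped surrogate, minus the KL penalty.
            term += min(r * a, clipped * a) - beta * k
        total += term / max_len
    return total / len(ratios)
```

With all ratios at 1 and zero KL, the objective reduces to the mean group advantage, which is zero by the standardization, a useful sanity check.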
3. Datasets and Benchmarks
Evaluation emphasizes generalization and real-world difficulty via ExamFormal-Bench (402 problems sampled across analysis, geometry, algebra, probability & statistics, computational mathematics, discrete mathematics), which is manually OCR’d and verified (Zhou et al., 17 Nov 2025). Additional benchmarks include miniF2F-test, ProofNet-test, PutnamBench, CombiBench, FormalMATH-Lite, ProverBench, and MathOlympiadBench.
Quantitative Results
| Benchmark | DeepSeek-7B | Kimina-8B | Goedel-8B | Spark-Prover-X1-7B |
|---|---|---|---|---|
| miniF2F-test | 75.6% | 78.3% | 84.6% | 75.0% |
| ProofNet-test | 23.0% | 11.0% | 19.4% | 23.1% |
| PutnamBench | 1.4% | 2.9% | 3.8% | 4.7% |
| CombiBench | 16.0% | 6.0% | 12.0% | 24.0% |
| FormalMATH-Lite | 51.8% | 55.5% | 55.3% | 59.8% |
| ProverBench | 49.0% | 38.8% | 52.0% | 47.4% |
| MathOlympiadBench | 8.9% | 8.1% | 10.8% | 11.1% |
| ExamFormal-Bench | 49.0% | 45.3% | 48.8% | 51.2% |
| Average | 34.3% | 30.7% | 35.8% | 37.0% |
A 37.0% average pass@32 places Spark-Prover-X1-7B above other open-source models of similar size. On CombiBench, it achieves a 50% relative increase (24.0% vs. DeepSeek’s 16.0%), and it solves 27 PutnamBench problems, exceeding Goedel and DeepSeek at the same sampling budget. No formal significance testing is reported.
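Under the pass@32 metric used here, a problem counts as solved if any of its 32 sampled proofs verifies. A minimal scoring sketch (the sample results are illustrative):

```python
def pass_at_k(results_per_problem):
    """results_per_problem: one list of booleans per problem, each entry
    recording whether a sampled proof verified.
    Returns the fraction of problems with at least one verified sample."""
    solved = sum(any(r) for r in results_per_problem)
    return solved / len(results_per_problem)

# Two problems, 4 samples each: the first is solved on the 3rd try,
# the second never verifies.
score = pass_at_k([[False, False, True, False], [False] * 4])  # 0.5
```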
4. Integration with Formal Verification and Proof Assistants
Spark-Prover-X1-7B is designed to be embedded into proof environments, representing proof states and tactics in forms directly consumable by interactive theorem provers, principally Lean. Its data tasks explicitly encode stepwise state transitions and tactic predictions, supporting both interactive proof search and autoformalization workflows.
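As a concrete illustration of the state/tactic transitions such a workflow traverses, a minimal Lean 4 proof (a standard example, not drawn from the paper's data):

```lean
theorem my_add_zero (n : Nat) : n + 0 = n := by
  -- state before: n : Nat ⊢ n + 0 = n
  rfl
  -- state after: no goals
```

Each tactic application maps a "state before" to a "state after"; these are exactly the pairs the model's pre-training tasks extract and predict.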
5. Release, Licensing, and Research Resources
All model checkpoints (Spark-Prover-X1-7B and Spark-Formalizer-X1-7B), along with ExamFormal-Bench, are publicly released.
The code repositories contain standard open-source licenses permitting research usage. Licensing nuances beyond this are not specified (Zhou et al., 17 Nov 2025). The release contextualizes Spark-Prover-X1-7B as a resource for the formal mathematics, logic, and theorem proving communities, supporting further research into lightweight transformer-based formal reasoning at scale.
6. Context and Prospects
Spark-Prover-X1-7B demonstrates that modern data curation strategies, integrating diverse formal and informal content with CoT-augmented state prediction, enable significant advances in the formal theorem proving abilities of 7B-parameter LLMs. The approach leverages both synthesis (auto-formalization) and reinforcement-based sharpening (GRPO), resulting in models with improved robustness on real-world and competition-grade benchmarks.
Future prospects, as suggested by ongoing work, involve further automation of data discovery and extension to more sophisticated proof environments and distributed reasoning tasks. The release of both models and datasets encourages replication, extension, and critical evaluation within the theorem proving and machine reasoning research communities (Zhou et al., 17 Nov 2025).