Kimina-Prover Preview 72B
- Kimina-Prover Preview 72B is a large language model-based formal theorem prover for Lean 4 that unifies informal intuition with formal proof code.
- It employs a multi-stage reinforcement learning pipeline on a 72-billion parameter Qwen2.5 backbone, integrating high-level reasoning with verified Lean outputs.
- The model achieves state-of-the-art performance on benchmarks like miniF2F-test, demonstrating scalable and sample-efficient automated formal reasoning.
Kimina-Prover-Preview-72B is an LLM-based formal theorem prover designed for Lean 4, introduced as a preview release accompanying the paper "Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning" (2504.11354). Employing a 72-billion parameter Qwen2.5 backbone and a distinctive formal reasoning pattern, Kimina-Prover-Preview-72B leverages a multi-stage reinforcement learning pipeline to achieve state-of-the-art performance on rigorous formal mathematics benchmarks. The system’s architecture, training paradigm, and emergent reasoning style reflect a significant shift from traditional theorem-proving approaches, emphasizing the integration of informal intuition, formal proof code, and scalable, sample-efficient automated reasoning.
1. Model Architecture and Training Pipeline
Kimina-Prover-Preview-72B is built upon the Qwen2.5-72B autoregressive transformer architecture and departs from conventional tree-search or stepwise proof state expansion. Instead, it embraces an internal, open-ended reasoning-driven exploration paradigm. The model is initially fine-tuned in a supervised fashion using a large, curated formal mathematics dataset, followed by reinforcement learning (RL) with a reward provided by the Lean 4 proof assistant.
The RL phase operates as follows:
- For each problem, the model samples multiple long-form outputs (k = 8 rollouts per batch).
- Each candidate solution's final Lean 4 proof script is checked for correctness using a high-throughput Lean verification backend.
- The reward is binary (1 if the proof is accepted by Lean; otherwise 0).
- Training leverages a KL-regularized objective,

$$\mathcal{J}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big),$$

with KL coefficient $\beta$ against the reference policy $\pi_{\mathrm{ref}}$, and formatted output constraints ensuring consistent code-reasoning pattern alignment.
Stabilization methods include enforcing that at least one tactic block appears per output, requiring that ≥60% of the code snippets generated in the thinking block be reused in the final proof, and stochastically filtering negative-gradient samples. Early RL phases mix in informal mathematics data to better guide reasoning. The RL infrastructure is supported by the Numina Lean Server, which achieves verification throughput of up to 100 proofs/sec on large CPU clusters.
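The following sketch illustrates this verify-and-reward loop in Python, under stated assumptions: `sample_rollouts` and `lean_verify` are hypothetical stand-ins for the model's sampler and a client for the Lean verification backend, neither of which is specified in this preview.

```python
K_ROLLOUTS = 8  # k = 8 long-form rollouts sampled per problem

def reward(output: str, lean_verify) -> float:
    """Binary reward: 1 iff the final Lean 4 script is accepted by Lean."""
    # The final proof is whatever follows the closing </think> tag.
    _, _, proof = output.partition("</think>")
    if ":= by" not in proof:  # stabilization: require at least one tactic block
        return 0.0
    return 1.0 if lean_verify(proof) else 0.0

def rl_step(problem, sample_rollouts, lean_verify):
    """One scoring step: sample k candidates and check each final proof."""
    rollouts = sample_rollouts(problem, k=K_ROLLOUTS)  # hypothetical sampler
    return [(y, reward(y, lean_verify)) for y in rollouts]
```

Note that the KL penalty in the objective above acts on the policy-gradient update, not on the reward itself; only the binary Lean signal is shown here.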
2. Formal Reasoning Pattern and Output Structure
Central to Kimina-Prover-Preview-72B is the formal reasoning pattern, a structured output template interleaving informal mathematical thinking and formal Lean 4 code. Each response generally follows:
- Thinking Block: A section enclosed in `<think> ... </think>` tags in which the model articulates its high-level strategy, decomposes the problem, and sketches intermediate reasoning steps. These may include partially formalized Lean code fragments.
- Lean Code Snippets: Intermediate steps within the thinking block use Lean syntax. Most snippets are reused in the definitive proof.
- Final Proof Assembly: After the thinking block, the complete Lean 4 proof script is assembled, typically integrating fragments and tactics developed during the reasoning phase.
This format contrasts sharply with stepwise search-based provers, which generate sequences of atomic tactic applications or tree-exploring search histories. Instead, Kimina-Prover's long-form outputs unify problem analysis, reflection, decomposition, and formal proof planning in a single human-comprehensible response. The format aids both explainability and educational value, as users can trace informal ideas to verified formal arguments.
Example (from (2504.11354)):
```
<think>
First, let's think about the structure...
Let's formalize this in Lean 4:
have hd : d = 15 / 2 := by linarith
have ha : a = -15 := by linarith [h₀, hd]
linarith [ha, hd]
</think>
theorem mathd_algebra_354 (a d : ℝ) (h₀ : a + 6 * d = 30)
    (h₁ : a + 10 * d = 60) : a + 20 * d = 135 := by
  have hd : d = 15 / 2 := by linarith
  have ha : a = -15 := by linarith [h₀, hd]
  linarith [ha, hd]
```
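A minimal Python sketch of how such an output could be split into its reasoning and proof parts, and how the ≥60% snippet-reuse constraint could be approximated (the exact matching procedure used during training is not specified here; the line-level heuristic below is an assumption):

```python
import re

def split_output(output: str) -> tuple[str, str]:
    """Split a response into (thinking block, final Lean 4 proof), using the
    <think>...</think> delimiters shown in the example above."""
    m = re.search(r"<think>(.*?)</think>\s*(.*)", output, re.DOTALL)
    if m is None:
        raise ValueError("output does not follow the formal reasoning pattern")
    return m.group(1).strip(), m.group(2).strip()

def snippet_reuse(output: str) -> float:
    """Crude proxy for snippet reuse: the fraction of tactic lines sketched in
    the thinking block that reappear verbatim in the final proof."""
    thinking, proof = split_output(output)
    tactics = [ln.strip() for ln in thinking.splitlines()
               if ln.strip().startswith(("have ", "linarith", "nlinarith"))]
    return sum(ln in proof for ln in tactics) / len(tactics) if tactics else 0.0
```

Applied to the example above, all three tactic lines from the thinking block reappear in the final proof, giving a reuse ratio of 1.0.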
3. Performance on Formal Benchmarks
Kimina-Prover-Preview-72B achieves new state-of-the-art results on formal mathematics benchmarks, most notably the miniF2F-test, a widely used Lean 4 benchmark suite:
| Model | Pass@1 | Pass@32 | Pass@1024 | Pass@8192 |
|---|---|---|---|---|
| Kimina-Prover-Preview-72B | 52.9% | 68.9% | 77.9% | 80.7% |
| Kimina-Prover-7B | 52.5% | 63.1% | 70.8% | — |
| Kimina-Prover-1.5B | 42.6% | 56.2% | 61.9% | — |
| BFS-Prover (7B) | — | — | — | 70.8% |
| DeepSeek Prover V1.5 RL | — | — | — | 60.2% |
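Here pass@k denotes the probability that at least one of $k$ sampled candidate proofs is accepted by Lean. When $n \ge k$ candidates are drawn per problem and $c$ of them verify, a standard unbiased estimator (introduced by Chen et al. for code generation; whether this table reports that estimator or raw solve rates at budget $k$ is not stated here) is

$$\widehat{\text{pass@}k} \;=\; \mathop{\mathbb{E}}_{\text{problems}}\!\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right].$$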
Performance scales systematically with model size, a property not previously observed in neural theorem provers for formal mathematics. Notably, the model demonstrates strong sample efficiency, solving 52.94% of miniF2F-test at pass@1 (one sample per problem), outperforming previous stepwise or search-based approaches, which require extensive search or sampling for similar accuracy.
4. Scalability and Model Distillation
The scaling trend is evident in the pass@k metrics: as model size increases from 1.5B to 7B to 72B parameters, accuracy rises substantially at every sample count. This establishes that larger LLMs, when equipped with appropriate RL and data pipelines, yield stronger formal reasoning capabilities, in contrast with earlier findings that scaling alone did not improve state-of-the-art neural theorem provers.
To address resource constraints, Kimina-Prover-Preview-72B is distilled into compact 1.5B and 7B models using supervision on rollouts generated by the main model. These distilled versions retain much of the performance gain, offering scalable solutions for diverse research and deployment scenarios.
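A minimal sketch of this distillation recipe, assuming hypothetical helpers `sample_teacher` (rollout generation from the 72B model) and `lean_verify` (proof checking), since the released data format and trainer are not detailed here:

```python
import json

def build_distill_set(problems, sample_teacher, lean_verify, k=8,
                      out_path="distill_sft.jsonl"):
    """Collect Lean-verified teacher rollouts as supervised fine-tuning data
    for a smaller student model."""
    with open(out_path, "w") as f:
        for prob in problems:
            for rollout in sample_teacher(prob, k=k):  # hypothetical sampler
                _, _, proof = rollout.partition("</think>")
                if lean_verify(proof):  # keep only verified proofs
                    f.write(json.dumps({"prompt": prob,
                                        "completion": rollout}) + "\n")
    return out_path
```

The student is then fine-tuned on this corpus, so the distilled 1.5B and 7B models inherit the teacher's formal reasoning pattern along with much of its accuracy.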
5. Comparison with Other Formal Reasoning Models and Approaches
Kimina-Prover-Preview-72B represents a conceptual shift in formal reasoning model design:
- Versus Stepwise Search Models: Prior systems such as BFS-Prover or stepwise tree searchers rely on the explicit expansion of proof search trees using learned critics and tactic predictors. These approaches can be sample-inefficient, prone to tactical bias, and generate longer, more redundant proofs.
- Versus Whole-Proof and Chain-of-Thought Systems: Unlike generic chain-of-thought (CoT) methods or purely whole-proof synthesis, Kimina-Prover introduces structured internal exploration, continuous refinement, and human-like decomposition, reflected in the format of outputs and verification success rates.
Emergent behaviors—such as long-range planning, reflective reasoning, and effective tactic reuse—are attributed to the fusion of RL training, formal reasoning pattern enforcement, and use of Lean 4’s verification feedback.
6. Implications for Formal Verification Practice and Mathematical AI
The Kimina-Prover-Preview-72B paradigm offers several implications and opportunities:
- Bridging Informal and Formal Mathematics: The reasoning style supports transparency and provides a foundation for aligning informal mathematical intuition with formal proof scripts, addressing a key barrier in mathematical AI.
- Automation of Formalization and Verification: Consistent production of valid Lean 4 code creates a path toward scalable, automated mathematics formalization and machine-checked mathematical research.
- Human-AI Collaboration: The combination of readable informal reasoning and formally verified output in a single artifact enhances educational value, facilitates interactive proof development, and opens AI-in-the-loop research workflows.
- Research Integration and Tooling: The outputs are designed for compatibility with future enhancements such as library search, computational plugins, or iterative proof improvement driven by Lean’s feedback mechanisms.
7. Technical Resources and Open Source Availability
Kimina-Prover-Preview-72B and its distilled variants are provided publicly to facilitate research and application development:
- Distilled models: https://github.com/MoonshotAI/Kimina-Prover-Preview
- Autoformalizer companion model: https://huggingface.co/AI-MO/Kimina-Autoformalizer-7B
These releases include prompt templates, architecture specifications, and exemplar output, providing researchers with tools to replicate, adapt, or extend the system in new domains.
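As an illustration, the checkpoints can be loaded with the standard Hugging Face transformers API; the snippet below uses the autoformalizer repository listed above, and the prompt and generation settings are illustrative rather than the release's recommended configuration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AI-MO/Kimina-Autoformalizer-7B"  # repository listed above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Illustrative prompt; consult the released prompt templates for the exact
# input format each checkpoint expects.
prompt = "Formalize in Lean 4: the sum of two even integers is even."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```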
Kimina-Prover-Preview-72B advances the field of automated formal reasoning by unifying large-scale autoregressive modeling, reinforcement learning, and human-inspired structured reasoning. Its benchmark results, sample efficiency, and alignment of informal and formal mathematical skills define a new standard for deployable and research-grade formal mathematics assistants.