DeepSeekMath-7B: Math Reasoning LLM

Updated 29 December 2025
  • DeepSeekMath-7B is an open-source large language model that excels in mathematical reasoning and theorem proving through innovative reinforcement learning and self-verification techniques.
  • It builds on a 7B parameter decoder-only transformer architecture, leveraging extensive domain-specific pretraining on 500 billion tokens and rigorous data curation.
  • Empirical benchmarks reveal near state-of-the-art performance in math competitions and formal proof tasks, setting a new open-source standard for mathematical LLMs.

DeepSeekMath-7B is an open-source LLM specifically optimized for mathematical reasoning, achieving near state-of-the-art results among 7-billion-parameter (7B) models through rigorous data curation, transformer architecture specialization, and reinforcement learning. Developed on the DeepSeek-Coder-Base-v1.5 7B foundation, it introduces techniques such as Group Relative Policy Optimization (GRPO) and self-verification modules, distinguishing itself as a leading math-focused LLM for both competition-level mathematics and formal theorem proving tasks (Shao et al., 5 Feb 2024, Shao et al., 27 Nov 2025).

1. Architectural Foundations and Model Specification

DeepSeekMath-7B adopts a decoder-only Transformer architecture, directly inheriting its backbone from DeepSeek-Coder-Base-v1.5. The baseline configuration encompasses approximately 7 billion parameters distributed across 32 layers, each with 32 attention heads, a hidden size of 4096, and a feed-forward projection dimension of 16,384. Both token and position embeddings inhabit a 4096-dimensional space. The vocabulary size is 100,000 tokens to accommodate a broad mathematical lexicon and symbolic expressions (Shao et al., 5 Feb 2024).
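
For concreteness, the reported specification can be captured in a small configuration object. This is a descriptive sketch only; the field names are illustrative and are not DeepSeek's actual configuration keys:

```python
from dataclasses import dataclass

@dataclass
class DeepSeekMath7BConfig:
    """Architecture figures as reported by Shao et al. (5 Feb 2024)."""
    num_layers: int = 32         # decoder-only transformer blocks
    num_heads: int = 32          # attention heads per layer
    hidden_size: int = 4096      # token/position embedding dimension
    ffn_size: int = 16_384       # feed-forward projection dimension
    vocab_size: int = 100_000    # math-heavy lexicon and symbolic expressions

    @property
    def head_dim(self) -> int:
        # per-head dimension implied by the reported figures
        return self.hidden_size // self.num_heads  # 128

print(DeepSeekMath7BConfig().head_dim)  # 128
```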

Inference and training leverage FlashAttention2 for efficient memory management and throughput. Mixed-precision (fp16) training is orchestrated by the HAI-LLM framework under the AdamW optimizer (β₁=0.9, β₂=0.95, weight_decay=0.1, multi-step learning rate schedule). Extended context windows (up to 128,000 tokens in DeepSeekMath-V2) are realized via DeepSeek sparse attention, enabling long-form mathematical reasoning (Shao et al., 27 Nov 2025). The V2 framework introduces no changes to the core transformer blocks but attaches lightweight proof-analysis modules—extra position-wise MLPs tied to output steps that generate “issue” summaries and rubric scores for self-verification.
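
A minimal PyTorch sketch of the reported optimizer setup follows. The learning rate is a placeholder (the paper uses a multi-step schedule), the model is a stand-in module, a CUDA device is assumed, and the HAI-LLM framework itself is not reproduced here:

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()  # stand-in for the transformer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,                    # placeholder; the paper uses a multi-step schedule
    betas=(0.9, 0.95),          # beta1, beta2 as reported
    weight_decay=0.1,
)
scaler = torch.cuda.amp.GradScaler()  # fp16 mixed-precision training

def training_step(batch: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(batch), targets)
    scaler.scale(loss).backward()  # scale loss to avoid fp16 gradient underflow
    scaler.step(optimizer)         # unscales gradients, then applies the update
    scaler.update()
    return loss.item()
```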

2. Data Curation, Pretraining, and RL Fine-Tuning

The DeepSeekMath-7B pretraining regimen emphasizes domain-targeted data collection and large-scale corpus construction:

  • Mathematical Corpus Extraction: The DeepSeekMath Corpus comprises 120 billion math-focused tokens mined from Common Crawl, identified with a fastText classifier trained to distinguish mathematics-rich text. The pipeline included deduplication, domain curation, reference-based seeding (OpenWebMath), and iterative recall expansion that ultimately retained 35.5 million pages, followed by strict decontamination (removing any overlap with the MATH, GSM8K, CMATH, and AGIEval benchmarks) (Shao et al., 5 Feb 2024).
  • Token Distribution: Extended pretraining on 500 billion tokens combines 56% DeepSeekMath Corpus, 4% AlgebraicStack code, 10% arXiv math papers, 20% GitHub code, and 10% general web text.
  • Supervised Fine-Tuning (SFT): 776,000 chain-of-thought, program-of-thought, and tool-integrated examples serve as intermediate optimization before RL.
  • Reinforcement Learning via GRPO: Fine-tuning employs Group Relative Policy Optimization (GRPO), a PPO variant that omits the value critic in favor of group-relative advantage normalization, reducing memory consumption by roughly 50% compared to PPO. The approach uses process supervision (rewards on intermediate steps) for faster convergence and better step-level reasoning (Shao et al., 5 Feb 2024); a minimal sketch of the group-relative advantage follows this list.
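
The heart of GRPO is the group-relative advantage noted above: rewards for a group of completions sampled from the same prompt are normalized by the group's own mean and standard deviation, standing in for PPO's learned value baseline. A minimal sketch (the full GRPO objective also includes the clipped importance ratio and a KL penalty, omitted here):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for one prompt.

    rewards: shape (G,), one scalar reward per sampled completion.
    Each completion's advantage is its reward standardized against the
    group's own statistics -- no learned value critic is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled solutions to one math problem, reward = correctness.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.5])
print(grpo_advantages(rewards))  # correct solutions receive positive advantage
```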

The iterative data selection pipeline, alongside rigorous decontamination and arXiv-focused supplementation, allows DeepSeekMath-7B to surpass models trained purely on MathPile or Proof-Pile-2 corpora.
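
The fastText filtering stage of that pipeline can be illustrated with the library's standard supervised API. The training file name, label scheme, and retention threshold below are assumptions for the sketch, not values from the paper:

```python
import fasttext

# Seed file with one labeled example per line, e.g.:
#   __label__math   Let f: [0,1] -> R be a continuous function ...
#   __label__other  Top ten travel destinations for 2024 ...
model = fasttext.train_supervised(input="seed_labeled.txt")

def is_mathy(page_text: str, threshold: float = 0.9) -> bool:
    # fastText expects single-line input; predict returns (labels, probs)
    labels, probs = model.predict(page_text.replace("\n", " "))
    return labels[0] == "__label__math" and probs[0] >= threshold
```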

3. Self-Verification and Theorem-Proving Mechanisms (DeepSeekMath-V2)

DeepSeekMath-V2 extends the core model with self-verifiable mathematical reasoning capabilities (Shao et al., 27 Nov 2025):

  • Verifier Model: Trained with policy gradients on annotated datasets $\mathcal{D}_v = \{(X_i, Y_i, s_i)\}$, the verifier $\pi_\phi$ outputs both free-text issue summaries and rubric scores $s \in \{0, 0.5, 1\}$. The loss objective combines reward for accurate scoring and correct formatting, augmented by meta-verification from a second verifier ($\pi_\eta$) on the verifier's output.
  • Proof Generator: The generator $\pi_\theta$ is trained to simultaneously produce proofs and self-analyses. Rewards combine the verifier's score ($R_Y = s$) and a meta-scored self-analysis ($R_Z$), aggregated as $R = R_{\mathrm{format}}(Y, Z)\,(\alpha R_Y + \beta R_Z)$ with $\alpha = 0.76$, $\beta = 0.24$.
  • Refinement Loop: At inference, the model refines its output by recursively re-prompting itself based on detected errors, repeating until proofs are self-corroborated (“\boxed{1}”) or context limits are reached.
  • High-Compute Search: At test time, a pool of 64 proof candidates per problem is maintained; each undergoes 64 independent verification passes, and refinement continues for up to 16 iterations, probing the generator–verifier gap through adversarial evaluation. A schematic sketch of the refinement loop follows this list.
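
To make the reward aggregation concrete: with $R_{\mathrm{format}} = 1$, $R_Y = 1$, and $R_Z = 0.5$, the combined reward is $1 \cdot (0.76 \cdot 1 + 0.24 \cdot 0.5) = 0.88$. The refinement loop itself can be sketched as below; generate_proof, verify, and refine are hypothetical stand-ins for the paper's generator and verifier calls, not a real API:

```python
ALPHA, BETA = 0.76, 0.24   # reward weights reported in the paper
MAX_ITERS = 16             # refinement budget reported for test-time search

def combined_reward(r_format: float, r_y: float, r_z: float) -> float:
    # R = R_format(Y, Z) * (alpha * R_Y + beta * R_Z)
    return r_format * (ALPHA * r_y + BETA * r_z)

def refine_until_verified(problem, generate_proof, verify, refine):
    """Recursively re-prompt the generator with the verifier's detected issues."""
    proof, analysis = generate_proof(problem)
    for _ in range(MAX_ITERS):
        score, issues = verify(problem, proof)  # rubric score in {0, 0.5, 1}
        if score == 1.0:                        # self-corroborated: \boxed{1}
            return proof
        proof, analysis = refine(problem, proof, issues)
    return proof  # iteration/context budget exhausted
```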

This framework enables iterative probing for logical rigor and stepwise revisiting of incomplete/incorrect arguments, directly addressing pitfalls of answer-only reward RL paradigms.

4. Empirical Results and Benchmark Standing

Empirical evaluation demonstrates DeepSeekMath-7B’s capabilities across standard and competition-level datasets (Shao et al., 5 Feb 2024, Shao et al., 27 Nov 2025):

| Metric / Dataset | DeepSeekMath-7B | DeepSeekMath-V2 |
|---|---|---|
| MATH (competition) | 51.7% (top-1 CoT) / 60.9% (self-consistency over 64 samples) | – |
| GSM8K | 88.2% (CoT RL) | – |
| Competition results | – | Gold-level: IMO 2025 (5/6), CMO 2024 (4/6 + 1 partial), Putnam 2024 (118/120; top human score: 90) |
| IMO-ProofBench | – | 85% (basic), 65% (advanced subset) |
| CNML (in-house) | – | Outperforms GPT-5-Thinking-High and Gemini 2.5-Pro (+10–15 verifier points) |
| miniF2F (proof reward) | 31.4% | – |
| Pass@1 (IMO Shortlist) | – | climbs from ~0.3 to ~0.7 with the self-verification loop |
| Best@32 (IMO Shortlist) | – | ~0.85 |

Ablation studies reveal a ~12-point drop on IMO when meta-verification is removed, and a Putnam score decrease from 118 to 104 if generator self-analysis is disabled.

5. Comparative Benchmarks and Extensions

DeepSeekMath-7B achieves top-1 accuracy that rivals closed models: 51.7% (no external tools or voting) on MATH, compared to GPT-4 at 52.9% and Gemini-Ultra at 53.2%. Under self-consistency sampling (64 votes), it closely matches GPT-4 and Gemini-Ultra.

Related methods such as SuperCorrect apply hierarchical distillation and cross-model Direct Preference Optimization (DPO) to DeepSeekMath-7B, raising MATH accuracy from 46.8% (CoT) to 54.6% (a 7.8-point absolute gain) and GSM8K from 82.9% to 88.2%. SuperCorrect's two-stage process infuses micro- and macro-level reasoning templates and exposes the model to error-driven self-correction, yielding more stable and nuanced mathematical reasoning (Yang et al., 11 Oct 2024).

6. Strengths, Limitations, and Future Directions

DeepSeekMath-7B’s design highlights:

  • The significance of scale and quality in mathematical corpora for emergent mathematical capabilities—carefully filtered data outperforms sheer token quantity.
  • The efficacy of code-first pretraining, particularly for symbolic and program-augmented reasoning.
  • The practical memory advantages of GRPO, enabling higher RL batch sizes without a value critic.
  • State-of-the-art empirical performance in gold-medal math competitions and formal proof tasks, especially when augmented with V2’s self-verification framework.

Identified weaknesses and open challenges include limited generalization in advanced geometry and formal proof skills relative to large closed-source models, modest gains from few-shot scaling, and the need for reward models with better uncertainty estimation and out-of-distribution generalization. Incorporating richer geometric data and domain-adaptive templates represents a plausible avenue for improvement (Shao et al., 5 Feb 2024, Shao et al., 27 Nov 2025, Yang et al., 11 Oct 2024).

7. Impact and Significance in the Mathematical LLM Landscape

DeepSeekMath-7B stands as a reference system for open-source mathematical LLM research, demonstrating that public data, transparent optimization, and reinforcement learning innovations can close the performance gap with proprietary models. The model provides a benchmark for subsequent research on mathematical data filtering, RL algorithms like GRPO, and formal theorem-proving integration (Shao et al., 5 Feb 2024, Shao et al., 27 Nov 2025). Its V2 iteration advances the paradigm of self-verifiable reasoning, facilitating more rigorous, stepwise proofs and automated verification—contributing foundational methodologies for both LLM-driven mathematical research and broader scientific AI applications.
