
DeepSeekMath 7B: Open Math Transformer

Updated 5 November 2025
  • DeepSeekMath 7B is a 7-billion-parameter transformer model that advances mathematical reasoning through domain-specific pre-training, reinforcement learning with the novel GRPO algorithm, and refined instruction fine-tuning.
  • It is pre-trained on 500B tokens centered on a highly curated, 120B-token multilingual math web corpus sourced via iterative filtering to ensure high-quality, domain-specific data.
  • Comparative benchmarks show that DeepSeekMath 7B approaches proprietary models such as GPT-4, demonstrating robust performance and efficient reinforcement learning with GRPO.

DeepSeekMath 7B is a 7-billion-parameter, open-source transformer LLM designed to advance mathematical reasoning at moderate parameter scale. Building on the DeepSeek-Coder architecture, DeepSeekMath 7B integrates large-scale, domain-specific pre-training, reinforcement learning via the novel Group Relative Policy Optimization (GRPO) algorithm, and refined instruction fine-tuning, enabling performance that approaches proprietary models such as GPT-4 and Gemini-Ultra on the most rigorous math benchmarks. The model sets a new standard for open-access mathematical reasoning, both in English and Chinese, and serves as a primary reference point for subsequent research into math-centric LLMs.

1. Architecture and Data Curation

DeepSeekMath 7B is based on a transformer backbone—inheriting from DeepSeek-Coder-Base-v1.5 7B—with 7B non-embedding parameters, a 100K-token vocabulary, and a 4K-token context window. The development pipeline centers on a highly curated, multilingual math corpus:

  • Core Dataset: 35.5M mathematical web pages amounting to 120B tokens sourced from Common Crawl, extracted via iterative filtering with a fastText classifier trained on OpenWebMath positives and diverse negatives (a minimal filtering sketch follows this list). This approach ensures high coverage of mathematical topics while minimizing off-domain contamination.
  • Additional Sources: the math web corpus makes up 56% of the pre-training mixture; the remainder comprises 4% AlgebraicStack, 10% arXiv text (included despite limited measured impact), 20% GitHub code, and 10% general natural language (EN/ZH).
  • Benchmark Decontamination: Aggressive pattern-matching ensures exclusion of problems overlapping GSM8K, MATH, CMATH, and AGIEval, reducing information leakage and evaluation inflation.
  • Total Pre-training Tokens: 500B, with an empirical advantage for code-first initialization versus purely natural language initialization.
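
As a concrete illustration of the classifier-driven filtering step above, the following is a minimal sketch using the fastText library. The training-file name, label scheme, and score threshold are hypothetical, and the actual pipeline iterates this step with newly mined positives and curated seed domains.

```python
# Minimal sketch of fastText-based math-page filtering (illustrative, not the released pipeline).
# Training file format: one page per line, "__label__math <text>" or "__label__other <text>".
import fasttext

model = fasttext.train_supervised(input="math_vs_other.train", wordNgrams=2, epoch=5)

def keep_page(text: str, threshold: float = 0.5) -> bool:
    """Return True if the classifier scores the page as mathematical content."""
    labels, probs = model.predict(text.replace("\n", " "), k=1)  # fastText predict rejects newlines
    return labels[0] == "__label__math" and probs[0] >= threshold
```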

Pre-training is followed by supervised fine-tuning (SFT) on 776K high-quality math instruction samples, employing both chain-of-thought (CoT) and program-of-thought paradigms, with explicit coverage of tool use.

2. Reinforcement Learning via Group Relative Policy Optimization (GRPO)

The model employs Group Relative Policy Optimization, a memory- and compute-efficient RL algorithm tailored for sparse-reward, long-form generation tasks such as mathematical reasoning.

  • Motivation: Classic PPO requires a separate value model of comparable size to the policy, incurring large memory demands and unstable value-function learning, especially for lengthy solutions with sparse reward signals.
  • GRPO Mechanism:
  1. For each question $q$, a group of $G$ outputs is sampled from the old policy $\pi_{\theta_{\text{old}}}$.
  2. Each output is scored by a learned reward model, yielding a reward vector $\mathbf{r} = (r_1, \ldots, r_G)$.
  3. Rewards are standardized within the group: $\tilde{r}_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$.
  4. Policy is updated using a surrogate loss with KL regularization to the SFT reference, avoiding an explicit value network:

    $$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\{o_i\}}\left[ \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \left( \min\!\left[ \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})}\,\tilde{A}_{i,t},\ \operatorname{clip}\!\left(\frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})},\, 1-\varepsilon,\, 1+\varepsilon\right)\tilde{A}_{i,t} \right] - \beta\,\mathbb{D}_{\text{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right] \right) \right]$$

  5. Process supervision: stepwise rewards can additionally be assigned at the end of each reasoning step, providing denser guidance than a single reward at the final answer.

Advantages of this approach are reduced memory requirements (no separate value model), tighter reward alignment, and increased stability due to within-group normalization. Empirically, ablations demonstrate that GRPO and iterative online RL outperform RFT and DPO baselines and enable robust learning from preference-based reward signals.
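
To make the group-relative mechanics concrete, below is a minimal PyTorch sketch of the advantage standardization and the clipped, KL-regularized surrogate loss under outcome supervision. Tensor shapes, the KL estimator, and hyperparameter values are illustrative assumptions, not the released training code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: standardize reward-model scores within each group.

    rewards: (num_questions, G) scores for the G sampled outputs per question.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)  # epsilon guards against zero-variance groups

def grpo_loss(logp_new, logp_old, logp_ref, advantages, mask, clip_eps=0.2, beta=0.04):
    """Clipped surrogate objective with a KL penalty toward the reference (SFT) policy.

    logp_*:     (B, T) per-token log-probs under the current, old, and reference policies.
    advantages: (B,) sequence-level advantages (outcome supervision), broadcast over tokens.
    mask:       (B, T) 1.0 for response tokens, 0.0 for padding.
    """
    adv = advantages.unsqueeze(-1)                      # broadcast to every token
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    # Unbiased KL estimate: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1 (assumed form).
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0
    per_token = surrogate - beta * kl
    # Average over each output's tokens, then over the batch; negate for gradient descent.
    return -((per_token * mask).sum(-1) / mask.sum(-1).clamp(min=1.0)).mean()
```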

3. Mathematical Reasoning Benchmark Performance

Standard Benchmarks

DeepSeekMath 7B achieves high performance on the most challenging public math benchmarks:

| Model | Parameters | MATH (Top-1) | GSM8K | Notes |
|---|---|---|---|---|
| GPT-4 | Proprietary | 52.9% | 92.0% | Reference |
| Gemini-Ultra | Proprietary | 53.2% | 94.4% | Reference |
| DeepSeekMath-RL 7B | 7B | 51.7% | 88.2% | Open, 7B, RL-tuned |
| InternLM2-Math 20B | 20B | 37.7% | N/A | Open |
| WizardMath-v1.1 7B | 7B | 33.0% | N/A | Open |

  • Self-Consistency (Maj@64) on MATH: 60.9% for DeepSeekMath-RL 7B (a brief majority-voting sketch follows this list).
  • Multilingual Superiority: Consistent leadership on Chinese math benchmarks due to the multilingual training corpus.
  • Architecture-Driven Gains: Initializing from a code LLM empirically yields stronger results than from a vanilla LLM, both with and without recourse to tool use.
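
For clarity on what Maj@64 measures, the following is a minimal self-consistency sketch: sample many solutions and majority-vote over their final answers. The `generate` and `extract_final_answer` callables are hypothetical stand-ins, not part of the DeepSeekMath release.

```python
from collections import Counter

def majority_vote(question: str, generate, extract_final_answer, k: int = 64) -> str:
    """Sample k chain-of-thought solutions and return the most frequent final answer."""
    answers = []
    for _ in range(k):
        solution = generate(question, temperature=0.7)   # stochastic sampling per draw
        answers.append(extract_final_answer(solution))   # e.g., parse the boxed answer
    return Counter(answers).most_common(1)[0][0]
```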

Comparative Analysis

DeepSeekMath 7B constitutes a substantial advance over prior math-specialized models (e.g., Llemma-34B, Minerva-540B), narrowing the gap with closed-source models even at 7B parameters. It further demonstrates that mathematical pre-training improves general reasoning and preserves code capability, while arXiv text, contrary to prior expectations, was found to provide negligible gains for mathematical LLMs.

4. Failure Modes and Known Limitations

Despite the progress, DeepSeekMath 7B exhibits certain bottlenecks:

  • Complexity Limitation: The model tends to plateau on problems requiring multi-step, compositional, or error-detection-and-correction reasoning, with performance constrained by its inability to monitor and revise intermediate logic.
  • RL Limitation: On highly saturated benchmarks (e.g., MATH-500), RL-based approaches predominantly "sharpen" the existing solution set, failing to discover genuinely novel solution modes, a limitation prominently exposed by the MATH-Beyond (MATH-B) benchmark (Mayilvahanan et al., 13 Oct 2025).

On MATH-B splits constructed so that base models fail even with pass@1024 sampling, DeepSeekMath 7B (as DeepSeek-R1-Qwen2.5-7B) solves no problems, delineating its true reasoning boundary. Even advanced RL-fine-tuned models (e.g., Skywork-OR1-7B) expand into previously unsolved MATH-B problems by at most 21.21%, whereas SFT or CoT-distilled students of larger teachers reach higher expansion (up to 66.38% for Qwen3-8B), indicating RL's limited capacity for expansion without stronger supervision.

5. Methodological Contributions

Beyond performance, DeepSeekMath 7B advances several methodological directions:

  • Data Curation at Scale: Demonstrates the viability and impact of classifier-driven, web-scale math data extraction, surpassing the scale and diversity of prior LLM math corpora.
  • Unified Training Loss Formulation: Presents a generalized loss-gradient framework subsuming SFT, DPO, PPO, and GRPO, with the key insight that group-based (within-group) baseline estimation suits the comparative feedback structure of math reward models:

$$\nabla_{\theta}\mathcal{J}_{\mathcal{A}}(\theta) = \mathbb{E}_{(q,o)\sim \mathcal{D}}\left[ \frac{1}{|o|}\sum_{t=1}^{|o|} GC_{\mathcal{A}}(q,o,t)\,\nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}) \right]$$

where $GC_{\mathcal{A}}$ is the method-specific gradient coefficient; a schematic comparison across methods is sketched after this list.

  • Algorithmic Efficiency: By eliminating the separate value (critic) model, GRPO substantially reduces memory and compute relative to PPO, facilitating long-horizon RL in LLMs.
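
As a schematic illustration of how the gradient coefficient $GC_{\mathcal{A}}$ changes across methods, the sketch below gives simplified coefficients for SFT, RFT, and outcome-supervised GRPO. The function names and the omission of GRPO's KL contribution are simplifications assumed for clarity, not the paper's exact formulation.

```python
# Illustrative gradient coefficients GC_A(q, o, t) under the unified gradient view.

def gc_sft() -> float:
    # Supervised fine-tuning: every demonstration token is weighted equally.
    return 1.0

def gc_rft(answer_is_correct: bool) -> float:
    # Rejection-sampling fine-tuning: only sampled outputs whose final answer
    # is correct contribute to the gradient.
    return 1.0 if answer_is_correct else 0.0

def gc_grpo_outcome(reward_i: float, group_rewards: list[float]) -> float:
    # GRPO with outcome supervision: the coefficient is the group-standardized
    # reward of the whole output (KL-regularization term omitted here).
    mean = sum(group_rewards) / len(group_rewards)
    std = max((sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)) ** 0.5, 1e-8)
    return (reward_i - mean) / std
```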

6. Impact, Availability, and Future Directions

DeepSeekMath 7B’s release (code and data: https://github.com/deepseek-ai/DeepSeek-Math) establishes an open baseline for math reasoning at moderate scale, spurring further innovation in math-focused LLM research. The architectural and algorithmic advances (e.g., code-first pretraining, GRPO, classifier-mined datasets) have influenced subsequent efforts, such as the SuperCorrect-7B system (Yang et al., 11 Oct 2024), which adapts DeepSeekMath 7B via hierarchical thought template distillation and cross-model collaborative DPO to further enhance error correction and generalization.
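
For readers who want to try the released checkpoints, here is a short usage sketch with Hugging Face Transformers. The hub id `deepseek-ai/deepseek-math-7b-rl` and the step-by-step prompt suffix are assumed from the public release and should be verified against the repository.

```python
# Hypothetical quick-start sketch; requires transformers, torch, and accelerate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-math-7b-rl"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = (
    "What is the sum of the first 100 positive integers?\n"
    "Please reason step by step, and put your final answer within \\boxed{}."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```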

Current research is prioritizing:

  • Improved error localization and correction in reasoning chains
  • Data curation strategies for broader and more balanced problem coverage
  • Advanced reward modeling for RL (with greater exploration encouragement)
  • Scaling model and context size to approach few-shot and tool-use parity with leading closed models

Summary Table: Design and Innovations

| Component | Key Feature | Impact |
|---|---|---|
| Architecture | 7B-parameter transformer, code-initialized | Efficient math reasoning |
| Data | 120B+ tokens, classifier-mined, decontaminated | High-quality, multilingual coverage |
| RL Method | GRPO (group-normalized, critic-free) | Memory/compute gains, robust RL |
| Benchmarks | MATH: 51.7%, GSM8K: 88.2% | SOTA among open 7B models |
| Algorithmic Flow | SFT → RL (GRPO), process supervision | Versatile, modular training |

DeepSeekMath 7B represents a foundational milestone in open-access mathematical reasoning research, delivering both methodological and empirical advances and highlighting the critical barriers that remain in the automated solution of complex, competition-level mathematics.
