
DeepSeekMath 7B: Open Math Transformer

Updated 5 November 2025
  • DeepSeekMath 7B is a 7-billion-parameter transformer model that advances mathematical reasoning through domain-specific pre-training, reinforcement learning with the novel GRPO algorithm, and refined instruction fine-tuning.
  • It is pre-trained on 500B tokens centered on a highly curated, 120B-token multilingual math web corpus sourced via iterative filtering to ensure high-quality, domain-specific data.
  • Comparative benchmarks show that DeepSeekMath 7B approaches proprietary models such as GPT-4, demonstrating robust performance and efficient reinforcement learning with GRPO.

DeepSeekMath 7B is a 7-billion-parameter, open-source transformer LLM designed to advance mathematical reasoning at moderate parameter scale. Building on the DeepSeek-Coder architecture, DeepSeekMath 7B integrates large-scale, domain-specific pre-training, reinforcement learning via the novel Group Relative Policy Optimization (GRPO) algorithm, and refined instruction fine-tuning, enabling performance that approaches proprietary models such as GPT-4 and Gemini-Ultra on the most rigorous math benchmarks. The model sets a new standard for open-access mathematical reasoning, both in English and Chinese, and serves as a primary reference point for subsequent research into math-centric LLMs.

1. Architecture and Data Curation

DeepSeekMath 7B is based on a transformer backbone—inheriting from DeepSeek-Coder-Base-v1.5 7B—with 7B non-embedding parameters, a 100K-token vocabulary, and a 4K-token context window. The development pipeline centers on a highly curated, multilingual math corpus:

  • Core Dataset: 35.5M mathematical web pages amounting to 120B tokens sourced from Common Crawl, extracted via iterative filtering with a fastText classifier trained on OpenWebMath positives and diverse negatives (a minimal filtering sketch follows this list). This approach ensures high coverage of mathematical topics while minimizing off-domain contamination.
  • Additional Sources: the math web corpus makes up 56% of the pre-training mixture; the remainder comprises 4% AlgebraicStack, 10% arXiv text (included despite limited measured impact), 20% GitHub code, and 10% general natural language (EN/ZH).
  • Benchmark Decontamination: Aggressive pattern-matching ensures exclusion of problems overlapping GSM8K, MATH, CMATH, and AGIEval, reducing information leakage and evaluation inflation.
  • Total Pre-training Tokens: 500B, with an empirical advantage for code-first initialization versus purely natural language initialization.
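
As a concrete illustration of the classifier-driven filtering step above, the following is a minimal sketch using the fastText library. The training-file name, label scheme, and score threshold are hypothetical, and the actual pipeline iterates this step with newly mined positives and curated seed domains.

```python
# Minimal sketch of fastText-based math-page filtering (illustrative, not the released pipeline).
# Training file format: one page per line, "__label__math <text>" or "__label__other <text>".
import fasttext

model = fasttext.train_supervised(input="math_vs_other.train", wordNgrams=2, epoch=5)

def keep_page(text: str, threshold: float = 0.5) -> bool:
    """Return True if the classifier scores the page as mathematical content."""
    labels, probs = model.predict(text.replace("\n", " "), k=1)  # fastText predict rejects newlines
    return labels[0] == "__label__math" and probs[0] >= threshold
```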

Pre-training is followed by supervised fine-tuning (SFT) on 776K high-quality math instruction samples, employing both chain-of-thought (CoT) and program-of-thought paradigms, with explicit coverage of tool use.

2. Reinforcement Learning via Group Relative Policy Optimization (GRPO)

The model employs Group Relative Policy Optimization, a memory- and compute-efficient RL algorithm tailored for sparse-reward, long-form generation tasks such as mathematical reasoning.

  • Motivation: Classic PPO requires a separate value model of comparable size to the policy, incurring large memory demands and unstable value-function learning, especially for lengthy solutions with sparse reward signals.
  • GRPO Mechanism:
  1. For each question $q$, a group of $G$ outputs is sampled from the old policy $\pi_{\theta_{\text{old}}}$.
  2. Each output is scored by a learned reward model, yielding a reward vector $\mathbf{r} = (r_1, \ldots, r_G)$.
  3. Rewards are standardized within the group: $\tilde{r}_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$.
  4. Policy is updated using a surrogate loss with KL regularization to the SFT reference, avoiding an explicit value network:

    $$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\{o_i\}}\left[ \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \left( \min\!\left[ \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})}\,\tilde{A}_{i,t},\ \operatorname{clip}\!\left(\frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})},\, 1-\varepsilon,\, 1+\varepsilon\right)\tilde{A}_{i,t} \right] - \beta\,\mathbb{D}_{\text{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right] \right) \right]$$

  5. Process supervision: stepwise rewards can additionally be assigned at the end of each reasoning step, providing denser guidance than a single reward at the final answer.

Advantages of this approach are reduced memory requirements (no separate value model), tighter reward alignment, and increased stability due to within-group normalization. Empirically, ablations demonstrate that GRPO and iterative online RL outperform RFT and DPO baselines and enable robust learning from preference-based reward signals.
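
To make the group-relative mechanics concrete, below is a minimal PyTorch sketch of the advantage standardization and the clipped, KL-regularized surrogate loss under outcome supervision. Tensor shapes, the KL estimator, and hyperparameter values are illustrative assumptions, not the released training code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: standardize reward-model scores within each group.

    rewards: (num_questions, G) scores for the G sampled outputs per question.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)  # epsilon guards against zero-variance groups

def grpo_loss(logp_new, logp_old, logp_ref, advantages, mask, clip_eps=0.2, beta=0.04):
    """Clipped surrogate objective with a KL penalty toward the reference (SFT) policy.

    logp_*:     (B, T) per-token log-probs under the current, old, and reference policies.
    advantages: (B,) sequence-level advantages (outcome supervision), broadcast over tokens.
    mask:       (B, T) 1.0 for response tokens, 0.0 for padding.
    """
    adv = advantages.unsqueeze(-1)                      # broadcast to every token
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    # Unbiased KL estimate: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1 (assumed form).
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0
    per_token = surrogate - beta * kl
    # Average over each output's tokens, then over the batch; negate for gradient descent.
    return -((per_token * mask).sum(-1) / mask.sum(-1).clamp(min=1.0)).mean()
```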

3. Mathematical Reasoning Benchmark Performance

Standard Benchmarks

DeepSeekMath 7B achieves high performance on the most challenging public math benchmarks:

| Model | Parameters | MATH (Top-1) | GSM8K | Notes |
|---|---|---|---|---|
| GPT-4 | Proprietary | 52.9% | 92.0% | Reference |
| Gemini-Ultra | Proprietary | 53.2% | 94.4% | Reference |
| DeepSeekMath-RL 7B | 7B | 51.7% | 88.2% | Open, 7B, RL-tuned |
| InternLM2-Math 20B | 20B | 37.7% | N/A | Open |
| WizardMath-v1.1 7B | 7B | 33.0% | N/A | Open |

  • Self-Consistency (Maj@64) on MATH: 60.9% for DeepSeekMath-RL 7B (a brief majority-voting sketch follows this list).
  • Multilingual Superiority: Consistent leadership on Chinese math benchmarks due to the multilingual training corpus.
  • Architecture-Driven Gains: Initializing from a code LLM empirically yields stronger results than from a vanilla LLM, both with and without recourse to tool use.
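
For clarity on what Maj@64 measures, the following is a minimal self-consistency sketch: sample many solutions and majority-vote over their final answers. The `generate` and `extract_final_answer` callables are hypothetical stand-ins, not part of the DeepSeekMath release.

```python
from collections import Counter

def majority_vote(question: str, generate, extract_final_answer, k: int = 64) -> str:
    """Sample k chain-of-thought solutions and return the most frequent final answer."""
    answers = []
    for _ in range(k):
        solution = generate(question, temperature=0.7)   # stochastic sampling per draw
        answers.append(extract_final_answer(solution))   # e.g., parse the boxed answer
    return Counter(answers).most_common(1)[0][0]
```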

Comparative Analysis

DeepSeekMath 7B constitutes a substantial advance over prior math-specialized models (e.g., Llemma-34B, Minerva-540B), narrowing the gap with closed-source models even at 7B parameters. It further demonstrates that mathematical pre-training improves general reasoning and preserves code capability, while arXiv text, contrary to prior expectations, was found to provide negligible gains for mathematical LLMs.

4. Failure Modes and Known Limitations

Despite the progress, DeepSeekMath 7B exhibits certain bottlenecks:

  • Complexity Limitation: The model tends to plateau on problems requiring multi-step, compositional, or error-detection-and-correction reasoning, with performance constrained by its inability to monitor and revise intermediate logic.
  • RL Limitation: On highly saturated benchmarks (e.g., MATH-500), RL-based approaches predominantly "sharpen" the existing solution set, failing to discover genuinely novel solution modes, a limitation prominently exposed by the MATH-Beyond (MATH-B) benchmark (Mayilvahanan et al., 13 Oct 2025).

On MATH-B splits constructed so that base models fail even with pass@1024 sampling, DeepSeekMath 7B (as DeepSeek-R1-Qwen2.5-7B) solves no problems, delineating its true reasoning boundary. Even advanced RL-fine-tuned models (e.g., Skywork-OR1-7B) expand into previously unsolved MATH-B problems by at most 21.21%, whereas SFT or CoT-distilled students of larger teachers reach higher expansion (up to 66.38% for Qwen3-8B), indicating RL's limited capacity for expansion without stronger supervision.

5. Methodological Contributions

Beyond performance, DeepSeekMath 7B advances several methodological directions:

  • Data Curation at Scale: Demonstrates the viability and impact of classifier-driven, web-scale math data extraction, surpassing the scale and diversity of prior LLM math corpora.
  • Unified Training Loss Formulation: Presents a generalized loss-gradient framework subsuming SFT, DPO, PPO, and GRPO, with the key insight that group-based (within-group) baseline estimation suits the comparative feedback structure of math reward models:

$$\nabla_{\theta}\mathcal{J}_{\mathcal{A}}(\theta) = \mathbb{E}_{(q,o)\sim \mathcal{D}}\left[ \frac{1}{|o|}\sum_{t=1}^{|o|} GC_{\mathcal{A}}(q,o,t)\,\nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}) \right]$$

where $GC_{\mathcal{A}}$ is the method-specific gradient coefficient; a schematic comparison across methods is sketched after this list.

  • Algorithmic Efficiency: By eliminating the separate value (critic) model, GRPO substantially reduces memory and compute relative to PPO, facilitating long-horizon RL in LLMs.
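
As a schematic illustration of how the gradient coefficient $GC_{\mathcal{A}}$ changes across methods, the sketch below gives simplified coefficients for SFT, RFT, and outcome-supervised GRPO. The function names and the omission of GRPO's KL contribution are simplifications assumed for clarity, not the paper's exact formulation.

```python
# Illustrative gradient coefficients GC_A(q, o, t) under the unified gradient view.

def gc_sft() -> float:
    # Supervised fine-tuning: every demonstration token is weighted equally.
    return 1.0

def gc_rft(answer_is_correct: bool) -> float:
    # Rejection-sampling fine-tuning: only sampled outputs whose final answer
    # is correct contribute to the gradient.
    return 1.0 if answer_is_correct else 0.0

def gc_grpo_outcome(reward_i: float, group_rewards: list[float]) -> float:
    # GRPO with outcome supervision: the coefficient is the group-standardized
    # reward of the whole output (KL-regularization term omitted here).
    mean = sum(group_rewards) / len(group_rewards)
    std = max((sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)) ** 0.5, 1e-8)
    return (reward_i - mean) / std
```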

6. Impact, Availability, and Future Directions

DeepSeekMath 7B’s release (code and data: https://github.com/deepseek-ai/DeepSeek-Math) establishes an open baseline for math reasoning at moderate scale, spurring further innovation in math-focused LLM research. The architectural and algorithmic advances (e.g., code-first pretraining, GRPO, classifier-mined datasets) have influenced subsequent efforts, such as the SuperCorrect-7B system (Yang et al., 11 Oct 2024), which adapts DeepSeekMath 7B via hierarchical thought template distillation and cross-model collaborative DPO to further enhance error correction and generalization.
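
For readers who want to try the released checkpoints, here is a short usage sketch with Hugging Face Transformers. The hub id `deepseek-ai/deepseek-math-7b-rl` and the step-by-step prompt suffix are assumed from the public release and should be verified against the repository.

```python
# Hypothetical quick-start sketch; requires transformers, torch, and accelerate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-math-7b-rl"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = (
    "What is the sum of the first 100 positive integers?\n"
    "Please reason step by step, and put your final answer within \\boxed{}."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```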

Current research is prioritizing:

  • Improved error localization and correction in reasoning chains
  • Data curation strategies for broader and more balanced problem coverage
  • Advanced reward modeling for RL (with greater exploration encouragement)
  • Scaling model and context size to approach few-shot and tool-use parity with leading closed models

Summary Table: Design and Innovations

| Component | Key Feature | Impact |
|---|---|---|
| Architecture | 7B-parameter transformer, code-initialized | Efficient math reasoning |
| Data | 120B+ tokens, classifier-mined, decontaminated | High-quality, multilingual coverage |
| RL Method | GRPO (group-normalized, critic-free) | Memory/compute gains, robust RL |
| Benchmarks | MATH: 51.7%, GSM8K: 88.2% | SOTA among open 7B models |
| Algorithmic Flow | SFT → RL (GRPO), process supervision | Versatile, modular training |

DeepSeekMath 7B represents a foundational milestone in open-access mathematical reasoning research, delivering both methodological and empirical advances and highlighting the critical barriers that remain in the automated solution of complex, competition-level mathematics.
