DeepSeekMath 7B: Advanced Math LLM
- DeepSeekMath 7B is a 7-billion-parameter open-source model optimized for advanced mathematical reasoning with a curated math corpus and integrated code data.
- It employs Group Relative Policy Optimization (GRPO) to stabilize training, reduce GPU memory usage, and enhance logical inference.
- Empirical results show 51.7% chain-of-thought accuracy and 60.9% self-consistency on the MATH benchmark, marking strong competition-level performance.
DeepSeekMath 7B is an open-source, 7-billion-parameter LLM specifically optimized for advanced mathematical reasoning. Developed via continued pre-training of a code-oriented LLM (DeepSeek-Coder-Base-v1.5 7B), DeepSeekMath 7B integrates dedicated mathematical corpora with natural language and code data, and employs Group Relative Policy Optimization (GRPO), a sample-efficient reinforcement learning method, to enhance mathematical inference while reducing training memory and compute requirements. The model achieves leading open-source performance on competition-level math benchmarks, approaching the capabilities of state-of-the-art proprietary systems.
1. Model Architecture and Training Corpus
DeepSeekMath 7B builds on the DeepSeek-Coder-Base-v1.5 7B Transformer backbone, inheriting its architectural design and the benefits of extensive code pre-training. It undergoes continued pre-training on a curated mixture of mathematical, code, and natural language content. The aggregate pre-training data totals 500B tokens, with the breakdown as follows:
- 56% from the DeepSeekMath Corpus (120B tokens), mined from Common Crawl and filtered for mathematical relevance using an iterative fastText classifier and human-in-the-loop refinement.
- 20% from GitHub code repositories, maintaining programming proficiency.
- 10% from arXiv scientific papers.
- 10% from Common Crawl natural language (English/Chinese).
- 4% from AlgebraicStack.
The DeepSeekMath Corpus spans elementary to advanced mathematical topics, leveraging iterative classifier training (fastText) and progressive human refinement to maximize data quality and domain coverage. Training follows the standard autoregressive (causal language modeling) objective.
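To make the mixture concrete, the following minimal sketch shows how such proportions might drive per-document source sampling during continued pre-training. The weights mirror the breakdown above; the function and source names are illustrative assumptions, not the actual training pipeline.

```python
import random

# Mixture weights from the breakdown above (fractions of the 500B-token budget).
# Source names are illustrative; the real pipeline is not public.
MIXTURE = {
    "deepseekmath_corpus": 0.56,
    "github_code": 0.20,
    "arxiv": 0.10,
    "common_crawl_nl": 0.10,
    "algebraic_stack": 0.04,
}

def sample_source(rng: random.Random) -> str:
    """Pick the source of the next training document in proportion
    to the corpus mixture weights."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
print(sample_source(rng))  # e.g. "deepseekmath_corpus"
```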
2. Methodological Innovations
DeepSeekMath 7B introduces two principal methodological advances:
- Meticulously engineered math corpus selection: The data selection pipeline starts from math-specific seed sets (e.g., OpenWebMath), iteratively updates a fastText classifier for math detection, and incorporates manual annotation at the URL/domain level to ensure high precision and recall for mathematical documents within Common Crawl.
- Group Relative Policy Optimization (GRPO): A variant of canonical Proximal Policy Optimization (PPO), GRPO enables reinforcement learning without a value network (critic). For each math question, a group of candidate completions is sampled; relative rewards are computed within the group, and gradient updates are weighted accordingly. This group-wise baseline stabilizes optimization and substantially reduces the GPU memory footprint relative to PPO. The objective also incorporates a KL divergence regularizer for stability; a schematic form is given below.
These innovations together target superior mathematical reasoning capacity and efficient training scalability.
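For reference, a schematic form of the GRPO objective (paraphrasing the notation of the DeepSeekMath paper; see the paper for the exact formulation, including the unbiased KL estimator it uses) is:

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta)
= \mathbb{E}_{q \sim P(Q),\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)}
  \Bigg[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
    \Big( \min\!\big( r_{i,t}(\theta)\, \hat{A}_{i,t},\;
      \mathrm{clip}\big(r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\, \hat{A}_{i,t} \big)
    - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right] \Big) \Bigg],
\quad
r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}
```

Here G completions o_i are sampled per question q, \hat{A}_{i,t} is the group-relative advantage (no critic is trained), and \pi_{\mathrm{ref}} is a frozen reference policy.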
3. Mathematical Reasoning Capabilities and Empirical Performance
Evaluation on the MATH benchmark—a suite of competition-level math problems—demonstrates strong open-source results:
- Chain-of-thought accuracy (single sample): 51.7% on MATH, without tool augmentation or majority voting, approaching proprietary models (Gemini-Ultra, GPT-4).
- Self-consistency (majority answer from 64 samples): 60.9% on MATH, indicating improved reliability through sampling and consensus; a minimal voting sketch follows below.
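The self-consistency figure corresponds to simple majority voting over final answers. A minimal sketch (answer-extraction details omitted for brevity) is:

```python
from collections import Counter

def self_consistency(final_answers: list[str]) -> str:
    """Majority vote over final answers extracted from the sampled
    chain-of-thought solutions; ties resolve to the first-seen answer."""
    return Counter(final_answers).most_common(1)[0][0]

print(self_consistency(["1/3", "1/3", "2/3", "1/3"]))  # "1/3"
```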
DeepSeekMath 7B also performs competitively on GSM8K and miniF2F (informal-to-formal theorem proving), indicating strong generalization to diverse mathematical subdomains and reasoning depths.
The model’s outputs exhibit chain-of-thought reasoning, producing multi-step, self-contained solutions that incorporate formal mathematical notation (including LaTeX-rendered expressions) alongside programmatic calculations. Program-of-thought instruction tuning enables it to seamlessly integrate code execution (e.g., for numerical integration or algebraic computation) within textual solutions.
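As an illustration of the program-of-thought style (a hand-written sketch, not actual model output), a solution might interleave prose with a short symbolic computation:

```python
import sympy as sp

# Evaluate the definite integral of x^2 over [0, 1] exactly,
# rather than reasoning about it purely in natural language.
x = sp.symbols("x")
result = sp.integrate(x**2, (x, 0, 1))
print(result)  # 1/3
```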
4. Data Selection and Training Pipeline
The mathematical corpus selection pipeline is a multi-step process involving:
- Initial seed collection (e.g., OpenWebMath) as a positive base for classifier training.
- Training of a fastText binary classifier on this base to rank and select relevant pages from Common Crawl.
- Iterative retraining of the classifier as the corpus grows, with continued human oversight to eliminate false positives and maximize coverage.
- Final dataset curation with token-level and document-level filtering, ensuring a wide range of mathematical problems, diverse presentation formats, and topic stratification.
This large-scale, high-purity dataset is then mixed with other data modalities for final continued pre-training, followed by reinforcement learning (GRPO) for mathematical reasoning optimization.
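A minimal sketch of the classifier step, using the open-source fastText library (the file name and hyperparameters here are illustrative assumptions, not the paper's exact settings):

```python
import fasttext

# seed.txt: one document per line, labeled "__label__math ..." for
# OpenWebMath-style positives and "__label__other ..." for negatives.
model = fasttext.train_supervised(
    input="seed.txt", lr=0.1, epoch=3, wordNgrams=3, dim=256
)

def math_score(text: str) -> float:
    """Probability that a page is mathematical; such scores can be used
    to rank Common Crawl pages before human review of URLs/domains."""
    labels, probs = model.predict(text.replace("\n", " "))
    p = float(probs[0])
    return p if labels[0] == "__label__math" else 1.0 - p
```

Pages scoring above a threshold join the corpus; the classifier is then retrained on the enlarged positive set, which is the iterative loop described above.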
5. Reinforcement Learning with Group Relative Policy Optimization
GRPO operates as follows:
- For each math prompt, a group (batch) of model completions is generated.
- Each completion is scored by a reward signal (e.g., based on final-answer accuracy or step-wise logical correctness).
- Instead of maintaining a separate critic network to estimate value functions, GRPO computes a baseline within each group (the group's mean reward, typically normalized by the group's standard deviation) and updates the policy with relative rather than absolute advantages, as sketched after this list.
- A KL regularization term is added to control policy shifts.
- This methodology reduces both computation and memory cost, since advantages are derived from group-level statistics rather than a learned value model, while better matching the comparative nature of mathematical answer evaluation.
This variant departs from classical PPO chiefly by eliminating the critic, and is tailored to aligning mathematical reasoning with reward signals derived from mathematical correctness and logical plausibility.
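A minimal sketch of the group-relative advantage computation at the heart of GRPO (the mean/std normalization follows the paper's outcome-reward setup; the tensor shapes and the simplified surrogate loss are assumptions for illustration, and the KL penalty is omitted):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) scalar rewards for G completions of one prompt.
    The group mean serves as the baseline; no critic network is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_surrogate_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped PPO-style surrogate using group-relative advantages.
    logprobs/old_logprobs: (G,) summed token log-probs per completion."""
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.mean(torch.min(ratio * advantages, clipped * advantages))

rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])  # e.g. answer correctness per sample
adv = group_relative_advantages(rewards)      # correct samples get positive advantage
```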
6. Limitations, Comparative Results, and Future Directions
While DeepSeekMath 7B sets the state of the art among open-source models of comparable size in mathematics (notably outperforming models trained solely on code or general text, such as Llama or Mistral 7B), closed-source models retain notable leads in geometry and formal theorem proving. The remaining gap to Gemini-Ultra and GPT-4 in these specialized domains is attributed to the composition of the pre-training data and limitations in RL reward modeling.
Notable limitations and prospective improvements include:
- Domain-Specific Augmentation: Expanded or more specialized geometry/logic datasets could address observed weaknesses.
- Iterative or Meta-RL approaches: Potential exists to further enhance sample efficiency and detail-oriented reasoning through iterative RL or adaptive reward shaping.
- Verifier-Guided Search Flaws: As reported by Yu et al. (2025), DeepSeekMath 7B exhibits scaling flaws in verifier-guided search. While verifier-based beam search outperforms random sampling at low sample budgets, its effectiveness degrades as the sample budget grows, because the imperfect verifier misranks candidates and prunes correct partial solutions prematurely; the toy sketch below illustrates this failure mode.
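The following toy sketch (entirely illustrative; not the evaluation code from Yu et al.) shows the mechanism: a beam search that trusts an imperfect verifier discards low-scoring prefixes at every depth, so if the prefix of the correct solution is ever misranked, no amount of additional sampling can recover it.

```python
import heapq

def verifier_guided_search(expand, verifier, beam_width=4, depth=3, root=""):
    """Keep only the top-`beam_width` partial solutions by verifier
    score at each depth. A misranked prefix of the correct solution
    is pruned for good, regardless of the total sample budget."""
    beam = [root]
    for _ in range(depth):
        candidates = [child for prefix in beam for child in expand(prefix)]
        beam = heapq.nlargest(beam_width, candidates, key=verifier)
    return max(beam, key=verifier)
```

Here `expand` generates continuations of a partial solution and `verifier` scores them; both are hypothetical callables standing in for the model and the learned verifier.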
The model's use of chain-of-thought and program-of-thought instruction tuning, combined with GRPO, distinguishes its optimization strategy from methods that rely solely on instruction tuning or supervised data scaling.
7. Practical Applications and Impact
DeepSeekMath 7B provides a high-quality, openly accessible baseline for tasks requiring advanced mathematical reasoning, such as:
- Automated problem solving and proof generation in math education and competitions.
- Mathematical inference supporting discovery in the sciences and engineering.
- Integration with symbolic computational tools for programmatic theorem proving.
- Research environments requiring explainable, multi-step mathematical outputs, formalization, and code interleaving.
Released under an open and permissive license, DeepSeekMath 7B extends the reach of high-performance mathematical language modeling to academic research and open-source engineering, accelerating methodological progress in interpretable and tool-enabled reasoning.