DeepSeekMath Corpus: High-Quality Math Data

Updated 4 August 2025
  • DeepSeekMath Corpus is a rigorously curated, large-scale dataset containing 120 billion tokens, constructed via an iterative classifier-driven extraction process.
  • It covers diverse mathematical domains, including quantitative reasoning, formal proofs, numerical algebra, and competition problems, ensuring broad representation.
  • Through advanced cleaning, deduplication, and contamination filtering, the corpus underpins models like DeepSeekMath 7B, significantly enhancing step-by-step problem solving.

The DeepSeekMath Corpus is a large-scale, high-quality, web-mined mathematical text dataset designed to serve as the pre-training backbone for advanced LLMs with mathematical reasoning capabilities. Originally developed to fuel the DeepSeekMath 7B model, the corpus encompasses 120 billion math-related tokens, making it among the largest math-specific corpora described to date. Its construction emphasizes both the quantity and the quality of mathematical data, incorporating rigorous filtration, deduplication, contamination removal, and an iterative data selection process informed by machine learning classifiers. The corpus provides broad coverage of mathematical domains, including quantitative reasoning, formal proofs, numerical algebra, and competition-level problem solving.

1. Corpus Construction and Selection Pipeline

The assembly of the DeepSeekMath Corpus employs an iterative, classifier-driven extraction process from Common Crawl datasets. The methodology can be summarized as follows:

  • Seed Corpus: The pipeline initiates with OpenWebMath, a vetted math-rich web text collection, as the positive seed.
  • Classifier Training: A fastText classifier is trained on 500k positive examples sampled from the seed corpus and 500k negatives sampled from non-mathematical web pages, using 256-dimensional embeddings over word n-grams.
  • Scoring and Ranking: All deduplicated Common Crawl web pages are scored using the trained classifier. Pages are ranked by their predicted “math score”—pages with the highest likelihood of containing substantive mathematical content are prioritized.
  • Seed Diversification: To enhance coverage, the seed corpus is expanded by manually inspecting domain-level math density: any domain in which more than 10% of pages are math-related (e.g., mathoverflow.net) is incorporated, together with hand-curated URL-pattern lists that recover valuable sources the classifier initially missed.
  • Selection Iterations: Top-ranking pages are added in several rounds, eventually aggregating 35.5 million mathematical documents.
  • Deduplication: Near-duplicate removal is performed at the n-gram level to ensure data diversity and to avoid redundancy.
  • Contamination Filtering: To prevent benchmark leakage, the corpus is purged of any document containing sequences (typically 10-grams or longer) matching downstream evaluation sets such as MATH, GSM8K, or MMLU-STEM. This step ensures a fair evaluation of models pre-trained on the corpus.

This pipeline targets high recall and precision for authentic mathematical content, outstripping prior math corpora in both the volume and the rigor of its curation techniques (Shao et al., 5 Feb 2024).
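
A minimal sketch of the classifier step is shown below, using the open-source fastText Python bindings. The training-file name, label conventions, and helper function are illustrative assumptions rather than a released implementation; the hyperparameters mirror the settings described above (256-dimensional vectors over word n-grams).

```python
# Sketch of one round of classifier-driven selection (illustrative; file names,
# label names, and the scoring helper are assumptions, not the released pipeline).
import fasttext

# train.txt holds one example per line in fastText format, e.g.
#   __label__math  <text of a math-rich seed page>       (~500k positives)
#   __label__other <text of a non-mathematical CC page>  (~500k negatives)
model = fasttext.train_supervised(
    input="train.txt",
    dim=256,        # 256-dimensional embeddings
    wordNgrams=3,   # word n-gram features
    lr=0.1,
    epoch=3,
)

def math_score(page_text: str) -> float:
    """Predicted probability that a deduplicated Common Crawl page is mathematical."""
    labels, probs = model.predict(page_text.replace("\n", " "), k=1)
    p = float(probs[0])
    return p if labels[0] == "__label__math" else 1.0 - p

# Pages are ranked by math_score; the top-ranked pages join the corpus, and the
# enlarged seed (plus newly annotated domains) retrains the classifier for the
# next selection round.
```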

2. Scope, Coverage, and Statistical Characteristics

Drawing solely from high-scoring webpages identified through the classifier-guided selection, the DeepSeekMath Corpus embraces a diverse spectrum of mathematical disciplines and genres:

  • Content Types: The corpus spans didactic exposition, competition problems, research-level proofs, programmatic solutions (code), and discussions from both formal and informal mathematical fora.
  • Language and Format: The data heavily features LaTeX notation, reflecting its use on web forums, preprint repositories, and educational sites. Formal content such as proofs and stepwise reasoning is particularly emphasized.
  • Data Scale: The final corpus aggregates 120 billion tokens and supplies 56% of the 500B-token mixture used for DeepSeekMath 7B pre-training, substantially exceeding comparable resources. For context, OpenWebMath provides approximately 14.5B tokens, while MegaMath, another large-scale corpus, carries 371B tokens gathered under different collection strategies (Zhou et al., 3 Apr 2025).

By extracting a wide range of document types and mathematical formats, the corpus ensures substantial representation of both English and Chinese mathematical sources, providing broad linguistic and topical diversity.

3. Pre-Processing, Deduplication, and Contamination Control

The pre-processing framework applies state-of-the-art cleaning and deduplication strategies:

  • Text Cleaning: Documents undergo normalization and artifact removal, including the deletion of non-mathematical noise, boilerplate, and corrupted lines.
  • Structural Filtering: Mathematical expressions are parsed and normalized, with particular care to preserve LaTeX syntax; structural formula detection of the kind used in related corpora such as MegaMath (Zhou et al., 3 Apr 2025) is also supported.
  • Deduplication: Locality Sensitive Hashing (LSH) and fast n-gram matching are used to identify and remove near-duplicates across the millions of selected web pages, ensuring both intra- and inter-source uniqueness (see the MinHash sketch after this list).
  • Contamination Detection: To preserve the validity of downstream benchmarks, strict line-level n-gram matching is performed between the corpus and task evaluation sets; any overlap triggers removal of the contaminated sample, reducing the risk of benchmark leakage and supporting reliable model assessment (a minimal matching sketch closes this section).
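
The LSH-based near-duplicate pass mentioned above boils down to estimating n-gram overlap between pages. Below is a minimal MinHash sketch under assumed parameters (word 5-gram shingles, 128 hash functions, a ~0.8 similarity threshold); it illustrates the similarity estimate that an LSH banding scheme would then index at scale.

```python
# Minimal MinHash estimate of n-gram overlap between two pages (illustrative;
# shingle size, hash count, and threshold are assumptions, not pipeline values).
import hashlib
from typing import List, Set

NUM_HASHES = 128
SHINGLE = 5  # word 5-gram shingles

def shingles(text: str, n: int = SHINGLE) -> Set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash_signature(text: str) -> List[int]:
    """One minimum hash value per seeded hash function."""
    shingle_set = shingles(text)
    return [
        min(int(hashlib.md5(f"{seed}|{s}".encode()).hexdigest(), 16) for s in shingle_set)
        for seed in range(NUM_HASHES)
    ]

def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of agreeing positions approximates the true n-gram Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Pages whose estimated similarity exceeds ~0.8 are treated as near-duplicates,
# and only one representative is retained in the corpus.
```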

This level of pre-processing is critical for the development of foundation mathematical reasoning models, as it directly impacts both generalization and measurement fidelity.
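
The contamination check itself reduces to exact n-gram matching against the evaluation sets. A minimal sketch follows, assuming whitespace tokenization and in-memory benchmark texts (the actual pipeline may normalize text differently):

```python
# Sketch of benchmark decontamination by exact 10-gram matching (illustrative;
# tokenization and data loading are assumptions).
from typing import Iterable, List, Set, Tuple

NGRAM = 10

def ngrams(tokens: List[str], n: int = NGRAM) -> Set[Tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_texts: Iterable[str]) -> Set[Tuple[str, ...]]:
    """Collect every 10-gram appearing in evaluation sets such as MATH, GSM8K, MMLU-STEM."""
    index: Set[Tuple[str, ...]] = set()
    for text in benchmark_texts:
        index |= ngrams(text.lower().split())
    return index

def is_contaminated(document: str, benchmark_index: Set[Tuple[str, ...]]) -> bool:
    """Flag a document for removal if any of its 10-grams also occurs in a benchmark item."""
    return not ngrams(document.lower().split()).isdisjoint(benchmark_index)

# Usage: filter the corpus before pre-training.
# clean_docs = [d for d in corpus_docs if not is_contaminated(d, benchmark_index)]
```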

4. Role in Model Development and Performance Impact

The DeepSeekMath Corpus directly underpins the training of DeepSeekMath 7B, an LLM specifically oriented towards mathematical problem solving.

  • Training Regimen: DeepSeekMath 7B is initialized from DeepSeek-Coder-Base-v1.5 7B and further pre-trained on 500B tokens, with 56% of the training mixture drawn from the DeepSeekMath Corpus.
  • Performance Benchmarks: On the competition-level MATH benchmark, the base model pre-trained on the corpus reaches 36.2% top-1 accuracy with chain-of-thought prompting, and the final DeepSeekMath 7B model attains 51.7% without external toolkits or voting, approaching the performance of leading closed models such as Gemini-Ultra and GPT-4. With 64-sample self-consistency voting, accuracy reaches 60.9%.
  • Generalization: The corpus’s breadth is reflected in the observed robustness of mathematical reasoning across problem types and languages (notably English and Chinese), outperforming prior open-source math-focused pre-training sets.

A key property linked to the corpus scale and content quality is that models trained on DeepSeekMath’s data demonstrate enhanced capacity for step-by-step quantitative reasoning, formal deduction, and symbolic manipulation—all enabled by the diversity and decontaminated nature of the training data.
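
The 64-sample self-consistency figure cited above corresponds to simple majority voting over independently sampled solutions. A minimal sketch, with the solver and answer extractor passed in as hypothetical callables (they are not part of any released DeepSeekMath API):

```python
# Majority-vote self-consistency over n sampled solutions (illustrative sketch).
from collections import Counter
from typing import Callable

def self_consistency(
    problem: str,
    generate_solution: Callable[[str], str],  # stochastic solver, e.g. a temperature-sampled LLM call
    extract_answer: Callable[[str], str],     # e.g. parse the final \boxed{...} value
    n_samples: int = 64,
) -> str:
    """Sample n_samples chain-of-thought solutions and return the most frequent final answer."""
    answers = [extract_answer(generate_solution(problem)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```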

5. Comparisons and Relationships to Other Math Corpora

Relative to other major math corpora, the DeepSeekMath Corpus distinguishes itself in scale, curation rigor, and documented impact on model performance:

Corpus         Tokens (B)   Major Features
DeepSeekMath   120          Iterative selection; strict filtering
MathPile       ~9.5         Quality-focused; diverse, processed
MegaMath       371          Coarse-to-fine web/code/synthetic
OpenWebMath    ~14.5        Math web snippets; initial seed

Although MegaMath is larger, its collection emphasizes quantity (with fine-tuned small-language-model code selection, synthetic data, and two-stage extraction), while DeepSeekMath is more targeted, using iterative classifier-driven selection to maximize math token relevance and rigorous benchmark decontamination (Zhou et al., 3 Apr 2025, Wang et al., 2023).

A plausible implication is that while larger corpora (e.g., MegaMath) offer broader math code coverage and synthetic diversity, DeepSeekMath’s pipeline yields a higher-quality signal for the specific demands of mathematical reasoning.

6. Future Directions and Integration

Planned improvements for the DeepSeekMath Corpus include:

  • Domain Bias Correction: Addressing current underrepresentation of geometric reasoning and formal theorem proving by enriching seed corpora and selection criteria.
  • Fine-Grained Data Tagging: Incorporating metadata for domain, topic, and language to enable conditional fine-tuning and filtering for downstream tasks.
  • Hybrid Corpus Strategies: Potentially integrating retrieval-augmented paradigms and formal language resources (e.g., Lean formalizations) (Zayyad et al., 21 Dec 2024), notably for applications in automated theorem proving and symbolic computation.
  • Open Release: Future iterations of the corpus and its construction scripts are expected to be released for reproducibility and community-driven extension.

This suggests that DeepSeekMath may evolve into an even more versatile resource, offering modular integration with formal language corpora and multi-modal retrieval systems to support the next generation of mathematical reasoning models.


In sum, the DeepSeekMath Corpus embodies a systematic, large-scale, and quality-oriented approach to mathematical text curation for LLM training. Its influence is seen not only in the competitive performance of models such as DeepSeekMath 7B but also in setting standards for contamination control, deduplication, and multi-domain coverage in mathematical corpora (Shao et al., 5 Feb 2024).