Nemotron-CC-Math: Superior Math Pretraining
- Nemotron-CC-Math is a large-scale, high-quality math pretraining corpus characterized by a novel, layout-aware extraction pipeline that preserves mathematical notation and code blocks.
- Its methodology employs Lynx-based rendering and an LLM-driven cleaning process with Phi-4 to standardize equations into LaTeX and remove noisy content.
- Scaling to variants with up to 133 billion tokens, the corpus demonstrates significant benchmark improvements in math and code reasoning for large language models.
Nemotron-CC-Math is a large-scale, high-quality mathematical pretraining corpus developed to enhance the reasoning capabilities of LLMs. Constructed from Common Crawl web data using a novel, layout-preserving, LLM-driven extraction pipeline, it specifically addresses deficiencies observed in prior math datasets, such as loss of equation structure, encoding errors, and noisy content. It is distinguished by its substantial scale (two variants: Nemotron-CC-Math-3+ at 133 billion tokens and Nemotron-CC-Math-4+ at 52 billion tokens) and by rigorous document processing that standardizes mathematical notation into LaTeX, preserves code blocks, and yields measurable improvements on competitive math, code, and reasoning benchmarks.
1. Dataset Construction Methodology
Nemotron-CC-Math is built from technical web sources identified within Common Crawl, targeting math-centric URLs originating from resources such as OpenWebMath, FineMath, and MegaMath. The extraction pipeline is modular and domain-agnostic, employing multiple recent crawl snapshots to maximize coverage. Core to the process is layout-aware rendering using Lynx, a text-based browser that applies HTML layout rules in a fashion that preserves equation placement and code block structure, crucial for scientific text.
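The paper does not publish its exact Lynx invocation, but the rendering step can be sketched roughly as follows. This is an illustrative assumption, not the released pipeline code; `render_with_lynx` is a hypothetical helper, and only the `-dump` and `-nolist` flags are standard Lynx options:

```python
import subprocess

def lynx_dump_cmd(html_path: str) -> list[str]:
    # -dump: print the rendered page to stdout instead of opening the UI
    # -nolist: suppress the trailing numbered list of hyperlinks
    return ["lynx", "-dump", "-nolist", html_path]

def render_with_lynx(html_path: str) -> str:
    # Render the HTML the way a text browser lays it out, which keeps
    # code-block indentation and equation placement intact.
    result = subprocess.run(
        lynx_dump_cmd(html_path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

The key design point is that a text browser applies real HTML layout rules, so structure survives that a naive tag-stripping scraper would destroy.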
Following rendering, a lightweight LLM (Phi-4) executes a semantic cleaning stage: boilerplate and irrelevant content are pruned, mathematical notation is standardized into LaTeX, and typographical errors are corrected. Quality classification then applies a FineMath classifier with a 5-point score, and only pages scoring 3 or higher are retained. Fuzzy deduplication (using MinHash) and contamination filtering further remove near-duplicate documents and non-mathematical or benchmark-overlapping content, ensuring document uniqueness and technical fidelity.
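To make the deduplication step concrete, here is a minimal, self-contained sketch of MinHash-based fuzzy deduplication. It is not the paper's implementation (which operates at web scale with banded LSH); the shingle size, permutation count, and threshold below are illustrative assumptions:

```python
import hashlib

def shingles(text: str, n: int = 5) -> set[str]:
    # Character n-grams make the signature robust to small edits.
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash(text: str, num_perm: int = 64) -> list[int]:
    # One signature slot per simulated permutation: the minimum of a
    # seeded hash over all shingles of the document.
    grams = shingles(text)
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{g}".encode(), digest_size=8).digest(),
                "big",
            )
            for g in grams
        )
        for seed in range(num_perm)
    ]

def jaccard_estimate(sig_a: list[int], sig_b: list[int]) -> float:
    # Fraction of matching slots estimates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    return jaccard_estimate(minhash(doc_a), minhash(doc_b)) >= threshold
```

In a production pipeline the signatures would be indexed with locality-sensitive hashing so each document is compared only against candidate buckets rather than every other document.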
2. Handling Mathematical Structure and Formats
Mathematical content on the web is highly heterogeneous—appearing in MathJax, KaTeX, MathML, raw inline LaTeX, or as image-based equations. The Nemotron-CC-Math pipeline leverages Lynx’s rendering to preserve layout and context, capturing all forms of equations and code without loss from conversion errors typical in pure HTML-to-text scrapers.
A targeted LLM-based cleaning stage (Phi-4) then unifies all mathematical expressions into a canonical LaTeX format using dollar-sign delimiters. This harmonization allows downstream models to ingest complex formulas with standardized notation and without ambiguity in encoding. Code blocks, indentation, and document hierarchy are similarly preserved, providing high-fidelity input for pretraining scientific LLMs.
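The actual harmonization is performed by the Phi-4 cleaning stage, but the target canonical form can be illustrated with a toy rule for MathJax-style delimiters. This regex sketch is an assumption for illustration only and handles just one of the many input formats the LLM stage covers:

```python
import re

def normalize_delimiters(text: str) -> str:
    # Display math \[ ... \] -> $$ ... $$
    text = re.sub(r"\\\[(.+?)\\\]", r"$$\1$$", text, flags=re.DOTALL)
    # Inline math \( ... \) -> $ ... $
    text = re.sub(r"\\\((.+?)\\\)", r"$\1$", text, flags=re.DOTALL)
    return text
```

Rule-based rewrites like this break down on MathML trees and image-based equations, which is precisely why the pipeline delegates normalization to an LLM rather than to regexes.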
3. Corpus Scale and Quality
Nemotron-CC-Math is released in two key variants:
- Nemotron-CC-Math-3+: 133 billion tokens, ~101 million documents, quality scores 3–5.
- Nemotron-CC-Math-4+: 52 billion tokens, ~45 million documents, quality scores 4–5.
A comparison with prior open datasets illustrates the scale advantage: Nemotron-CC-Math-4+ is 5.5 times larger than FineMath-4+, the next largest high-quality math source, and substantially exceeds MegaMath and OpenWebMath in both token count and content integrity. The quality assurance pipeline, employing both semantic and syntactic evaluation, yields documents with reliably preserved mathematical and code structure.
| Dataset | Tokens (billions) | Quality measure |
| --- | --- | --- |
| Nemotron-CC-Math-4+ | 52 | FineMath score 4–5 |
| FineMath-4+ | ~9.5 | FineMath score 4+ |
| MegaMath-Web | ~21 | — |
This scale and quality are foundational for improved downstream generalization.
4. Effects on Model Performance
Pretraining experiments utilizing Nemotron-CC-Math in an 8B-parameter model (Nemotron-T) reveal substantial gains on both math and code reasoning benchmarks. When trained on 100B tokens, Nemotron-CC-Math-4+ yields a +4.8-point improvement on the MATH benchmark and +4.6 to +14.3 points on MBPP+ code generation tasks compared with models trained on FineMath-4+. At higher token counts (300B), the gains extend further: MATH scores reach 44.2 (+9.6 over FineMath-3+, +12.6 over MegaMath-Web).
These improvements are not limited to mathematics-oriented tasks. There are also benefits on general knowledge benchmarks such as MMLU and MMLU-STEM, demonstrating enhanced reasoning capacity due to superior structural fidelity and diversity in the pretraining data.
A plausible implication is that high-fidelity math data augments both symbolic reasoning and general problem-solving in LLMs.
5. Pipeline Innovations and Data Integrity
Nemotron-CC-Math’s pipeline introduces non-heuristic, layout-aware extraction and semantic cleaning, contrasting sharply with earlier methods reliant on brittle HTML parsing. By using Lynx for HTML layout rendering and an LLM for content cleaning, the pipeline is robust to format variation and inconsistent page structure, effectively recovering scientific content from noisy, web-scale sources.
Fuzzy deduplication with MinHash and multi-stage quality classification (including the numeric FineMath score and contamination filters) further distinguish the approach, ensuring document uniqueness and high technical value.
6. Open-Source Release and Research Community Impact
Both the Nemotron-CC-Math dataset and its extraction pipeline—including all code—are openly released (GitHub and Hugging Face), enabling reproducibility, extensibility, and transparency. This contribution supports community experimentation, extension to new scientific or technical domains, and a foundation for further innovation in LLM pretraining for reasoning tasks.
The ability to reliably extract high-quality scientific content from noisy web data sets a precedent for future corpus construction in language modeling.
7. Positioning Relative to Prior Datasets and Benchmarks
Nemotron-CC-Math is explicitly positioned as setting a new state of the art among open math pretraining corpora. Its methodological advances and benchmark improvements make it preferable for researchers seeking to enhance technical reasoning in LLMs.
Supplementary analyses show benefits extend to code and reasoning tasks; this pattern reinforces the central thesis that well-structured, large-scale mathematical data is a critical underpinning for advanced LLM capabilities.
In sum, Nemotron-CC-Math constitutes a substantial advance in mathematical pretraining corpora for LLMs, offering innovations in layout-aware extraction, LLM-driven cleaning, rigorous deduplication and classification, and demonstrable improvements in reasoning capability across math, code, and broader domains. The open-source release enables community adoption and further methodological development.