MathPile: A Billion-Token-Scale Pretraining Corpus for Math

Published 28 Dec 2023 in cs.CL, cs.AI, and cs.LG | (2312.17120v2)

Abstract: High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of "less is more", firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates and conducted continual pre-training experiments, boosting the performance on common mathematical reasoning benchmarks. We aim for our MathPile to boost LLMs' mathematical reasoning abilities and open-source its different versions and processing scripts to advance the field.

Summary

The paper introduces MathPile, a comprehensive corpus curated from diverse mathematical sources totaling 9.5 billion tokens.
The authors detail rigorous cleaning, deduplication, and contamination detection methods tailored to the nuances of mathematical text.
The corpus, enriched with structural annotations, is designed to enhance AI performance on mathematical reasoning tasks.

Introduction to the Corpus

This paper introduces MathPile, a comprehensive and multifaceted corpus designed specifically for the mathematical domain. The corpus contains approximately 9.5 billion tokens collected from a variety of sources, each with a strong mathematical focus. These sources include academic papers from arXiv, textbooks, ProofWiki entries, discussions from StackExchange, and web pages from Common Crawl.

Corpus Characteristics and Quality

To ensure the high quality of MathPile, the authors applied a multi-step data processing pipeline. It included language identification to filter out non-English documents and a set of heuristic cleaning and filtering rules tailored to the particular features of mathematical text. Deduplication further refined the corpus by removing duplicate content both within and across sources. Finally, data contamination detection was performed to remove documents that overlap with downstream benchmark test sets, an often overlooked but critical step in corpus construction.
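The deduplication and contamination steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes exact matching on normalized text for both steps, and the function names (`normalize`, `deduplicate`, `is_contaminated`, `clean_corpus`) are hypothetical. A production pipeline would likely use fuzzier matching for near-duplicates.

```python
from __future__ import annotations

import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared by normalized hash."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def is_contaminated(doc: str, benchmark_lines: set[str]) -> bool:
    """Flag a document if any non-empty line exactly matches a benchmark line.

    Assumption: benchmark_lines already holds normalized benchmark questions.
    """
    return any(
        normalize(line) in benchmark_lines
        for line in doc.splitlines()
        if line.strip()
    )


def clean_corpus(docs: list[str], benchmark_questions: list[str]) -> list[str]:
    """Deduplicate first, then drop documents that overlap with the benchmark."""
    benchmark_lines = {normalize(q) for q in benchmark_questions}
    return [d for d in deduplicate(docs) if not is_contaminated(d, benchmark_lines)]


if __name__ == "__main__":
    corpus = [
        "Prove that 2 + 2 = 4.",
        "prove that  2 + 2 = 4.",          # duplicate up to case and spacing
        "Intro\nWhat is 7 * 6?",           # contains a benchmark question
        "A group is a set with an operation.",
    ]
    print(clean_corpus(corpus, ["What is 7 * 6?"]))
```

Running deduplication before contamination detection keeps the benchmark comparison cheaper, since each distinct document is checked only once.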

Structural Diversity and Annotations

The documents within MathPile are diverse not only in content but also in structure. Content ranges from definitions and theorem proofs to the more narrative styles found in textbook excerpts and website articles. The documents are enriched with annotations that enhance their utility for AI tasks, providing metadata such as language identification scores and the ratio of symbols to words, which future users can leverage for tailored filtering.
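As an illustration of how such annotations might support tailored filtering, here is a small sketch. The record keys (`lang_score`, `symbol_to_word_ratio`), the thresholds, and the way the ratio is computed are all assumptions made for this example; they are not taken from the released corpus schema.

```python
import re

# Assumption: a "word" is a run of ASCII letters and a "symbol" is any
# character that is neither a word character nor whitespace.
WORD_RE = re.compile(r"[A-Za-z]+")
SYMBOL_RE = re.compile(r"[^\w\s]")


def symbol_to_word_ratio(text):
    """Return the number of symbol characters per word in the text."""
    words = WORD_RE.findall(text)
    symbols = SYMBOL_RE.findall(text)
    return len(symbols) / max(len(words), 1)  # avoid division by zero


def keep(record, min_lang_score=0.8, max_symbol_ratio=0.5):
    """Decide whether a document passes the (hypothetical) quality thresholds."""
    return (
        record["lang_score"] >= min_lang_score
        and record["symbol_to_word_ratio"] <= max_symbol_ratio
    )


def filter_records(records, **thresholds):
    """Return only the records that satisfy the thresholds."""
    return [r for r in records if keep(r, **thresholds)]


if __name__ == "__main__":
    records = [
        {"id": "a", "lang_score": 0.95, "symbol_to_word_ratio": 0.10},
        {"id": "b", "lang_score": 0.40, "symbol_to_word_ratio": 0.10},  # low language score
        {"id": "c", "lang_score": 0.90, "symbol_to_word_ratio": 2.00},  # mostly symbols
    ]
    print([r["id"] for r in filter_records(records)])
```

Because the annotations travel with each document, a user could tighten or loosen the thresholds without rerunning language identification or recomputing the ratio.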

Utility and Future Directions

The creators hope MathPile will become a valuable asset for advancing AI's capabilities in mathematical reasoning. Whether used on its own or combined with general-domain corpora, MathPile is intended to improve AI systems' performance on mathematically oriented tasks. The authors also plan ongoing updates and expansions to the corpus, with the aim of continuous improvement and wider applicability.