MathPile: A Billion-Token-Scale Pretraining Corpus for Math (2312.17120v2)
Abstract: High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of "less is more", firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection against downstream benchmark test sets to eliminate duplicates, and our continual pre-training experiments show boosted performance on common mathematical reasoning benchmarks. We aim for MathPile to enhance LLMs' mathematical reasoning abilities, and we open-source its different versions and processing scripts to advance the field.
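The authors' actual pipeline lives in their released processing scripts; as a rough illustration of how the contamination-detection step against benchmark test sets can work, the sketch below implements a common hashed n-gram overlap check. All names, the 13-gram window, and the MD5 hashing are assumptions for illustration, not the paper's exact configuration.

```python
import hashlib
import re

# Illustrative sketch only: flag corpus documents whose word-level n-grams
# overlap with downstream benchmark test sets. Window size, hash function,
# and names are assumptions, not MathPile's released configuration.

def word_ngrams(text: str, n: int = 13):
    """Yield word-level n-grams; 13-grams are a common window for contamination checks."""
    words = re.findall(r"\w+", text.lower())
    for i in range(len(words) - n + 1):
        yield " ".join(words[i : i + n])

def build_benchmark_index(test_examples, n: int = 13) -> set:
    """Hash every n-gram of the benchmark test sets into a lookup set."""
    return {
        hashlib.md5(gram.encode("utf-8")).hexdigest()
        for example in test_examples
        for gram in word_ngrams(example, n)
    }

def is_contaminated(document: str, index: set, n: int = 13) -> bool:
    """Flag a corpus document if any of its n-grams appears in a benchmark example."""
    return any(
        hashlib.md5(gram.encode("utf-8")).hexdigest() in index
        for gram in word_ngrams(document, n)
    )

if __name__ == "__main__":
    # Hypothetical benchmark question and corpus documents.
    test_set = [
        "Natalia sold clips to 48 of her friends in April and then "
        "she sold half as many clips in May how many clips altogether"
    ]
    corpus = [
        "An unrelated note on the convergence of geometric series.",
        "Natalia sold clips to 48 of her friends in April and then "
        "she sold half as many clips in May how many clips altogether",
    ]
    index = build_benchmark_index(test_set)
    clean = [doc for doc in corpus if not is_contaminated(doc, index)]
    print(f"kept {len(clean)} of {len(corpus)} documents")
```

In practice such a pass would sit alongside within-corpus near-duplicate removal (e.g. MinHash-based deduplication); the thresholds and methods MathPile actually uses are documented in its open-sourced scripts.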
Authors: Zengzhi Wang, Rui Xia, Pengfei Liu, Xuefeng Li