MathPile: A Billion-Token-Scale Pretraining Corpus for Math (2312.17120v2)
Abstract: High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of "less is more", firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection against downstream benchmark test sets to eliminate duplicates, and our continual pre-training experiments show boosted performance on common mathematical reasoning benchmarks. We aim for MathPile to enhance LLMs' mathematical reasoning abilities, and we open-source its different versions and processing scripts to advance the field.
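The authors' actual pipeline lives in their released processing scripts; as a rough illustration of how the contamination-detection step against benchmark test sets can work, the sketch below implements a common hashed n-gram overlap check. All names, the 13-gram window, and the MD5 hashing are assumptions for illustration, not the paper's exact configuration.

```python
import hashlib
import re

# Illustrative sketch only: flag corpus documents whose word-level n-grams
# overlap with downstream benchmark test sets. Window size, hash function,
# and names are assumptions, not MathPile's released configuration.

def word_ngrams(text: str, n: int = 13):
    """Yield word-level n-grams; 13-grams are a common window for contamination checks."""
    words = re.findall(r"\w+", text.lower())
    for i in range(len(words) - n + 1):
        yield " ".join(words[i : i + n])

def build_benchmark_index(test_examples, n: int = 13) -> set:
    """Hash every n-gram of the benchmark test sets into a lookup set."""
    return {
        hashlib.md5(gram.encode("utf-8")).hexdigest()
        for example in test_examples
        for gram in word_ngrams(example, n)
    }

def is_contaminated(document: str, index: set, n: int = 13) -> bool:
    """Flag a corpus document if any of its n-grams appears in a benchmark example."""
    return any(
        hashlib.md5(gram.encode("utf-8")).hexdigest() in index
        for gram in word_ngrams(document, n)
    )

if __name__ == "__main__":
    # Hypothetical benchmark question and corpus documents.
    test_set = [
        "Natalia sold clips to 48 of her friends in April and then "
        "she sold half as many clips in May how many clips altogether"
    ]
    corpus = [
        "An unrelated note on the convergence of geometric series.",
        "Natalia sold clips to 48 of her friends in April and then "
        "she sold half as many clips in May how many clips altogether",
    ]
    index = build_benchmark_index(test_set)
    clean = [doc for doc in corpus if not is_contaminated(doc, index)]
    print(f"kept {len(clean)} of {len(corpus)} documents")
```

In practice such a pass would sit alongside within-corpus near-duplicate removal (e.g. MinHash-based deduplication); the thresholds and methods MathPile actually uses are documented in its open-sourced scripts.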
Authors: Zengzhi Wang, Rui Xia, Pengfei Liu, Xuefeng Li