MegaMath: Pushing the Limits of Open Math Corpora (2504.02807v1)

Published 3 Apr 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in LLMs. However, the research community still lacks an open, large-scale, high-quality corpus tailored to the demands of math-centric LLM pre-training. We present MegaMath, an open dataset curated from diverse, math-focused sources through the following practices: (1) Revisiting web data: we re-extracted mathematical documents from Common Crawl with math-oriented HTML optimizations, fastText-based filtering, and deduplication, all aimed at acquiring higher-quality data from the Internet. (2) Recalling math-related code data: we identified high-quality math-related code from a large code training corpus, Stack-V2, further enhancing data diversity. (3) Exploring synthetic data: we synthesized QA-style text, math-related code, and interleaved text-code blocks from the web and code data. By integrating these strategies and validating their effectiveness through extensive ablations, MegaMath delivers 371B tokens, the largest quantity and highest quality among existing open math pre-training datasets.
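
To make the fastText-based filtering step concrete, below is a minimal sketch of how a trained binary classifier might be used to keep math documents from a web crawl. The model file name (`math_classifier.bin`), the label (`__label__math`), and the confidence threshold are assumptions for illustration; MegaMath's actual classifier, training data, and thresholds are described in the paper, not here.

```python
# Hedged sketch: fastText-based math-document filtering.
# Assumes a binary classifier trained with labels __label__math / __label__other.
import fasttext

model = fasttext.load_model("math_classifier.bin")  # hypothetical model file

def is_math_document(text: str, threshold: float = 0.9) -> bool:
    """Keep a crawled document if the classifier is confident it is math."""
    # fastText's predict() expects single-line input, so strip newlines first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__math" and probs[0] >= threshold

docs = ["Let f(x) = x^2. Then f'(x) = 2x.", "Celebrity gossip of the week..."]
kept = [d for d in docs if is_math_document(d)]
```

In a pipeline like the one the abstract outlines, this filter would run after math-oriented HTML extraction and before deduplication.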

Authors (8)
  1. Fan Zhou (111 papers)
  2. Zengzhi Wang (13 papers)
  3. Nikhil Ranjan (3 papers)
  4. Zhoujun Cheng (19 papers)
  5. Liping Tang (23 papers)
  6. Guowei He (19 papers)
  7. Zhengzhong Liu (28 papers)
  8. Eric P. Xing (192 papers)
