
Llemma: An Open Language Model For Mathematics (2310.10631v3)

Published 16 Oct 2023 in cs.CL, cs.AI, and cs.LO

Abstract: We present Llemma, a large language model for mathematics. We continue pretraining Code Llama on the Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code, yielding Llemma. On the MATH benchmark Llemma outperforms all known open base models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, Llemma is capable of tool use and formal theorem proving without any further finetuning. We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments.

Summary

  • The paper introduces Llemma, a specialized language model that significantly outperforms open base models on mathematical benchmarks.
  • It employs continued pretraining on Proof-Pile-2, a 55-billion-token, math-focused dataset, with optimizations such as FlashAttention to enhance reasoning capabilities.
  • Llemma demonstrates robust performance in formal proof generation and tool-augmented problem solving, marking a milestone in mathematical language modeling.

An Overview of Llemma: Optimizing LLMs for Mathematical Problems

The paper introduces "Llemma," a specialized LLM for mathematical problems, built on Code Llama. Its development is motivated by the need for domain-specific LLMs that can handle mathematical reasoning effectively, an area where generalist models fall short.

Model and Training

Llemma’s architecture is based on the Code Llama variants with 7 billion and 34 billion parameters. It undergoes further pretraining on a meticulously curated dataset named Proof-Pile-2, which comprises 55 billion tokens of scientific papers, web data, and code, all focused on mathematics. This strategy of continued pretraining is grounded in the hypothesis that mathematical problem solving demands intensive domain-specific knowledge and reasoning capabilities.
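
As a rough illustration, the sketch below continues a causal language model from a public Code Llama checkpoint on a mathematical corpus. This is a minimal sketch, not the authors' distributed training setup; the checkpoint name, the dataset identifier, and its `text` field are assumptions about the publicly released artifacts.

```python
# Minimal sketch of continued pretraining on a mathematical corpus.
# Assumptions: the Hugging Face checkpoint "codellama/CodeLlama-7b-hf",
# the dataset id "EleutherAI/proof-pile-2", and a `text` field per document.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

# Stream the corpus and tokenize each document. The real dataset may carry
# extra metadata fields that would also need to be dropped here.
corpus = load_dataset("EleutherAI/proof-pile-2", split="train", streaming=True)
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=4096),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llemma-continued-pretrain",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=64,
        learning_rate=1e-5,   # placeholder; not the paper's schedule
        max_steps=10_000,     # required with a streaming (length-less) dataset
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False gives the next-token (causal) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```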

The training methodology is standard autoregressive language modeling, with optimizations such as FlashAttention for computational efficiency. The training data is thoughtfully composed of scientific literature, high-quality web content, and a selection of mathematical code that includes formally verified proofs.
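
For instance, in the Hugging Face stack comparable fused attention kernels can be enabled with the `attn_implementation` flag (this requires the separate flash-attn package and a CUDA GPU). The snippet is a stand-in for the paper's own training code, shown only to make the autoregressive objective concrete:

```python
# Sketch of the causal LM objective with FlashAttention-2 enabled; a stand-in
# for the paper's training code, not a reproduction of it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # needs flash-attn + CUDA
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

batch = tokenizer("Theorem: the sum of two even integers is even.",
                  return_tensors="pt").to("cuda")

# Autoregressive training: each token is predicted from its left context,
# so the labels are the input ids (the model shifts them internally).
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()
```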

Evaluation of Mathematical Proficiency

The paper emphasizes rigorous evaluation over several benchmarks:

  1. Mathematical Problem Solving: Benchmarks such as MATH and GSM8k test Llemma’s ability to solve mathematical problems without access to external tools.
  2. Tool Augmentation: Llemma’s effectiveness is further explored in settings that allow tool use, such as a Python interpreter for intermediate computational steps, which improves its problem-solving performance (see the sketch after this list).
  3. Formal Mathematics: Llemma can generate formal proofs in proof assistants such as Lean and Isabelle, which validate mathematical reasoning in a fully formal, machine-checked setting.
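
To make the tool-augmented setting concrete, below is a hedged sketch of one way such an evaluation loop can work: the model is asked to write a Python program for the intermediate computation, the program runs in a subprocess, and its printed output is taken as the answer. The prompt wording and the `generate_program` callable are hypothetical stand-ins, not the paper's evaluation harness.

```python
# Hedged sketch of tool-augmented problem solving via a Python interpreter.
import subprocess

def solve_with_python(problem: str, generate_program) -> str:
    prompt = (f"Problem: {problem}\n"
              "Write a Python program that prints only the final answer.\n")
    code = generate_program(prompt)  # any text-completion function
    # Run the generated program in an isolated interpreter with a timeout.
    result = subprocess.run(["python", "-c", code],
                            capture_output=True, text=True, timeout=30)
    return result.stdout.strip()

# Example with a trivial stand-in generator:
answer = solve_with_python(
    "What is the sum of the first 100 positive integers?",
    lambda p: "print(sum(range(1, 101)))",
)
print(answer)  # 5050
```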

Performance and Competitiveness

The results indicate that Llemma outperforms all known open base models on the MATH benchmark and is competitive with proprietary models such as Minerva. Most notably, continued pretraining on the domain-specific Proof-Pile-2 dataset directly yields significant improvements over the original Code Llama models.

The open release of the Llemma models and datasets encourages further research and development, providing a robust base for future advances in mathematical language modeling.

Implications and Future Directions

Llemma’s development and public release mark a significant milestone in the application of LLMs to mathematical domains. The implications of this work span both practical applications where precise mathematical reasoning is required and theoretical advancements in model training methodologies. Future directions could involve refining Llemma’s capabilities in diverse mathematical subfields and enhancing its integration with formal verification tools.

In conclusion, Llemma embodies a successful case of domain-specific adaptation in LLMs, offering insights and a framework that can be extrapolated to other specialized domains in AI research.
