Llemma: An Open Language Model For Mathematics (2310.10631v3)
Published 16 Oct 2023 in cs.CL, cs.AI, and cs.LO
Abstract: We present Llemma, a large language model (LLM) for mathematics. We continue pretraining Code Llama on the Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code, yielding Llemma. On the MATH benchmark, Llemma outperforms all known open base models, as well as the unreleased Minerva model suite, on an equi-parameter basis. Moreover, Llemma is capable of tool use and formal theorem proving without any further finetuning. We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments.
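Since the abstract notes that the model checkpoints are openly released, here is a minimal sketch of prompting a Llemma checkpoint on a MATH-style problem with Hugging Face `transformers`. The Hub identifier `EleutherAI/llemma_7b`, the prompt wording, and the generation settings are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: greedy decoding from a released Llemma checkpoint on a
# MATH-style problem. Assumes the 7B model is available on the Hugging Face
# Hub under the (assumed) id "EleutherAI/llemma_7b" and that a GPU is present.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "EleutherAI/llemma_7b"  # assumed Hub id for the 7B checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# A chain-of-thought style prompt for a small MATH-like problem.
prompt = (
    "Problem: What is the remainder when 2^10 is divided by 7?\n"
    "Solution:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same pattern extends to few-shot evaluation (prepend worked examples to the prompt) or to tool-use prompts that interleave Python code, which the paper evaluates without any further finetuning.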