OLMo: Accelerating the Science of Language Models (2402.00838v4)

Published 1 Feb 2024 in cs.CL

Abstract: Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, we have built OLMo, a competitive, truly Open Language Model, to enable the scientific study of language models. Unlike most prior efforts that have only released model weights and inference code, we release OLMo alongside open training data and training and evaluation code. We hope this release will empower the open research community and inspire a new wave of innovation.

The paper "Accelerating the Science of LLMs" explores the importance of having open access to powerful LLMs for the research community. LLMs (LMs) are essential in NLP and have become critical in commercial applications. However, the development of very powerful models is often restricted by proprietary systems that do not disclose vital details about their training data, architecture, and development. These undisclosed details are crucial for scientifically studying these models, which includes understanding their biases and potential risks.

To address this, the paper presents OLMo, a state-of-the-art open LLM released not only with model weights and inference code but with the entire framework, including training data, training code, and evaluation tools. This comprehensive release is intended to encourage and empower the research community to explore, innovate, and strengthen the understanding of LLMs.
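
As an illustration of what this level of openness enables in practice, the sketch below loads a released checkpoint for inference through the Hugging Face transformers API. The model identifier "allenai/OLMo-7B" and the availability of a native transformers integration are assumptions made for the example, not details stated in the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed (not stated in the paper): the 7B checkpoint is published on the
# Hugging Face Hub under this identifier with a native transformers integration.
model_id = "allenai/OLMo-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Language models are"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```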

Key Points of the Paper:

  1. The Need for Open LLMs:
    • The paper emphasizes the importance of transparency in the development of LLMs for scientific advancement. Without access to the details of these models, it is challenging to assess their full potential and limitations, especially concerning biases and security risks.
  2. Introduction to OLMo:
    • OLMo is introduced as a new open LLM that offers a full suite of tools and resources, including training data, training procedures, evaluation frameworks, and model weights, to facilitate comprehensive research and innovation in language modeling.
  3. Comparison with Other Models:
    • The paper compares OLMo to other open LLMs such as Mistral, LLaMA, Falcon, and BLOOM, which offer varying levels of openness. OLMo distinguishes itself by providing a complete framework for research and development, including intermediate checkpoints and training logs that give greater insight into how models evolve during training.
  4. Technical Specifications:
    • The OLMo model uses a decoder-only transformer architecture with several enhancements over the base transformer, such as non-parametric layer normalization, rotary positional embeddings, and modifications that improve training stability and efficiency (two of these choices are illustrated in the first code example after this list).
  5. Dataset and Training:
    • The training dataset, Dolma, is built specifically for open research and includes three trillion tokens from various sources. The dataset is designed to be diverse and reproducible, supporting different research avenues related to the effects of training data on model performance.
  6. Evaluation Methodology:
    • OLMo undergoes extensive evaluation using tools such as Catwalk for downstream task evaluation and Paloma for assessing perplexity across different domains. These evaluations aim to provide a clear comparison with publicly available models while building a robust understanding of the model's capabilities and limitations (the underlying perplexity computation is sketched in the second code example after this list).
  7. Adaptation and Safety:
    • The paper outlines procedures for adapting the model, including instruction tuning and reinforcement learning from human feedback, to improve the model's safety, performance, and applicability in more diverse contexts (a bare-bones instruction-tuning step is sketched in the third code example after this list).
  8. Environmental Considerations:
    • The environmental impact of training large models is acknowledged, with reported metrics on carbon emissions and energy consumption during OLMo training. The paper suggests that making these models publicly available can reduce duplicated effort and contribute to greener AI practices (the typical emissions arithmetic is sketched in the fourth code example after this list).
  9. Future Work and Releases:
    • Future expansions of OLMo will include advancements in model size, modality, and safety measures. Continued development aims to further support the open research community and explore under-represented areas in the field of language modeling.
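
The first sketch below illustrates two of the architectural choices summarized in point 4: layer normalization without learnable parameters and rotary positional embeddings inside a pre-norm, decoder-only block. It is an illustrative PyTorch reconstruction under simplified assumptions (small dimensions, a plain GELU feed-forward network), not the released OLMo implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def nonparametric_layer_norm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """LayerNorm without learnable gain or bias (the non-parametric variant)."""
    return F.layer_norm(x, x.shape[-1:], eps=eps)


def apply_rope(x: torch.Tensor) -> torch.Tensor:
    """Apply rotary positional embeddings (half-split variant).

    x: (batch, seq_len, n_heads, head_dim), with an even head_dim.
    """
    _, t, _, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]  # (t, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class DecoderBlock(nn.Module):
    """One pre-norm, decoder-only transformer block (simplified)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model, bias=False),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        h = nonparametric_layer_norm(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q = apply_rope(q.view(b, t, self.n_heads, self.head_dim))
        k = apply_rope(k.view(b, t, self.n_heads, self.head_dim))
        v = v.view(b, t, self.n_heads, self.head_dim)
        attn = F.scaled_dot_product_attention(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True
        )
        x = x + self.proj(attn.transpose(1, 2).reshape(b, t, d))
        return x + self.mlp(nonparametric_layer_norm(x))


print(DecoderBlock()(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])
```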
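
Point 6 centers on perplexity-based evaluation. The second sketch shows the underlying computation (token-weighted average negative log-likelihood, exponentiated) for a causal LM over a list of documents. It is a minimal stand-in for, not a reproduction of, the Catwalk and Paloma tooling, and the domain names in the usage comment are hypothetical.

```python
import math
import torch


@torch.no_grad()
def perplexity(model, tokenizer, documents, max_length: int = 1024) -> float:
    """Token-weighted perplexity of a causal LM over a list of text documents."""
    total_nll, total_tokens = 0.0, 0
    for text in documents:
        ids = tokenizer(text, return_tensors="pt", truncation=True,
                        max_length=max_length).input_ids
        if ids.size(1) < 2:
            continue
        # With labels equal to the inputs, Hugging Face causal LMs shift them
        # internally and return the mean next-token cross-entropy.
        loss = model(ids, labels=ids).loss
        n_predicted = ids.size(1) - 1
        total_nll += loss.item() * n_predicted
        total_tokens += n_predicted
    return math.exp(total_nll / total_tokens)


# Usage with hypothetical per-domain document lists:
# for domain, docs in {"web": web_docs, "code": code_docs}.items():
#     print(domain, perplexity(model, tokenizer, docs))
```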
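
For the adaptation stage in point 7, the third sketch shows a generic supervised instruction-tuning step: fine-tune a causal LM on an (instruction, response) pair while masking the prompt tokens so the loss is computed only on the response. This is an illustration of the general technique, not the adaptation recipe used for OLMo, and the optimizer settings are arbitrary.

```python
import torch
from torch.optim import AdamW

IGNORE_INDEX = -100  # label value ignored by Hugging Face causal-LM losses


def sft_step(model, tokenizer, instruction: str, response: str, optimizer) -> float:
    """One supervised fine-tuning step: train only on the response tokens."""
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    response_ids = tokenizer(response, return_tensors="pt",
                             add_special_tokens=False).input_ids
    full_ids = torch.cat([prompt_ids, response_ids], dim=1)
    labels = full_ids.clone()
    labels[:, : prompt_ids.size(1)] = IGNORE_INDEX  # mask the prompt tokens
    loss = model(full_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()


# Usage (model and tokenizer loaded as in the earlier inference sketch):
# optimizer = AdamW(model.parameters(), lr=2e-5)
# sft_step(model, tokenizer, "Explain rotary embeddings.\n", "They rotate ...", optimizer)
```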
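
Finally, for point 8, emissions reporting typically follows a simple accounting: GPU energy consumption, scaled by datacenter overhead (PUE), multiplied by the grid's carbon intensity. The fourth sketch shows that arithmetic with made-up inputs; none of the numbers are figures reported in the paper.

```python
def training_emissions_tco2eq(gpu_count: int, hours: float, avg_gpu_power_kw: float,
                              pue: float, grid_kgco2eq_per_kwh: float) -> float:
    """Estimated training emissions in metric tons of CO2-equivalent."""
    energy_kwh = gpu_count * hours * avg_gpu_power_kw * pue
    return energy_kwh * grid_kgco2eq_per_kwh / 1000.0  # kg -> metric tons


# Illustrative inputs only (not the paper's figures): 1,000 GPUs for 30 days
# at 0.3 kW average draw, PUE 1.1, grid intensity 0.4 kgCO2eq/kWh.
print(round(training_emissions_tco2eq(1000, 30 * 24, 0.3, 1.1, 0.4), 1), "tCO2eq")
```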

The paper concludes by emphasizing the significance of open models in advancing NLP, underscored by the comprehensive release of OLMo's framework and tools, which sets a new standard for accessibility and collaborative progress in understanding and utilizing LLMs.

Authors (43)
  1. Dirk Groeneveld (19 papers)
  2. Iz Beltagy (39 papers)
  3. Pete Walsh (9 papers)
  4. Akshita Bhagia (12 papers)
  5. Rodney Kinney (8 papers)
  6. Oyvind Tafjord (49 papers)
  7. Ananya Harsh Jha (8 papers)
  8. Hamish Ivison (14 papers)
  9. Ian Magnusson (12 papers)
  10. Yizhong Wang (42 papers)
  11. Shane Arora (8 papers)
  12. David Atkinson (33 papers)
  13. Russell Authur (4 papers)
  14. Khyathi Raghavi Chandu (24 papers)
  15. Arman Cohan (121 papers)
  16. Jennifer Dumas (2 papers)
  17. Yanai Elazar (44 papers)
  18. Yuling Gu (16 papers)
  19. Jack Hessel (50 papers)
  20. Tushar Khot (53 papers)
Citations (225)