Stable LM 2 1.6B Technical Report (2402.17834v1)

Published 27 Feb 2024 in cs.CL and stat.ML

Abstract: We introduce StableLM 2 1.6B, the first in a new generation of our LLM series. In this technical report, we present in detail the data and training procedure leading to the base and instruction-tuned versions of StableLM 2 1.6B. The weights for both models are available via Hugging Face for anyone to download and use. The report contains thorough evaluations of these models, including zero- and few-shot benchmarks, multilingual benchmarks, and the MT benchmark focusing on multi-turn dialogues. At the time of publishing this report, StableLM 2 1.6B was the state-of-the-art open model under 2B parameters by a significant margin. Given its appealing small size, we also provide throughput measurements on a number of edge devices. In addition, we open source several quantized checkpoints and provide their performance metrics compared to the original model.

Introducing Stable LM 2 1.6B: A Compact LLM with Multilingual Capabilities and Open Licensing

Overview

Stable LM 2 1.6B marks a significant advance in the development of compact, efficient, and openly accessible LLMs. As a successor in the Stable LM series, it sets a new benchmark for performance among open models under 2B parameters. Its design and training are openly documented, with full transparency about the datasets used, the training procedure, and performance across multiple languages and tasks, which supports reproducibility and further research within the AI community.
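
Because the weights are published on Hugging Face, a minimal sketch of loading and sampling from the base checkpoint with the transformers library is shown below. The repository id stabilityai/stablelm-2-1_6b reflects the public release at the time of writing and should be verified against the hub listing; this is a usage sketch, not code from the report.

```python
# Hedged sketch: load the released base checkpoint and sample a short completion.
# Assumes the public repository id "stabilityai/stablelm-2-1_6b"; older transformers
# releases may additionally require trust_remote_code=True.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stablelm-2-1_6b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 1.6B parameters fit comfortably in bf16 on a single GPU
    device_map="auto",
)

prompt = "Small language models are useful because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```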

Training and Data

Pre-Training

The model was pre-trained from scratch on a diverse mix of data sources to broaden its linguistic coverage and versatility. Training follows a standard autoregressive (next-token prediction) objective and relies on FlashAttention-2 together with sequence-level parallelism optimizations for efficiency. The datasets span academic sources, books, web content, and specific domains such as law and math, totaling approximately 2 trillion tokens. Notably, the training mix includes multilingual data, giving the model proficiency beyond English. Detailed documentation of the mix, including sampling weights and epoch counts, supports transparency and reproducibility.
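
For readers less familiar with the terminology, the "standard autoregressive training approach" is ordinary next-token prediction: the model predicts token t+1 from the tokens up to t, and training minimizes the cross-entropy of those predictions. The following generic sketch of the causal language-modeling loss is illustrative only and is not the authors' training code.

```python
# Generic causal language-modeling (next-token prediction) loss used in autoregressive
# pre-training. Function and variable names are illustrative, not from the Stable LM 2 codebase.
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab_size); input_ids: (batch, seq_len)."""
    shift_logits = logits[:, :-1, :].contiguous()  # predictions for positions 0..T-2
    shift_labels = input_ids[:, 1:].contiguous()   # targets are the next tokens 1..T-1
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```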

Fine-Tuning

The fine-tuning process combined supervised fine-tuning, direct preference optimization (DPO), and self-knowledge learning to refine the model's conversational abilities and align it with human preferences. This stage draws on varied conversational datasets and excludes multilingual data, concentrating the alignment effort on dialogue quality rather than additional language coverage.
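
To make the preference stage concrete, below is a minimal sketch of the standard DPO objective in PyTorch; it shows the general published formulation rather than the authors' exact implementation, data, or hyperparameters (the beta value is an arbitrary placeholder).

```python
# Standard DPO loss on sequence-level log-probabilities of preferred ("chosen") and
# dispreferred ("rejected") responses, computed under the policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_chosen | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_rejected | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # placeholder temperature on the implicit reward
) -> torch.Tensor:
    chosen_margin = policy_chosen_logps - ref_chosen_logps        # implicit reward of chosen response
    rejected_margin = policy_rejected_logps - ref_rejected_logps  # implicit reward of rejected response
    # Maximize the probability that the chosen response is ranked above the rejected one.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```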

Performance Benchmarks

The model shows strong results across zero-shot, few-shot, and multilingual evaluations. It competes with models roughly twice its size and, at the time of publication, set the state of the art among open models under 2B parameters. Its multilingual capability is reflected in strong scores on the non-English languages seen during pre-training, and its conversational ability is confirmed by competitive results on the MT-Bench multi-turn benchmark.
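
For context, zero- and few-shot scores of this kind are commonly reproduced with EleutherAI's lm-evaluation-harness. The sketch below shows one way to run comparable evaluations; task names and the simple_evaluate signature differ across harness versions, so it should be treated as illustrative rather than a recipe for the paper's exact numbers.

```python
# Hedged sketch: zero-shot evaluation of the base model with lm-evaluation-harness.
# Task names and keyword arguments may vary by harness version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=stabilityai/stablelm-2-1_6b,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "winogrande"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task accuracy and related metrics
```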

Inference and Quantization

A critical focus of Stable LM 2 1.6B is its efficiency and adaptability for on-device execution. The model has been optimized and quantized for performance on edge devices, with quantization files made available for different inference frameworks. This step is crucial for expanding the applicability of advanced generative capabilities to mobile and consumer-grade hardware without substantial computational overhead.
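
As an illustration of on-device inference, the hedged sketch below loads a quantized GGUF checkpoint with llama-cpp-python; the file name is a hypothetical placeholder for whichever quantized artifact (e.g. a Q4_K_M file) accompanies the release, and the thread count should be tuned to the target device.

```python
# Hedged sketch: CPU/edge inference from a quantized GGUF checkpoint via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="stablelm-2-1_6b.Q4_K_M.gguf",  # hypothetical local path to a quantized checkpoint
    n_ctx=4096,   # context window
    n_threads=4,  # tune for the target device
)

out = llm("List three advantages of sub-2B language models:", max_tokens=128)
print(out["choices"][0]["text"])
```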

Future Directions

The paper outlines several avenues for further research, including improvements in data quality, hallucination mitigation, extending context lengths, and exploring conditional computation techniques like Mixture of Experts. These areas promise to enhance the model's performance, further reduce computational requirements, or expand its applicability.
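
To make the conditional-computation idea concrete, the sketch below implements a generic top-k routed Mixture-of-Experts layer: a router selects k expert MLPs per token, so only a fraction of the layer's parameters is active on any forward pass. It is purely illustrative of the concept named as future work and is not part of Stable LM 2.

```python
# Generic top-k routed Mixture-of-Experts feed-forward layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); each token is routed to its top-k experts.
        weights, idx = torch.topk(self.router(x), self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e  # tokens assigned to expert e in this routing slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```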

Environmental and Societal Considerations

The report transparently discusses the environmental impact of training Stable LM 2, estimating the carbon footprint based on power consumption and GPU hours. Furthermore, the decision to release the model under an open non-commercial license reflects a commitment to accessibility and responsible use, although it also acknowledges the challenges in assessing the broader societal impacts of such open releases.

Conclusion

Stable LM 2 1.6B represents a balance between performance, efficiency, and accessibility, embodying advancements in LLM training and evaluation. By providing a transparent account of its development process and performance benchmarks, the model contributes valuable insights to the AI community. It encourages further innovation in the development of compact, multilingual, and efficient LLMs that are both powerful and accessible for a wide range of applications.

Authors (19)
  1. Marco Bellagente (13 papers)
  2. Jonathan Tow (7 papers)
  3. Dakota Mahan (6 papers)
  4. Duy Phung (9 papers)
  5. Maksym Zhuravinskyi (6 papers)
  6. Reshinth Adithyan (4 papers)
  7. James Baicoianu (2 papers)
  8. Ben Brooks (1 paper)
  9. Nathan Cooper (35 papers)
  10. Ashish Datta (2 papers)
  11. Meng Lee (1 paper)
  12. Emad Mostaque (1 paper)
  13. Michael Pieler (10 papers)
  14. Nikhil Pinnaparju (1 paper)
  15. Paulo Rocha (8 papers)
  16. Harry Saini (3 papers)
  17. Hannah Teufel (7 papers)
  18. Carlos Riquelme (26 papers)
  19. Niccolo Zanichelli (1 paper)