LLM360: Towards Fully Transparent Open-Source LLMs (2312.06550v1)

Published 11 Dec 2023 in cs.CL, cs.AI, and cs.LG

Abstract: The recent surge in open-source LLMs, such as LLaMA, Falcon, and Mistral, provides diverse options for AI practitioners and researchers. However, most LLMs have only released partial artifacts, such as the final model weights or inference code, and technical reports increasingly limit their scope to high-level design choices and surface statistics. These choices hinder progress in the field by degrading transparency into the training of LLMs and forcing teams to rediscover many details in the training process. We present LLM360, an initiative to fully open-source LLMs, which advocates for all training code and data, model checkpoints, and intermediate results to be made available to the community. The goal of LLM360 is to support open and collaborative AI research by making the end-to-end LLM training process transparent and reproducible by everyone. As a first step of LLM360, we release two 7B parameter LLMs pre-trained from scratch, Amber and CrystalCoder, including their training code, data, intermediate checkpoints, and analyses (at https://www.LLM360.ai). We are committed to continually pushing the boundaries of LLMs through this open-source effort. More large-scale and stronger models are underway and will be released in the future.

Introduction

The paper introduces LLM360, an initiative to enhance the transparency of LLMs by promoting the open-sourcing of comprehensive training details. The initiative responds to a recent trend of restricting access to the training processes of LLMs, which creates hurdles for replicability and innovation. LLM360 aims to reverse this trend by advocating the sharing of training code, data, model checkpoints, and analyses. As part of the initiative, the paper presents the release of two LLMs, Amber and CrystalCoder, accompanied by extensive training materials made available to the public.

Transparency and Challenges in LLM Research

The open-sourcing philosophy behind LLM360 extends beyond model weights to training code and the nuanced details involved in creating LLMs. This approach is designed to address several challenges in the LLM field, such as:

  • Data provenance: understanding the origin and composition of training data is a prerequisite for assessing and mitigating biases.
  • Reproducibility hurdles: without disclosure of the full training configuration, reported results are difficult to validate.
  • Barriers to open collaboration: releasing only final model weights limits research into emergent abilities and into how training data shapes LLM behavior (see the sketch after this list).
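
Releasing intermediate checkpoints makes this kind of study concrete. As a minimal sketch, assuming the released checkpoints are exposed as revisions of a Hugging Face model repository (the revision tags below are hypothetical placeholders, not the actual tag names), one could track a simple quantity such as prompt perplexity across training:

```python
# Minimal sketch (not the paper's own analysis code): probing how model
# behavior changes across intermediate checkpoints, assuming they are
# published as Hugging Face revisions. The revision tags below are
# illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO = "LLM360/Amber"                              # Amber release repository
REVISIONS = ["ckpt_010", "ckpt_100", "ckpt_300"]   # hypothetical revision tags
PROMPT = "The quick brown fox jumps over the lazy dog."

tokenizer = AutoTokenizer.from_pretrained(REPO)
inputs = tokenizer(PROMPT, return_tensors="pt")

for rev in REVISIONS:
    model = AutoModelForCausalLM.from_pretrained(
        REPO, revision=rev, torch_dtype=torch.float16
    )
    model.eval()
    with torch.no_grad():
        # Prompt perplexity under this checkpoint: a crude proxy for how
        # language-modeling quality evolves over the course of training.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    print(f"{rev}: perplexity = {loss.exp().item():.2f}")
```

The same loop generalizes to memorization probes or benchmark evaluations, which is exactly the kind of analysis that releasing only final weights rules out.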

LLM360 Framework and Initial Model Releases

LLM360 focuses on a complete open-source effort that includes all training components, intermediate checkpoints, model configurations, and data origins. Specifically, the paper introduces Amber and CrystalCoder, two 7B-parameter LLMs trained from scratch, detailing their development, data sources, and training methodologies. The framework embodies transparency across code, training procedures, and intermediate checkpoints, aiming to set a standard for future model releases.
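
A minimal sketch of what such completeness can mean in practice is shown below; the field names are illustrative assumptions, not LLM360's actual release schema, but they capture the idea of tying every checkpoint back to its code, configuration, and data slice.

```python
# Illustrative sketch (assumed field names, not LLM360's actual schema):
# the kind of record a fully transparent release can attach to each
# intermediate checkpoint so that weights, code, configuration, and the
# data seen so far can all be cross-referenced.
from dataclasses import dataclass, field

@dataclass
class CheckpointRecord:
    step: int                   # training step at which the checkpoint was saved
    weights_path: str           # model weights for this step
    optimizer_state_path: str   # optimizer / LR-scheduler state for exact resumption
    config_path: str            # full training configuration (hyperparameters, parallelism)
    code_commit: str            # commit of the training code that produced the checkpoint
    data_chunk_ids: list = field(default_factory=list)  # pretraining data chunks consumed so far
    metrics: dict = field(default_factory=dict)         # logged loss / evaluation numbers

record = CheckpointRecord(
    step=100_000,
    weights_path="checkpoints/step_100000/model.safetensors",   # hypothetical paths throughout
    optimizer_state_path="checkpoints/step_100000/optimizer.pt",
    config_path="configs/pretrain_7b.yaml",
    code_commit="abc1234",
    data_chunk_ids=["chunk_000", "chunk_001"],
    metrics={"train_loss": 2.05},
)
print(record.step, record.code_commit, len(record.data_chunk_ids))
```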

Future Directions and Conclusion

Looking ahead, LLM360 promises the release of larger, more capable models while maintaining open-source principles. The initiative paves the way for continued research collaboration and methodological development, aiming to improve training data mixtures, filtering techniques, and optimization strategies. The paper concludes with a commitment to the LLM360 vision of advancing both the sophistication and the openness of LLM pre-training, while acknowledging the need for responsible use, risk management, and community engagement.
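
As a toy illustration of the data-mixture question (the sources, weights, and token budget below are hypothetical, not the paper's actual mixture), mixture weights translate directly into per-source token budgets:

```python
# Toy illustration: turning data-mixture weights into per-source token
# budgets for a fixed pretraining budget. Sources, weights, and the total
# budget are hypothetical, not the actual LLM360 mixture.
TOTAL_TOKENS = 1.2e12  # example budget: 1.2T tokens

mixture = {
    "web": 0.67,
    "code": 0.15,
    "books": 0.08,
    "wikipedia": 0.05,
    "academic": 0.05,
}
assert abs(sum(mixture.values()) - 1.0) < 1e-9  # weights must sum to 1

for source, weight in mixture.items():
    print(f"{source:>10}: {weight * TOTAL_TOKENS / 1e9:,.0f}B tokens")
```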

Authors (28)
  1. Zhengzhong Liu (28 papers)
  2. Aurick Qiao (9 papers)
  3. Willie Neiswanger (68 papers)
  4. Hongyi Wang (62 papers)
  5. Bowen Tan (23 papers)
  6. Tianhua Tao (10 papers)
  7. Junbo Li (35 papers)
  8. Yuqi Wang (62 papers)
  9. Suqi Sun (2 papers)
  10. Omkar Pangarkar (2 papers)
  11. Richard Fan (11 papers)
  12. Yi Gu (69 papers)
  13. Victor Miller (5 papers)
  14. Yonghao Zhuang (10 papers)
  15. Guowei He (19 papers)
  16. Haonan Li (43 papers)
  17. Fajri Koto (47 papers)
  18. Liping Tang (23 papers)
  19. Nikhil Ranjan (3 papers)
  20. Zhiqiang Shen (172 papers)