LLM360 K2: Scaling Up 360-Open-Source Large Language Models (2501.07124v1)

Published 13 Jan 2025 in cs.LG

Abstract: We detail the training of the LLM360 K2-65B model, scaling up our 360-degree Open Source approach to the largest and most powerful models under project LLM360. While open-source LLMs continue to advance, the answer to "How are the largest LLMs trained?" remains unclear within the community. The implementation details for such high-capacity models are often protected due to business considerations associated with their high cost. This lack of transparency prevents LLM researchers from leveraging valuable insights from prior experience, e.g., "What are the best practices for addressing loss spikes?" The LLM360 K2 project addresses this gap by providing full transparency and access to resources accumulated during the training of LLMs at the largest scale. This report highlights key elements of the K2 project, including our first model, K2 Diamond, a 65 billion-parameter LLM that surpasses LLaMA-65B and rivals LLaMA2-70B, while requiring fewer FLOPs and tokens. We detail the implementation steps and present a longitudinal analysis of K2 Diamond's capabilities throughout its training process. We also outline ongoing projects such as TXT360, setting the stage for future models in the series. By offering previously unavailable resources, the K2 project also resonates with the 360-degree Open Source principles of transparency, reproducibility, and accessibility, which we believe are vital in the era of resource-intensive AI research.

Overview of LLM360 K2: Scaling Up 360-Open-Source LLMs

The paper presents the LLM360 K2 project, centered on K2 Diamond, a 65 billion-parameter LLM developed under the LLM360 initiative, which aims to provide fully transparent and reproducible open-source models. It addresses a persistent problem in the LLM community: the implementation details of the largest models are rarely disclosed, largely because of the business considerations attached to their high training cost. The K2 project seeks to fill this gap by releasing extensive resources and insights from the training of a high-capacity model, thereby promoting transparency, reproducibility, and accessibility.

Key Contributions

  1. Model Development and Comparison: K2 Diamond, the primary model of the K2 project, surpasses LLaMA-65B and rivals LLaMA2-70B while requiring fewer FLOPs and training tokens. The model is trained on 1.4 trillion tokens, drawing from a diverse dataset that includes web data, high-quality textbooks, and domain-specific content (a rough compute comparison is sketched after this list).
  2. Comprehensive Documentation: The project meticulously documents the pretraining process, hyperparameters, training algorithms, and model architecture, along with a detailed account of the data curation pipeline, making K2 one of the most thoroughly documented open-source models at this scale.
  3. Transparent Artifacts: LLM360 K2 releases a wide array of resources, including model checkpoints, training logs, and raw evaluation outputs, intended to help researchers trace the model's evolution and reproduce its results (a checkpoint-loading sketch follows this list).
  4. Observations on Loss Spikes: During training, the team observed loss spikes, which they categorized as benign or malignant based on their impact on model performance. This documentation provides valuable insights for future research on training stability.
  5. Open-Source Principles: The paper emphasizes the significance of reproducibility, transparency, and accessibility. The entire development lifecycle of the LLM, from data curation to post-training optimization, adheres to these principles, setting an exemplary standard for future projects.
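
As a back-of-the-envelope illustration of the compute comparison in item 1, the standard 6·N·D rule of thumb (training FLOPs ≈ 6 × parameters × training tokens) can be applied to the publicly reported figures. This is a rough sketch, not the paper's own FLOP accounting; the 2-trillion-token figure for LLaMA2-70B comes from its public release notes.

```python
# Rough training-compute comparison via the common 6*N*D approximation
# (FLOPs ~= 6 * parameter count * training tokens). Illustrative only; not
# the paper's own accounting.
models = {
    "K2 Diamond": (65e9, 1.4e12),   # 65B parameters, 1.4T tokens (from the report)
    "LLaMA2-70B": (70e9, 2.0e12),   # 70B parameters, ~2T tokens (publicly reported)
}

flops = {name: 6 * n_params * n_tokens for name, (n_params, n_tokens) in models.items()}
for name, f in flops.items():
    print(f"{name}: ~{f:.2e} training FLOPs")

ratio = flops["K2 Diamond"] / flops["LLaMA2-70B"]
print(f"K2 Diamond uses roughly {ratio:.0%} of the estimated LLaMA2-70B training compute.")
```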
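
The artifacts in item 3 are distributed through the project's Hugging Face organization. Below is a minimal sketch of loading a released checkpoint with the transformers library; the repository identifier "LLM360/K2" and the use of revision tags for intermediate checkpoints are assumptions that should be checked against the project's release page.

```python
# Minimal sketch: load a released K2 checkpoint with Hugging Face transformers.
# The repo id "LLM360/K2" and the revision naming are assumptions; consult the
# LLM360 release page for the exact identifiers of intermediate checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "LLM360/K2"   # assumed repository under the LLM360 organization
revision = "main"       # intermediate checkpoints may be exposed as branches or tags

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    revision=revision,
    torch_dtype="auto",   # keep the precision stored in the checkpoint
    device_map="auto",    # shard the 65B model across available GPUs (requires accelerate)
)

prompt = "The LLM360 project releases"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```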

Implications and Future Work

The transparency provided by the LLM360 K2 project has significant implications for AI research and practical applications. By giving the community a complete view of how a model at this scale is trained, the project fosters innovation and collaboration. Future work could explore even larger models and more diverse datasets while maintaining the commitment to an open-source ethos.

By offering open access to model training data and methodologies, the project not only helps democratize AI research but also provides a firmer basis for addressing ethical considerations in AI development. As the trend towards transparency grows, so does the potential for ethically aligned and efficient AI systems.

Moreover, the documentation of challenges, such as loss spikes, presents new avenues for research, particularly in improving the stability and efficiency of training processes. This proactive approach in addressing training anomalies can lead to more robust models in the future.
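
The benign/malignant distinction can be made operational over released training logs with a simple heuristic: flag a step whose loss jumps well above the recent trend, then check whether the loss returns to that trend within a recovery window. The sketch below is illustrative only, not the criterion used by the K2 team, and the window sizes and thresholds are arbitrary placeholders.

```python
# Illustrative heuristic for spotting spikes in a logged training-loss curve and
# labelling them "benign" (loss recovers) or "malignant" (loss stays elevated).
# This is a sketch for analysing released logs, not the paper's method.
from statistics import mean

def classify_loss_spikes(losses, window=50, spike_factor=1.2, recovery_steps=200):
    """Return (step, label) pairs for steps whose loss exceeds the trailing mean
    by `spike_factor`; the label depends on whether loss returns to that trend."""
    events = []
    for t in range(window, len(losses)):
        trend = mean(losses[t - window:t])
        if losses[t] > spike_factor * trend:
            tail = losses[t + 1 : t + 1 + recovery_steps]
            recovered = any(value <= trend for value in tail)  # empty tail => malignant
            events.append((t, "benign" if recovered else "malignant"))
    return events

# Example with a synthetic curve: a slowly decaying loss with one transient spike.
curve = [2.0 - 0.001 * i for i in range(400)]
curve[250] = 3.5                       # transient spike that recovers immediately
print(classify_loss_spikes(curve))     # [(250, 'benign')]
```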

In summary, the LLM360 K2 project's commitment to openness and reproducibility not only marks a significant milestone in the development of LLMs but also sets a precedent for future projects to prioritize transparency and collaboration in AI research.

Authors (25)
  1. Zhengzhong Liu (28 papers)
  2. Bowen Tan (23 papers)
  3. Hongyi Wang (62 papers)
  4. Willie Neiswanger (68 papers)
  5. Tianhua Tao (10 papers)
  6. Haonan Li (43 papers)
  7. Fajri Koto (47 papers)
  8. Yuqi Wang (62 papers)
  9. Suqi Sun (2 papers)
  10. Omkar Pangarkar (2 papers)
  11. Richard Fan (11 papers)
  12. Yi Gu (69 papers)
  13. Victor Miller (5 papers)
  14. Liqun Ma (8 papers)
  15. Liping Tang (23 papers)
  16. Nikhil Ranjan (3 papers)
  17. Yonghao Zhuang (10 papers)
  18. Guowei He (19 papers)
  19. Renxi Wang (8 papers)
  20. Mingkai Deng (5 papers)