Steel-LLM:From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM (2502.06635v2)

Published 10 Feb 2025 in cs.CL and cs.AI

Abstract: Steel-LLM is a Chinese-centric LLM developed from scratch with the goal of creating a high-quality, open-source model despite limited computational resources. Launched in March 2024, the project aimed to train a 1-billion-parameter model on a large-scale dataset, prioritizing transparency and the sharing of practical insights to assist others in the community. The training process primarily focused on Chinese data, with a small proportion of English data included, addressing gaps in existing open-source LLMs by providing a more detailed and practical account of the model-building journey. Steel-LLM has demonstrated competitive performance on benchmarks such as CEVAL and CMMLU, outperforming early models from larger institutions. This paper provides a comprehensive summary of the project's key contributions, including data collection, model design, training methodologies, and the challenges encountered along the way, offering a valuable resource for researchers and practitioners looking to develop their own LLMs. The model checkpoints and training script are available at https://github.com/zhanshijinwat/Steel-LLM.

Authors (4)
  1. Qingshui Gu
  2. Shu Li
  3. Tianyu Zheng
  4. Zhaoxiang Zhang

Summary

  • The paper outlines the creation of a 1B-parameter Chinese-centric LLM using a transparent, resource-efficient training pipeline.
  • The authors enhance the Transformer architecture with Flash Attention and Soft MOE in the FFN, alongside mixed-precision and parallelism techniques.
  • The work provides complete training details, including intermediate checkpoints and rigorous data preprocessing, to ensure full reproducibility.

The paper, "Steel-LLM: From Scratch to Open Source – A Personal Journey in Building a Chinese-Centric LLM," presents a comprehensive account of developing a Chinese-centric LLM. The work details a resource-efficient approach for training a 1-billion-parameter model on just 8 GPUs, emphasizing transparency in both the training pipeline and data processing. The authors aim to fill a gap among existing open-source LLMs by focusing on Chinese-language data and providing full replication details, including intermediate checkpoints and practical training insights.

Key Contributions and Architectural Innovations

  • Transformer-Based Architecture:
    • Input tokens are represented as $\mathbf{X} \in \mathbb{R}^{m \times d}$, where:
      • $m$: number of tokens,
      • $d$: token dimension.
    • The dispatch and combination of token information involve softmax-normalized weight matrices over slots and experts, allowing a smooth integration of expert outputs (see the sketch below).
  • Enhanced FFN Module:

The FFN departs from the conventional two-layer MLP: in addition to integrating Soft MOE, SwiGLU activation is employed not only in the first layer but also in the second, extending the non-linear capacity and better capturing complex patterns in the data.
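
As a concrete illustration of the dispatch-and-combine scheme and the SwiGLU experts described above, the following is a minimal PyTorch sketch. It assumes a standard Soft MoE formulation (learnable slot parameters, softmax over tokens for dispatch and over slots for combination) with plain SwiGLU experts; the hidden sizes, expert count, and the handling of the paper's second-layer SwiGLU variant are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUExpert(nn.Module):
    """A SwiGLU feed-forward block used as a single expert (simplified)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


class SoftMoEFFN(nn.Module):
    """Soft-MoE FFN: tokens are softly dispatched to expert slots, processed, and recombined."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 4, slots_per_expert: int = 1):
        super().__init__()
        self.n_experts, self.slots_per_expert = n_experts, slots_per_expert
        n_slots = n_experts * slots_per_expert
        self.phi = nn.Parameter(torch.randn(d_model, n_slots) / d_model**0.5)
        self.experts = nn.ModuleList(SwiGLUExpert(d_model, d_hidden) for _ in range(n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, m, d) -- m tokens of dimension d
        logits = x @ self.phi                        # (batch, m, n_slots)
        dispatch = logits.softmax(dim=1)             # normalized over tokens
        combine = logits.softmax(dim=-1)             # normalized over slots
        slots = dispatch.transpose(1, 2) @ x         # (batch, n_slots, d): weighted token mixtures
        slots = slots.reshape(x.size(0), self.n_experts, self.slots_per_expert, -1)
        outs = torch.stack([expert(slots[:, i]) for i, expert in enumerate(self.experts)], dim=1)
        outs = outs.flatten(1, 2)                    # (batch, n_slots, d)
        return combine @ outs                        # (batch, m, d): per-token mixture of expert outputs
```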

  • Additional Architectural Details:
    • Rotary Position Embedding (RoPE): Employed to encode relative positional information via a rotation of query/key vectors. A mixed-precision strategy computes RoPE in FP32 while the rest of training runs in BF16 (a sketch follows this list).
    • Normalization: Pre-Norm RMSNorm is applied to both self-attention and FFN layers to stabilize training.
    • Bias Treatment: Biases are removed from most layers except the QKV layers, reducing parameter overhead.
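
The mixed-precision treatment of RoPE noted above can be sketched as follows. This is a minimal illustration assuming the common interleaved-pair RoPE formulation; the helper names and the exact rotation convention are assumptions, not the authors' code. The key point is that the rotation is computed in FP32 and the result is cast back to the model's BF16 dtype.

```python
import torch


def rope_cache(seq_len: int, head_dim: int, base: float = 10000.0):
    """Precompute RoPE cos/sin tables in FP32 (illustrative helper)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)
    return angles.cos(), angles.sin()            # each: (seq_len, head_dim // 2)


def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate query/key vectors in FP32, then cast back to the input dtype (e.g. BF16)."""
    x_f32 = x.float()                            # upcast even when the model runs in BF16
    x1, x2 = x_f32[..., 0::2], x_f32[..., 1::2]  # interleaved pairs
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2).to(x.dtype)       # back to BF16 for the rest of the forward pass
```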

Training Methodology and Efficiency Optimizations

  • Training Infrastructure and Optimizations:
    • Model Loading and Checkpointing: The framework adapts standard Transformers library conventions to support multiple architectures, while saving comprehensive training state (including data progress via a serialized data management class).
    • Data Append Capability: A strategy is presented for re-shuffling indices when new data is appended mid-training. MD5 hashing is used to detect duplicate data files.
    • Mixed-Precision and Parallelism: Combining bfloat16 mixed precision, Fully Sharded Data Parallelism (FSDP), and operator fusion (implemented via CUDA and Triton) yields a roughly 50% training speedup.
    • Ablation Study: Using a micro-batch size of 8 and a 1.8-billion-parameter model on an A100, the ablation quantifies tokens per second per GPU and the corresponding GPU memory usage across configurations.
  • Pretraining Setup:
    • The model is trained over 1.07 million steps within a 30-day period with a maximum sequence length of 2048, involving around one trillion tokens.
    • A cosine-annealing learning rate schedule is adopted with 2,000 warmup steps, a maximum learning rate of $3\times10^{-4}$, and gradient clipping at a norm of 1.0.
    • The training employs the AdamW optimizer with $\beta_1=0.9$, $\beta_2=0.95$, and a weight decay of 0.05 (a configuration sketch follows this list).
    • Pretraining data is predominantly Chinese and derived from open-source datasets such as SkyPile-150B, Wanjuan1.0, Wikipedia-cn, Baidu Baike, and others, ensuring a large-scale corpus with rigorous filtering and deduplication.
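
To make the pretraining schedule concrete, the following is a minimal sketch of the reported hyperparameters wired into a bare-bones loop. The warmup shape, the annealing floor of zero, and the HuggingFace-style `model(**batch).loss` interface are assumptions for illustration; FSDP wrapping, fused kernels, and checkpointing are omitted.

```python
import math

import torch

# Hyperparameters reported above; the annealing floor of 0 is an assumption.
MAX_LR, WARMUP_STEPS, TOTAL_STEPS = 3e-4, 2_000, 1_070_000


def lr_at(step: int) -> float:
    """Linear warmup followed by cosine annealing (sketch)."""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    return 0.5 * MAX_LR * (1.0 + math.cos(math.pi * progress))


def pretrain(model: torch.nn.Module, dataloader) -> None:
    """Bare-bones BF16 loop with AdamW(0.9, 0.95), weight decay 0.05, and grad clipping at 1.0."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=MAX_LR,
                                  betas=(0.9, 0.95), weight_decay=0.05)
    for step, batch in enumerate(dataloader):
        for group in optimizer.param_groups:      # manual schedule, equivalent to a LambdaLR
            group["lr"] = lr_at(step)
        with torch.autocast("cuda", dtype=torch.bfloat16):
            loss = model(**batch).loss            # assumes an HF-style causal-LM forward
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()
```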

Fine-Tuning and Evaluation

  • Supervised Fine-Tuning (SFT):
    • Fine-tuning is performed over approximately 4 epochs with a global batch size of 256 and a maximum learning rate of $2\times10^{-5}$.
    • Ablation experiments vary data compositions and highlight that maintaining a balanced fine-tuning data distribution (e.g., incorporating 20% English data) enhances performance on both Chinese benchmarks (CEVAL and CMMLU) and broader multilingual benchmarks (MMLU).
    • Notably, when additional English multiple-choice questions are included, the performance on MMLU improves from 26.75% to 30.82%.
  • Direct Preference Optimization (DPO):
    • The global batch size for this stage is 128, with a maximum learning rate of $5 \times 10^{-6}$ and a pref_beta setting of 0.1 (the corresponding loss is sketched below).
  • Evaluation Metrics and Comparative Performance:
    • Comparisons include Tiny-Llama-1.1B, Gemma-2b-it, and models of larger scale such as CT-LLM-SFT-2B, with Steel-LLM demonstrating a balanced performance trade-off between resource efficiency and benchmark accuracy.
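
The preference-optimization stage can be illustrated with the standard DPO objective below, using the reported beta of 0.1. This is a sketch of the textbook loss, not the project's trainer: sequence-level log-probability computation, the 128 global batch size, and the $5 \times 10^{-6}$ learning rate are handled outside this function.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss; each input is the summed log-prob of a response (shape: [batch])."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps  # preference under the policy
    ref_logratio = ref_chosen_logps - ref_rejected_logps           # preference under the frozen reference
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```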

Data Processing Pipeline and Transparency

  • Data Juicer Operators:
    • For code processing, additional operations such as link cleanup and targeted repetition filtering are applied (a generic illustration appears at the end of this section).
  • Openness:

A major emphasis is placed on reproducibility by fully releasing the training pipeline, intermediate checkpoints, model configurations, and detailed dataset descriptions. This transparency is intended to assist other researchers in replicating or extending the work, especially within resource-constrained environments.
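
Returning to the code-cleaning step mentioned above, the following is a rough, hypothetical illustration of link cleanup combined with a simple repetition filter. It is not Data-Juicer's actual operator API; the threshold and the repetition heuristic are placeholders.

```python
import re

URL_RE = re.compile(r"https?://\S+")


def clean_code_sample(text: str, max_repetition_ratio: float = 0.3) -> str | None:
    """Strip embedded links and drop samples dominated by repeated lines (illustrative only)."""
    text = URL_RE.sub("", text)
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return None                                   # nothing left after cleanup
    repetition = 1.0 - len(set(lines)) / len(lines)   # share of duplicated non-empty lines
    return text if repetition <= max_repetition_ratio else None
```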

In summary, the work presents a resource-efficient, transparent methodology for developing a Chinese-centric LLM with competitive performance on several benchmarks. By integrating architectural improvements such as Soft MOE within the Transformer framework, applying mixed-precision training, and detailing a full training pipeline along with comprehensive data processing strategies, the paper offers valuable insights and practical guidance for researchers building high-quality LLMs under limited computational resources.
