The paper "Predictable Scale: Part I — Optimal Hyperparameter Scaling Law in LLM Pretraining" introduces a new hyperparameter scaling law, termed Step Law, for pretraining LLMs. The authors posit that Step Law can be used as a plug-and-play tool for optimizing the learning rate and batch size in LLM pretraining.
The paper's primary claims and contributions include:
- Convexity of the Hyperparameter Loss Landscape: The research demonstrates that, with model size and data size held fixed, the loss landscape is convex with respect to the learning rate and batch size. This convexity implies the existence of an optimal hyperparameter plateau.
- Universal Hyperparameter Scaling Laws (Step Law): The paper introduces a universal and robust hyperparameter scaling law applicable across variations in model sparsity, training data distribution, and model shape. Step Law posits that the optimal learning rate, $\eta_{\mathrm{opt}}$, and batch size, $B_{\mathrm{opt}}$, follow power-law relationships with empirically fitted exponents:

  $$\eta_{\mathrm{opt}}(N, D) \propto N^{\alpha} D^{\beta}, \qquad B_{\mathrm{opt}}(D) \propto D^{\gamma}$$

  where:
  * $N$ is the number of non-embedding parameters in the model
  * $D$ is the dataset size in tokens.

  The scaling laws suggest that the optimal batch size primarily depends on the dataset size, while the optimal learning rate depends on both the model parameter count and the dataset size (a brief illustrative sketch follows this list).
- Transferability and Invariance Across Data Distributions and Model Architectures: The paper investigates the transferability of optimal hyperparameter scaling laws across different pretraining data distributions and model architectures. The findings suggest that Step Law remains highly general and robust across different corpus distributions, model architectures, and both dense and sparse (MoE) LLMs with varying sparsity ratios.
- Extensive Empirical Validation: The conclusions are supported by a large-scale empirical study involving:
- Experiments across 3,700 model configurations, training LLMs from scratch with dense and MoE architectures (varying sparsity ratios), data distributions, and hyperparameter settings.
- Compute consumption approaching one million NVIDIA H800 GPU hours, processing approximately 100 trillion tokens during training.
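To make the plug-and-play usage concrete, the following is a minimal sketch of evaluating Step-Law-style power laws for a given model and data scale. The coefficients and exponents below are illustrative placeholders, not the paper's fitted constants, and the function names are invented for this sketch.

```python
# Minimal sketch of applying Step-Law-style power laws to choose hyperparameters.
# The coefficients/exponents below are illustrative placeholders, NOT the paper's
# fitted values; the published fit would have to be substituted for real use.

def optimal_learning_rate(n_params: float, n_tokens: float,
                          c_eta: float, alpha: float, beta: float) -> float:
    """eta_opt(N, D) = c_eta * N**alpha * D**beta (power-law form)."""
    return c_eta * (n_params ** alpha) * (n_tokens ** beta)


def optimal_batch_size(n_tokens: float, c_b: float, gamma: float) -> float:
    """B_opt(D) = c_b * D**gamma -- depends primarily on dataset size."""
    return c_b * (n_tokens ** gamma)


if __name__ == "__main__":
    N = 1.0e9    # non-embedding parameters
    D = 1.0e11   # training tokens
    # Placeholder constants (not the paper's fit): LR shrinks with N, grows with D.
    eta = optimal_learning_rate(N, D, c_eta=1.0, alpha=-0.7, beta=0.3)
    bsz = optimal_batch_size(D, c_b=1.0, gamma=0.6)
    print(f"predicted peak LR ~ {eta:.3e}, batch size ~ {bsz:.3e} tokens")
```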
The paper compares Step Law with existing hyperparameter scaling approaches, including OpenAI Law, Microsoft Law, DeepSeek Law, Porian Law, MiniCPM Law, and MeiTuan Law. The comparison focuses on factors such as suitability for different data recipes, model sparsity, and relative error in loss prediction.
The paper uses the following notation:
- $\mathcal{L}$: Cross-entropy loss
- $D$: Dataset size in tokens
- $N$: Number of non-embedding parameters in the model
- $N_{\mathrm{total}}$: Total number of parameters in the model
- $C$: Compute budget in FLOPs
- $n_{\mathrm{layers}}$: Number of layers in the Transformer model
- $d_{\mathrm{ffn}}$: Dimension of the feed-forward network hidden layer in the Transformer
- $d_{\mathrm{model}}$: Hidden dimension of the Transformer model
- $n_{\mathrm{heads}}$: Number of attention heads in the Transformer model
- $\eta_{\mathrm{opt}}(N, D)$: Optimal peak learning rate for a given parameter count and dataset size
- $B_{\mathrm{opt}}(N, D)$: Optimal batch size (in tokens) for a given parameter count and dataset size
The paper details the experimental setup, including the dataset composition (web text, mathematical content, and code), Byte Pair Encoding (BPE) tokenizer, model architecture (RMSNorm, SwiGLU activation function, ALiBi positional encoding), and optimizer (AdamW). The learning rate schedule includes a linear warmup phase followed by a cosine decay.
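As a reference point for the schedule described above, here is a minimal sketch of linear warmup followed by cosine decay to a fixed final learning rate; the warmup length, total steps, and final-LR floor are illustrative assumptions rather than the paper's exact settings.

```python
import math

def lr_at_step(step: int, peak_lr: float, final_lr: float,
               warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to peak_lr, then cosine decay down to a fixed final_lr."""
    if step < warmup_steps:
        # Linear warmup phase.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay phase.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return final_lr + (peak_lr - final_lr) * cosine

# Example: a 2,000-step warmup over 100,000 total steps
# (step counts and learning rates chosen arbitrarily for illustration).
schedule = [lr_at_step(s, peak_lr=3e-4, final_lr=3e-5,
                       warmup_steps=2_000, total_steps=100_000)
            for s in range(0, 100_000, 10_000)]
```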
The ablation experiments validate the use of smoothed training loss as an unbiased estimate of validation loss and demonstrate the convexity of the loss landscape with respect to the learning rate and batch size. The authors also justify the use of a fixed final learning rate strategy.
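The "smoothed training loss" could be obtained with something as simple as an exponential moving average over per-step losses; the sketch below shows that idea, with the smoothing constant chosen arbitrarily (the paper's exact smoothing procedure may differ).

```python
def smooth_training_loss(losses, beta: float = 0.99):
    """Exponential moving average of per-step training losses.

    Illustrates using a smoothed training-loss curve as a low-variance proxy
    for validation loss; beta = 0.99 is an arbitrary illustrative choice.
    """
    smoothed, ema = [], None
    for loss in losses:
        ema = loss if ema is None else beta * ema + (1.0 - beta) * loss
        smoothed.append(ema)
    return smoothed
```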
The paper demonstrates the topological invariance of the hyperparameter scaling law across varied model shapes by conducting controlled experiments with different model shape combinations (number of layers, attention heads, feed-forward network dimensions). Additionally, the paper investigates the sparsity independence of the hyperparameter scaling law in MoE models across different sparsity levels and model shapes. The results show that Step Law consistently achieves a low relative prediction error across all sparsity levels. Finally, the paper assesses the robustness of Step Law across varied data distributions by designing three distinct data distributions: bilingual corpus, code integration, and code-dominant. The formula maintains predictive accuracy across all three distributions.
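One way to read the "relative prediction error" reported in these comparisons is as the gap between the loss reached with a law's predicted hyperparameters and the best loss found over the hyperparameter grid, normalized by the latter; a hedged sketch of that metric follows (the paper's exact definition may differ).

```python
def relative_prediction_error(loss_at_predicted_hparams: float,
                              best_observed_loss: float) -> float:
    """Relative gap between the loss reached with predicted hyperparameters
    and the best loss found over the hyperparameter grid."""
    return abs(loss_at_predicted_hparams - best_observed_loss) / best_observed_loss

# Example with made-up numbers: predicted hyperparameters reach loss 2.515
# versus a grid optimum of 2.500, i.e. a 0.6% relative error.
err = relative_prediction_error(2.515, 2.500)
```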
The authors acknowledge the limitations of their empirical approach and call for future work to develop a theoretical understanding of the observed power-law relationships.