The paper "Temporal Scaling Law for LLMs" explores the dynamics of loss function behavior in LLMs throughout the training process, introducing the concept of the Temporal Scaling Law. Unlike traditional scaling laws, which focus on static properties such as model size and dataset size, this work investigates the temporal dimension of training, specifically how loss evolves over time across different token positions in sequences.
Key contributions and findings of the paper include:
- Temporal Scaling Law: The researchers propose that, during the training of decoder-based generative LLMs, the loss at different token positions follows a reciprocal law that holds consistently across model scales and training stages. This reciprocal law can be captured mathematically by:

  L(i) = 1 / (a·i + b) + c

  Where:
  - L(i): Loss at token position i.
  - a, b, c: Empirically derived parameters that depend on model scale and training time.
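To make the reciprocal law concrete, here is a minimal sketch of fitting it to per-position losses. The fitting recipe (a grid search over the offset c combined with a linear least-squares solve for a and b, using the fact that 1/(L − c) = a·i + b is linear in i) is an illustrative assumption, not the paper's exact estimation procedure:

```python
import math

def reciprocal_law(i, a, b, c):
    """Per-position loss model from the paper: L(i) = 1 / (a*i + b) + c."""
    return 1.0 / (a * i + b) + c

def fit_reciprocal_law(positions, losses, c_grid):
    """Fit (a, b, c) to observed per-position losses.

    Illustrative approach (an assumption, not the paper's method): grid-search c,
    and for each candidate linearize 1/(L - c) = a*i + b and solve by least squares.
    """
    best = None
    for c in c_grid:
        ys = [1.0 / (L - c) for L in losses if L - c > 1e-9]
        if len(ys) < len(losses):   # c must stay below every observed loss
            continue
        n = len(positions)
        sx, sy = sum(positions), sum(ys)
        sxx = sum(x * x for x in positions)
        sxy = sum(x * y for x, y in zip(positions, ys))
        a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        b = (sy - a * sx) / n
        sse = sum((reciprocal_law(i, a, b, c) - L) ** 2
                  for i, L in zip(positions, losses))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c
```

On synthetic losses generated from known parameters, this recovers a, b, and c, which is the basic sanity check before fitting real per-position loss curves.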
- Temporal Patterns:
The evolution of the parameters a, b, and c is studied in detail over the course of training. The paper finds that, before a certain threshold in the number of trained tokens, one parameter exhibits a logarithmic relationship with the number of trained tokens and another a reciprocal relationship, while the remaining parameter closely follows the learning-rate decay schedule, suggesting a strong correlation with that aspect of training.
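Distinguishing a logarithmic trend from a reciprocal one in a parameter trajectory reduces to two linear least-squares fits on transformed inputs. The sketch below is a hypothetical diagnostic, not taken from the paper:

```python
import math

def linfit(xs, ys):
    """Ordinary least squares for y ~= p*x + q; returns (p, q, sse)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    p = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    q = (sy - p * sx) / n
    sse = sum((p * x + q - y) ** 2 for x, y in zip(xs, ys))
    return p, q, sse

def best_trend(tokens, values):
    """Compare a logarithmic trend v = p*log(N) + q against a reciprocal
    trend v = p/N + q for a parameter trajectory over trained tokens N,
    and return whichever fits with lower squared error."""
    log_fit = linfit([math.log(n) for n in tokens], values)
    rec_fit = linfit([1.0 / n for n in tokens], values)
    return ("log", log_fit) if log_fit[2] <= rec_fit[2] else ("reciprocal", rec_fit)
```

Applied to a parameter's values logged at successive checkpoints, the returned label indicates which of the two reported relationships better describes that parameter's dynamics.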
- Loss Prediction: The established temporal patterns allow future test loss to be predicted from an early portion of training, accurately forecasting the remainder of the training trajectory. Empirical tests show a marked improvement in predictive accuracy over baseline approaches such as those using exponential functions or simple reciprocal laws.
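A toy version of this forecasting idea: fit a reciprocal-with-offset curve L(N) = 1/(a·N + b) + c to loss values from early checkpoints, then extrapolate to a later token count. This is a simplified sketch under assumed functional form and fitting method, not the paper's exact procedure:

```python
import math

def predict_final_loss(tokens, losses, horizon, c_grid):
    """Fit L(N) = 1/(a*N + b) + c on early checkpoints (tokens in billions),
    then extrapolate the loss at `horizon` tokens.

    Fitting is a grid search over c plus a linear solve for a and b via the
    linearization 1/(L - c) = a*N + b (an illustrative choice)."""
    best = None
    for c in c_grid:
        pts = [(n, 1.0 / (L - c)) for n, L in zip(tokens, losses) if L - c > 1e-9]
        if len(pts) < len(losses):   # c must stay below every observed loss
            continue
        k = len(pts)
        sx = sum(p[0] for p in pts); sy = sum(p[1] for p in pts)
        sxx = sum(p[0] ** 2 for p in pts)
        sxy = sum(p[0] * p[1] for p in pts)
        a = (k * sxy - sx * sy) / (k * sxx - sx * sx)
        b = (sy - a * sx) / k
        sse = sum((1.0 / (a * n + b) + c - L) ** 2 for n, L in zip(tokens, losses))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return 1.0 / (a * horizon + b) + c
```

For example, fitting on checkpoints from the first 10B tokens of a run and extrapolating to 100B mirrors the paper's setup of forecasting the full trajectory from initial training data.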
- Uniform Learning Across Tokens: Despite initial disparities in loss across token positions, the paper observes that LLMs tend to learn uniformly across all token positions after an initial phase of training. This suggests that the existing paradigm of averaging losses across tokens without position-based weighting is a robust approach to LLM training, as re-weighting strategies do not confer performance benefits.
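Checking for uniform learning requires measuring the loss at each token position separately rather than averaging it away. A minimal sketch of that measurement (a standard stable softmax cross-entropy computed per position; the data layout here is an assumption for illustration):

```python
import math

def per_position_losses(logits, targets):
    """Cross-entropy at each token position separately (no averaging across
    positions). `logits` is a list of per-position score lists of length
    vocab_size; `targets` holds the gold token id for each position."""
    losses = []
    for pos_logits, tgt in zip(logits, targets):
        m = max(pos_logits)                      # stabilize the softmax
        log_z = m + math.log(sum(math.exp(x - m) for x in pos_logits))
        losses.append(log_z - pos_logits[tgt])   # -log softmax(target)
    return losses
```

Comparing these per-position values between two checkpoints shows whether the loss decrease is roughly equal at every position; averaging them recovers the standard unweighted training loss that the paper argues is already a sound default.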
- Verification through Experiments: The research uses two primary datasets: an in-distribution (ID) dataset drawn from the Pile and an out-of-distribution (OOD) dataset from PG-19. Models at various scales are trained to validate the reciprocal law and to predict future test loss using the derived temporal scaling law.
The paper effectively challenges the conventional understanding of scaling laws by introducing a temporal component, thereby enabling a nuanced analysis of LLM training dynamics that could inform more efficient resource utilization and training strategy development in large-scale model pre-training.