Temporal Scaling Law for Large Language Models (2404.17785v2)

Published 27 Apr 2024 in cs.CL

Abstract: Recently, LLMs have been widely adopted in a wide range of tasks, leading to increasing attention towards the research on how scaling LLMs affects their performance. Existing works, termed Scaling Laws, have discovered that the final test loss of LLMs scales as power-laws with model size, computational budget, and dataset size. However, the temporal change of the test loss of an LLM throughout its pre-training process remains unexplored, though it is valuable in many aspects, such as selecting better hyperparameters *directly* on the target LLM. In this paper, we propose the novel concept of Temporal Scaling Law, studying how the test loss of an LLM evolves as the training steps scale up. In contrast to modeling the test loss as a whole in a coarse-grained manner, we break it down and dive into the fine-grained test loss of each token position, and further develop a dynamic hyperbolic-law. Afterwards, we derive the much more precise temporal scaling law by studying the temporal patterns of the parameters in the dynamic hyperbolic-law. Results on both in-distribution (ID) and out-of-distribution (OOD) validation datasets demonstrate that our temporal scaling law accurately predicts the test loss of LLMs across training steps. Our temporal scaling law has broad practical applications. First, it enables direct and efficient hyperparameter selection on the target LLM, such as data mixture proportions. Secondly, viewing the LLM pre-training dynamics from the token position granularity provides some insights to enhance the understanding of LLM pre-training.

The paper "Temporal Scaling Law for LLMs" explores the dynamics of loss function behavior in LLMs throughout the training process, introducing the concept of the Temporal Scaling Law. Unlike traditional scaling laws, which focus on static properties such as model size and dataset size, this work investigates the temporal dimension of training, specifically how loss evolves over time across different token positions in sequences.

Key contributions and findings of the paper include:

  1. Temporal Scaling Law: The researchers propose that, during the training of decoder-based generative LLMs, the loss at different token positions follows a reciprocal-law, which is consistent across various scales and training stages. This reciprocal-law can be mathematically captured by:

$$\mathcal{L}_i = \frac{a_0}{1 + a_1(i-1)} + a_2$$

where $\mathcal{L}_i$ is the loss at token position $i$, and $a_0$, $a_1$, and $a_2$ are empirically derived parameters that depend on model scale and training time.
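To make the reciprocal-law concrete, here is a minimal numerical sketch; the parameter values below are illustrative placeholders, not values reported in the paper:

```python
import numpy as np

def positional_loss(i, a0, a1, a2):
    """Reciprocal-law test loss at token position i (1-indexed)."""
    return a0 / (1.0 + a1 * (i - 1)) + a2

# Illustrative (hypothetical) parameter values:
a0, a1, a2 = 2.5, 0.08, 1.9

positions = np.arange(1, 2049)  # e.g., a 2048-token context window
losses = positional_loss(positions, a0, a1, a2)
# Loss is highest at the first position (a0 + a2) and decays toward a2
# as the position index grows, matching the reciprocal form above.
```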

  2. Temporal Patterns:

The evolution of the parameters $a_0$, $a_1$, and $a_2$ is studied over training time. It is found that:
   - Before a certain threshold in training, $a_0$ follows a logarithmic relationship, and $a_1$ a reciprocal relationship, with the number of trained tokens.
   - The parameter $a_2$ closely follows the learning-rate decay pattern, suggesting a strong correlation with the learning-rate schedule.
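A rough sketch of how these temporal trends could be parameterized follows; the functional forms mirror the description above, but the coefficient names (c0, c1, d0, d1, e0, e1) and the cosine learning-rate schedule are assumptions made for illustration:

```python
import numpy as np

def a0_trend(tokens, c0, c1):
    # Assumed logarithmic trend for a0 before the training threshold.
    return c0 + c1 * np.log(tokens)

def a1_trend(tokens, d0, d1):
    # Assumed reciprocal trend for a1 before the training threshold.
    return d0 + d1 / tokens

def a2_trend(tokens, total_tokens, lr_max, lr_min, e0, e1):
    # a2 is modeled as tracking the learning-rate decay; a cosine
    # schedule is used here purely as an illustrative assumption.
    lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * tokens / total_tokens))
    return e0 + e1 * lr
```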

  3. Loss Prediction: The established temporal scaling patterns allow future test loss to be predicted from early-training measurements, accurately forecasting the training trajectory (see the fitting sketch after this list). Empirical tests indicate a marked improvement in predictive accuracy over baselines such as exponential functions or simple reciprocal laws.
  4. Uniform Learning Across Tokens: Despite initial disparities in loss across token positions, the paper observes that LLMs tend to learn uniformly across all positions after an initial phase of training. This suggests that the existing paradigm of averaging losses across tokens without position-based weighting is a robust approach to LLM training, as re-weighting strategies confer no performance benefit (a weighting sketch also follows the list).
  5. Verification through Experiments: The research uses two primary validation datasets: an in-distribution (ID) dataset drawn from the Pile and an out-of-distribution (OOD) dataset from PG-19. Models at various scales are trained to validate the reciprocal-law and to predict future test loss using the derived temporal scaling law.
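Below is a hedged sketch of the loss-prediction idea from point 3: fit a simple reciprocal trend to early-training losses, then extrapolate. The functional form, checkpoint data, and initial guesses are made up for illustration and are not the paper's exact fitting procedure:

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_trend(tokens_b, b0, b1, b2):
    """Simple reciprocal trend of average test loss vs. tokens trained (billions)."""
    return b0 / (1.0 + b1 * tokens_b) + b2

# Hypothetical early checkpoints: (billions of tokens trained, measured test loss)
early_tokens = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
early_loss   = np.array([4.10, 3.62, 3.21, 2.94, 2.75])  # made-up numbers

params, _ = curve_fit(loss_trend, early_tokens, early_loss, p0=[2.0, 0.5, 2.5])
print("predicted loss at 100B tokens:", loss_trend(100.0, *params))
```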

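To make the re-weighting comparison in point 4 concrete, here is a minimal sketch contrasting standard token-averaged loss aggregation with a hypothetical position-weighted variant (the weighting scheme is invented for illustration):

```python
import numpy as np

def uniform_loss(token_losses):
    # Standard practice: average per-token losses with equal weight.
    return token_losses.mean()

def position_weighted_loss(token_losses, weights):
    # A hypothetical re-weighting over token positions; the paper finds
    # such schemes confer no benefit over uniform averaging.
    weights = weights / weights.sum()
    return (weights * token_losses).sum()

# Example: down-weight early positions, which start with higher loss.
token_losses = np.random.rand(2048) + 2.0  # stand-in per-position losses
weights = np.linspace(0.5, 1.5, num=2048)  # hypothetical position weights
print(uniform_loss(token_losses), position_weighted_loss(token_losses, weights))
```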
The paper effectively challenges the conventional understanding of scaling laws by introducing a temporal component, thereby enabling a nuanced analysis of LLM training dynamics that could inform more efficient resource utilization and training strategy development in large-scale model pre-training.

Authors (9)
  1. Yizhe Xiong (14 papers)
  2. Xiansheng Chen (1 paper)
  3. Xin Ye (47 papers)
  4. Hui Chen (298 papers)
  5. Zijia Lin (43 papers)
  6. Haoran Lian (6 papers)
  7. Jianwei Niu (42 papers)
  8. Guiguang Ding (79 papers)
  9. Zhenpeng Su (17 papers)
Citations (5)