Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler

Published 23 Aug 2024 in cs.CL, cs.AI, and cs.LG | arXiv:2408.13359v2

Abstract: Finding the optimal learning rate for LLM pretraining is a challenging task. This is not only because there is a complicated correlation between learning rate, batch size, number of training tokens, model size, and other hyperparameters but also because it is prohibitively expensive to perform a hyperparameter search for LLMs with billions or trillions of parameters. Recent studies propose using small proxy models and small corpus to perform hyperparameter searches and transposing the optimal parameters to large models and large corpus. While the zero-shot transferability is theoretically and empirically proven for model size related hyperparameters, like depth and width, the zero-shot transfer from small corpus to large corpus is underexplored. In this paper, we study the correlation between optimal learning rate, batch size, and number of training tokens for the recently proposed WSD scheduler. After thousands of small experiments, we found a power-law relationship between variables and demonstrated its transferability across model sizes. Based on the observation, we propose a new learning rate scheduler, Power scheduler, that is agnostic about the number of training tokens and batch size. The experiment shows that combining the Power scheduler with Maximum Update Parameterization (muP) can consistently achieve impressive performance with one set of hyperparameters regardless of the number of training tokens, batch size, model size, and even model architecture. Our 3B dense and MoE models trained with the Power scheduler achieve comparable performance as state-of-the-art small LLMs. We open-source these pretrained models at https://ibm.biz/BdKhLa.

Citations (4)

Summary

  • The paper investigates learning rate dynamics, identifying dependencies on batch size and token count, and proposes the Power scheduler to overcome these limitations.
  • The paper shows the Power scheduler achieves performance comparable to or exceeding traditional methods across various model architectures and sizes, demonstrating zero-shot transferability.
  • The paper suggests the Power scheduler improves LLM training efficiency and flexibility by reducing dependence on specific batch sizes and token counts, enabling easier scaling.

An Analysis of "Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler"

The paper "Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler" presents an innovative approach to learning rate scheduling in the pretraining of LLMs. The study introduces the Power scheduler, a method designed to be robust against variations in batch size and number of training tokens, addressing constraints in existing schedulers like cosine and Warmup-Stable-Decay (WSD).

Key Contributions and Methodology

  1. Investigation of Learning Rate Relationships: The paper begins by examining the intricate relationship between optimal learning rates, batch sizes, and numbers of training tokens. Through extensive experimentation, the authors identify a power-law correlation, showing that the optimal learning rate for the WSD scheduler depends on both the token count and the batch size. This result challenges the assumption that hyperparameters tuned on a small training corpus transfer unchanged to larger token budgets.
  2. Proposal of the Power Scheduler: Based on their findings, the authors propose the Power scheduler, which modulates the learning rate using a power-law function. This scheduler is defined as:

    $$\eta_{\text{power}}(n) = \min\left(\eta_{\text{max}},\; \beta \cdot a\, n^{b}\right)$$

    where $\eta_{\text{max}}$ serves as an upper bound, $\beta$ is the batch size, $n$ is the number of tokens trained thus far, and $a$ and $b$ are hyperparameters derived from the observed power-law relationship; a minimal code sketch follows this list.

  3. Robust Performance Across Models: The Power scheduler is shown to perform effectively across various model architectures, sizes, and training regimes. The paper presents controlled experiments with 1B parameter dense and mixture-of-experts (MoE) models, demonstrating that the Power scheduler can achieve or exceed the performance of traditional cosine and WSD schedulers.
  4. Zero-shot Transferability: An additional advantage of the Power scheduler, as shown in the study, is its zero-shot hyperparameter transferability, which enables effective scaling from proxy models to large-scale training tasks without requiring extensive hyperparameter tuning.
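
To make the schedule concrete, here is a minimal Python sketch of the capped power-law rule defined in item 2. Only the form min(η_max, β · a · n^b) comes from the formula above; the function name power_lr, the optional linear warmup phase, and every numeric value in the usage example are illustrative assumptions rather than settings reported by the authors. The example assumes a negative exponent b, so the rate decays as more tokens are consumed.

```python
# Minimal sketch of a Power-style learning rate rule (not the authors' code).
# Assumptions beyond the formula in the text: the linear warmup phase, the
# function/parameter names, and all example values below are hypothetical.

def power_lr(tokens_seen: float,
             batch_size: float,
             a: float,
             b: float,
             eta_max: float,
             warmup_tokens: float = 0.0) -> float:
    """Learning rate after `tokens_seen` tokens: a power law in n, capped at eta_max."""
    if warmup_tokens > 0 and tokens_seen < warmup_tokens:
        # Assumed linear warmup toward the value the schedule takes at the end
        # of warmup; the summary above does not specify how warmup is handled.
        target = min(eta_max, batch_size * a * warmup_tokens ** b)
        return target * tokens_seen / warmup_tokens
    n = max(tokens_seen, 1.0)  # guard against raising 0 to a negative power
    return min(eta_max, batch_size * a * n ** b)  # eta = min(eta_max, beta * a * n^b)


if __name__ == "__main__":
    # Hypothetical hyperparameters: with b < 0 the rate decays once the
    # eta_max cap stops binding.
    for n in (1e8, 1e9, 1e10, 1e11):
        lr = power_lr(n, batch_size=1024, a=1e-2, b=-0.5, eta_max=5e-4)
        print(f"{n:.0e} tokens -> lr = {lr:.3e}")
```

In practice such a function would be evaluated each optimizer step from the running token count; the full scheduler in the paper may additionally include warmup and a final decay phase in the spirit of WSD, which this sketch does not attempt to reproduce.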

Experimental Verification

The authors provide robust experimental support for their claims through a series of controlled experiments covering both 1B and larger 3B models. In these experiments, models trained with the Power scheduler frequently achieve results comparable to or surpassing other state-of-the-art models. The evaluations cover perplexity as well as downstream benchmarks such as ARC, BoolQ, and others, where the Power scheduler demonstrates tangible improvements in generalization.

Implications and Future Directions

The introduction of the Power scheduler unlocks new potential in the training of LLMs by reducing the dependence on rigid hyperparameter specifications. This advancement holds significant implications for the efficiency and flexibility of LLM training, particularly in settings constrained by computational budget and pre-defined training step counts.

Furthermore, the scheduler's agnosticism to batch size and token count suggests promising pathways for its application in dynamic training environments where model scales and data distributions evolve over time. One potential extension of this work could focus on adapting the Power scheduler for parallel and distributed training frameworks, further enhancing its applicability in diverse computational environments.

Conclusion

In summary, the paper contributes a novel learning rate scheduling framework that holds potential to simplify and enhance the training of LLMs. By demonstrating the power-law relationships within learning rate dynamics, and leveraging this insight to construct the Power scheduler, this research offers a strategically different paradigm with significant implications for both theoretical and applied AI research. As the field progresses towards more adaptable and efficient training methodologies, the ideas and results documented here will likely serve as a valuable reference for future advancements.
