Small-scale proxies for large-scale Transformer training instabilities (2309.14322v2)

Published 25 Sep 2023 in cs.LG

Abstract: Teams that have trained large Transformer-based models have reported training instabilities at large scale that did not appear when training with the same hyperparameters at smaller scales. Although the causes of such instabilities are of scientific interest, the amount of resources required to reproduce them has made investigation difficult. In this work, we seek ways to reproduce and study training stability and instability at smaller scales. First, we focus on two sources of training instability described in previous work: the growth of logits in attention layers (Dehghani et al., 2023) and divergence of the output logits from the log probabilities (Chowdhery et al., 2022). By measuring the relationship between learning rate and loss across scales, we show that these instabilities also appear in small models when training at high learning rates, and that mitigations previously employed at large scales are equally effective in this regime. This prompts us to investigate the extent to which other known optimizer and model interventions influence the sensitivity of the final loss to changes in the learning rate. To this end, we study methods such as warm-up, weight decay, and the $\mu$Param (Yang et al., 2022), and combine techniques to train small models that achieve similar losses across orders of magnitude of learning rate variation. Finally, to conclude our exploration we study two cases where instabilities can be predicted before they emerge by examining the scaling behavior of model activation and gradient norms.

Authors (16)
  1. Mitchell Wortsman (29 papers)
  2. Peter J. Liu (30 papers)
  3. Lechao Xiao (28 papers)
  4. Katie Everett (9 papers)
  5. Alex Alemi (9 papers)
  6. Ben Adlam (25 papers)
  7. John D. Co-Reyes (16 papers)
  8. Izzeddin Gur (23 papers)
  9. Abhishek Kumar (172 papers)
  10. Roman Novak (22 papers)
  11. Jeffrey Pennington (45 papers)
  12. Kelvin Xu (25 papers)
  13. Jaehoon Lee (62 papers)
  14. Justin Gilmer (39 papers)
  15. Simon Kornblith (53 papers)
  16. Jascha Sohl-Dickstein (88 papers)
Citations (55)

Summary

Evaluating Small-scale Proxies for Large-scale Transformer Training Instabilities

The rapid advancement of transformer models has led to significant achievements in areas such as natural language processing and computer vision. However, training these large models often encounters challenges, including training instabilities that can prevent convergence to optimal performance. The paper "Small-scale proxies for large-scale Transformer training instabilities" addresses this issue by providing methodologies to reproduce and study, in smaller models, the instabilities observed in large-scale transformer training.

Key Findings

The authors identify two significant sources of instability: the growth of logits in attention layers and the divergence of the output logits from the log probabilities. These instabilities, typically observed in large transformers, also manifest in smaller models trained at high learning rates. The paper shows that interventions used at scale, such as qk-layernorm and z-loss regularization, are equally effective at mitigating these instabilities in smaller models.
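
To make these mitigations concrete, the following is a minimal JAX sketch of qk-layernorm and a z-loss auxiliary term. The shapes, the bare layer-norm helper (no learned scale or bias), and the 1e-4 z-loss coefficient are illustrative assumptions rather than the paper's exact implementation.

```python
import jax
import jax.numpy as jnp

def layernorm(x, eps=1e-6):
    # Normalize over the feature axis (learned scale/bias omitted for brevity).
    mean = jnp.mean(x, axis=-1, keepdims=True)
    var = jnp.var(x, axis=-1, keepdims=True)
    return (x - mean) / jnp.sqrt(var + eps)

def qk_layernorm_attention(q, k, v):
    # qk-layernorm: normalize queries and keys before the dot product so the
    # attention logits cannot grow without bound.
    q, k = layernorm(q), layernorm(k)
    logits = jnp.einsum("...qd,...kd->...qk", q, k) / jnp.sqrt(q.shape[-1])
    return jnp.einsum("...qk,...kd->...qd", jax.nn.softmax(logits), v)

def cross_entropy_with_z_loss(logits, labels, z_coeff=1e-4):
    # z-loss: penalize log^2 of the softmax normalizer so the output logits
    # stay close to log-probabilities instead of drifting.
    log_z = jax.nn.logsumexp(logits, axis=-1)
    log_probs = jnp.take_along_axis(
        jax.nn.log_softmax(logits), labels[..., None], axis=-1).squeeze(-1)
    return jnp.mean(-log_probs + z_coeff * log_z ** 2)
```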

The paper introduces a metric termed Learning Rate (LR) Sensitivity, which quantifies the expected deviation from the optimal loss when a model is trained across a range of learning rates. LR Sensitivity summarizes how stable a training configuration is, providing a signal that can guide hyperparameter tuning toward reliable convergence.
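
The sketch below illustrates one way such a sensitivity statistic could be computed from a learning-rate sweep. Clipping diverged runs at the loss of an untrained model and taking a plain mean over the sampled learning rates are simplifying assumptions, not necessarily the paper's exact formulation.

```python
import jax.numpy as jnp

def lr_sensitivity(final_losses, init_loss):
    # final_losses: final eval loss for each learning rate in a sweep
    # (assumed sampled log-uniformly over the range of interest).
    # Diverged runs are clipped to the loss at initialization so they do not
    # dominate the average deviation from the best loss in the sweep.
    clipped = jnp.minimum(jnp.asarray(final_losses), init_loss)
    return jnp.mean(clipped - jnp.min(clipped))
```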

Methodological Approach

The authors employ a rigorous experimental setup, training small transformer models modeled on the widely recognized GPT-2 architecture. To systematically explore the sources of instability, they measure the relationship between learning rate and loss across models of varying scale. This involves experiments with components including warm-up schedules, weight decay, and parameterization techniques such as μParam.

The experimental evidence suggests that longer warm-up periods can reduce LR Sensitivity, thus allowing models to train more stably at higher learning rates. Similarly, decoupled weight decay, as recommended by previous literature, contributes significantly to improving model stability across learning rate variations.
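
A minimal optax-style sketch of these two ingredients is given below, assuming a linear warm-up into a cosine decay and AdamW-style decoupled weight decay; the schedule lengths and coefficients are placeholders rather than the paper's settings.

```python
import optax

def make_optimizer(peak_lr=1e-2, warmup_steps=1_000, total_steps=10_000,
                   weight_decay=1e-4):
    # Linear warm-up to peak_lr, then cosine decay over the remaining steps;
    # a longer warm-up simply means a larger warmup_steps relative to
    # total_steps.
    schedule = optax.warmup_cosine_decay_schedule(
        init_value=0.0,
        peak_value=peak_lr,
        warmup_steps=warmup_steps,
        decay_steps=total_steps,
    )
    # adamw applies weight decay outside the adaptive gradient normalization
    # (the decoupled formulation); implementations differ on whether the decay
    # term is also scaled by the learning-rate schedule.
    return optax.adamw(learning_rate=schedule, weight_decay=weight_decay)
```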

Implications and Future Directions

The findings of this paper provide substantial contributions to the understanding and mitigation of large-scale transformer instabilities. By demonstrating that small-scale models can serve as effective proxies for their larger counterparts, the paper opens up new avenues for investigating training instabilities without the prohibitive costs of large-scale model training.

The implications of this research extend beyond the practical, offering theoretical insight into the mechanics of transformer training. Specifically, analyzing how quantities such as activation and gradient norms scale with model size provides a way to predict instabilities before they emerge, and suggests that researchers may need to reconsider default hyperparameter settings, particularly the AdamW epsilon hyperparameter.
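
To make the epsilon point concrete, the sketch below shows where epsilon enters a standard AdamW-style update, together with a simple diagnostic for how many parameters have a second-moment estimate near or below it. Both the update form and the diagnostic are illustrative assumptions, not a procedure taken from the paper.

```python
import jax.numpy as jnp

def adamw_update(param, m_hat, v_hat, lr, weight_decay, eps=1e-8):
    # AdamW-style update: eps sits in the denominator, so when the gradient
    # RMS sqrt(v_hat) shrinks toward eps, the adaptive step is attenuated.
    return param - lr * (m_hat / (jnp.sqrt(v_hat) + eps) + weight_decay * param)

def fraction_eps_dominated(v_hat, eps=1e-8):
    # Fraction of parameters whose second-moment estimate is so small that eps
    # dominates the denominator; a rising fraction as models scale is the kind
    # of trend that can foreshadow trouble with the default eps.
    return jnp.mean(jnp.sqrt(v_hat) < eps)
```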

The paper points to several future research directions, including the study of additional optimizer and model interventions to further understand their influence on training stability. It also motivates the development of parameter-free methods that could reduce the need for extensive hyperparameter tuning.

In summary, the research provides a significant step forward in addressing issues of transformer training stability through a methodical examination of small-scale models. The implications of these findings suggest both immediate applications for model training and broader theoretical explorations into the behaviors of complex neural architectures.
