
Methods of improving LLM training stability

Published 22 Oct 2024 in cs.CL and cs.LG | (2410.16682v1)

Abstract: Training stability of large language models (LLMs) is an important research topic. Reproducing training instabilities can be costly, so we use a small LLM with 830M parameters and experiment with higher learning rates to force models to diverge. One source of training instability is the growth of logits in attention layers. We extend the focus of previous work and look not only at the magnitude of the logits but at all outputs of linear layers in the Transformer block. We observe that with a high learning rate the L2 norm of all linear layer outputs can grow with each training step until the model diverges. Specifically, we observe that the QKV, Proj and FC2 layers show the largest growth in output magnitude. This prompts us to explore several options: 1) apply layer normalization not only after the QK layers but also after the Proj and FC2 layers; 2) apply layer normalization after the QKV layer (and remove pre-normalization); 3) apply QK layer normalization together with softmax capping. We show that with the last two methods we can increase the learning rate by 1.5x (without model divergence) compared to an approach based on QK layer normalization alone. We also observe significant perplexity improvements for all three methods over the baseline model.

Summary

  • The paper proposes extending layer normalization and softmax modifications to manage divergence in Transformer-based LLMs.
  • It demonstrates that techniques like QK_norm_cap and QKV_norm enable a 1.5x increase in learning rates without training divergence.
  • Experiments on an 830M-parameter model show significant perplexity improvements, validating the effectiveness of the proposed methods.

Improving LLM Training Stability

The paper "Methods of improving LLM training stability" explores strategies to enhance the training stability of LLMs by investigating the divergence issues associated with high learning rates. The authors extend prior work to analyze the magnitude growth in linear layer outputs within Transformer blocks, proposing novel methods to improve LLM training stability. Using a Transformer with 830M parameters, they demonstrate techniques that allow for increased learning rates without divergence, yielding significant perplexity improvements compared to baseline models.

Background and Motivation

Training LLMs stably is challenging because optimization can become unstable: increasing the learning rate often leads to divergence, driven primarily by the growth of logits in the attention layers. Prior research uses small-scale models to investigate these instability mechanisms and applies remedies such as layer normalization (LN) after the QK layers. The paper extends this line of work by analyzing the outputs of all linear layers in the Transformer block to identify the specific operations that cause divergence.
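To make the QK-normalization idea concrete, here is a minimal NumPy sketch (not the authors' implementation; the shapes, scale factor, and random data are purely illustrative) showing how normalizing queries and keys bounds the attention logits:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last dimension to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Hypothetical shapes: (seq_len, head_dim); large scale mimics logit growth.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8)) * 10.0
k = rng.normal(size=(4, 8)) * 10.0

# QK layer normalization: normalize queries and keys before the dot product,
# which bounds the magnitude of the attention logits.
logits_raw = q @ k.T
logits_norm = layer_norm(q) @ layer_norm(k).T

print("raw max |logit|: ", np.abs(logits_raw).max())
print("norm max |logit|:", np.abs(logits_norm).max())
```

By the Cauchy-Schwarz inequality, each normalized logit is bounded in magnitude by the head dimension (here 8), whereas the raw logits grow without limit as the activations grow.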

Experimental Observations

The experimental setup reproduces the conditions under which LLMs diverge by training a smaller model with 830M parameters at higher learning rates. Under these conditions, the L2 norms of the linear layer outputs (especially those of the QKV, Proj, and FC2 layers) grow markedly and contribute significantly to instability. This growth correlates with divergence through its unintended impact on softmax behavior, which leads to gradient explosion.

Figure 1: Transformer Block of baseline model.
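The diagnostic behind this observation can be sketched as a simple monitoring step: record the L2 norm of each linear layer's output at every training step and watch for sustained growth. A minimal illustration (layer names follow the paper's QKV/Proj/FC2 terminology; the activation shapes and values are made up):

```python
import numpy as np

def output_l2_norms(activations):
    """Map each layer name to the L2 (Frobenius) norm of its output."""
    return {name: float(np.linalg.norm(x)) for name, x in activations.items()}

# Hypothetical activations for one Transformer block at a single training step.
rng = np.random.default_rng(1)
acts = {
    "qkv":  rng.normal(size=(16, 3 * 64)),
    "proj": rng.normal(size=(16, 64)),
    "fc2":  rng.normal(size=(16, 256)),
}
norms = output_l2_norms(acts)
for name, n in norms.items():
    print(f"{name}: {n:.2f}")
```

In practice one would log these norms across training steps; a norm that increases step after step in any of these layers is the early-warning signal for the divergence the paper describes.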

Proposed Methods

The paper introduces several strategies to mitigate training instability by targeting output magnitude control:

  1. Layer Normalization Strategies: Extending LN beyond QK layers to Proj and FC2 layers, and modifying its application post-QKV layer to prevent redundancy in normalization steps.
  2. Softmax-Centric Techniques:
    • Softmax Temperature (soft_temp) and Softmax Capping (soft_cap): Scaling the attention logits, or squashing them with a bounded function, before the softmax so their magnitude stays controlled.
    • Softmax Clipping (soft_clip): Imposing limits on softmax values to reduce sensitivity to extreme logits.
  3. Optimized Norm Applications:
    • Combination of LN with Capping (QK_norm_cap): Integrating LN on QK with softmax capping to achieve dual control at critical computation stages.
    • Focused QKV Normalization (QKV_norm): Applying LN selectively after the QKV operation, simplifying the stabilization mechanism.

      Figure 2: Example of training loss with learning rate (LR), as a function of training iteration.
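Of these, softmax capping is easy to illustrate: the attention logits are squashed through a scaled tanh so they can never exceed a fixed cap, no matter how large the QK products become. A hedged NumPy sketch (the cap value of 30 is illustrative, not necessarily the paper's setting):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def soft_cap(logits, cap=30.0):
    """Softmax capping: squash logits into (-cap, cap) with a scaled tanh."""
    return cap * np.tanh(logits / cap)

# Extreme logits that would otherwise saturate the softmax.
logits = np.array([[5.0, 500.0, -300.0]])
capped = soft_cap(logits, cap=30.0)
probs = softmax(capped)
print(capped)
print(probs)
```

Because tanh is smooth, capping keeps gradients well defined while bounding the logits; in the QK_norm_cap variant this capping is combined with LN on the Q and K projections for dual control.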

Results and Analysis

Increased learning rates were tested across a range of models employing the aforementioned strategies. Specifically, both QK_norm_cap and QKV_norm allowed for a 1.5x increase in learning rates without encountering training divergence, a significant improvement over methods limited to LN application on QK layers alone.

Additionally, practical metrics such as perplexity evidenced substantial improvements with these modified models, highlighting the effectiveness of the normalization and capping strategies in promoting stable, efficient LLM training.
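As a reminder of how such perplexity numbers relate to the training objective, perplexity is the exponential of the mean per-token cross-entropy loss, so even modest loss reductions show up as visible perplexity gains. A tiny sketch with made-up loss values (not figures from the paper):

```python
import math

def perplexity(mean_nll):
    """Perplexity is exp of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll)

# Illustrative only: a loss drop from 2.30 to 2.25 nats per token
# lowers perplexity from about 9.97 to about 9.49.
print(perplexity(2.30))
print(perplexity(2.25))
```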

Conclusion

The research presented solidifies the importance of strategic normalization and output magnitude controls in managing LLM training stability. By innovating beyond established LN applications, the authors have demonstrated methods that allow high learning rates without sacrificing model convergence. Future exploration will extend these principles to larger and more complex models, advancing this work's foundational insights into scalable LLM training mechanisms.
