
Scalable-Softmax Is Superior for Attention

Published 31 Jan 2025 in cs.CL, cs.AI, and cs.LG (arXiv:2501.19399v1)

Abstract: The maximum element of the vector output by the Softmax function approaches zero as the input vector size increases. Transformer-based LLMs rely on Softmax to compute attention scores, causing the attention distribution to flatten as the context size grows. This reduces the model's ability to prioritize key information effectively and potentially limits its length generalization. To address this problem, we propose Scalable-Softmax (SSMax), which replaces Softmax in scenarios where the input vector size varies. SSMax can be seamlessly integrated into existing Transformer-based architectures. Experimental results in language modeling show that models using SSMax not only achieve faster loss reduction during pretraining but also significantly improve performance in long contexts and key information retrieval. Furthermore, an analysis of attention scores reveals that SSMax enables the model to focus attention on key information even in long contexts. Additionally, although models that use SSMax from the beginning of pretraining achieve better length generalization, those that have already started pretraining can still gain some of this ability by replacing Softmax in the attention layers with SSMax, either during or after pretraining.

Summary

  • The paper introduces Scalable-Softmax (SSMax) to mitigate attention fading in Transformers, enhancing length generalization over extended contexts.
  • It demonstrates mathematically and experimentally that SSMax reduces training loss and improves key token retrieval in long sequences.
  • SSMax integrates with Transformer architectures through minimal modifications, offering efficiency gains and better performance in extended-context scenarios.

Introduction

The paper "Scalable-Softmax Is Superior for Attention" (2501.19399) investigates an important challenge in Transformer-based LLMs known as length generalization: the ability to handle context sizes longer than those used during training. Traditional attention mechanisms use Softmax to compute attention scores, which flattens the attention distribution as context size grows and can thereby limit length generalization. The proposed solution, Scalable-Softmax (SSMax), offers a scalable alternative to Softmax designed to maintain effective attention focus across variable input vector sizes, enhancing length generalization in Transformer architectures.

The significance of the problem is underscored by the quadratically growing computational and memory requirements for Transformer training, which impose practical limitations on context sizes. Current approaches to length generalization include improving positional encoding methods, adopting sparse attention mechanisms, further training on longer contexts after modifying positional encodings, and enhancing attention mechanisms. The focus here is on enhancing attention mechanisms by introducing SSMax.

Scalable-Softmax (SSMax) Design

SSMax modifies the Softmax transformation to prevent attention fading as the input vector grows. Softmax maps a vector to a probability distribution whose elements sum to one; because the probability mass is spread over more elements as the vector grows, the distribution becomes increasingly flat, producing attention fading. SSMax addresses this by using the input vector size as the base of the exponential in its formulation, thereby maintaining focus on key tokens even as the input size grows.
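The fading effect described above can be demonstrated numerically. The following sketch (the vector values are illustrative, not taken from the paper) fixes one logit at 2.0 and shows the maximum Softmax probability shrinking toward zero as the vector grows:

```python
import math

def softmax(z):
    """Standard softmax over a list of floats (max subtracted for stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# One element fixed at 2.0, the rest at 0.0: as the vector grows,
# the maximum softmax probability decays toward zero (attention fading).
for n in [10, 100, 1000, 10000]:
    z = [2.0] + [0.0] * (n - 1)
    print(n, round(max(softmax(z)), 5))
```

The top probability here is $e^2 / (e^2 + n - 1)$, which vanishes as $n$ grows, regardless of how large the leading logit's margin is.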

The SSMax function is mathematically defined as:

$$z_i \mapsto \frac{n^{s z_i}}{\sum_{j=1}^{n} n^{s z_j}},$$

where $n$ is the input vector size and $s$ is a scalar scaling parameter. Unlike Softmax, this formulation adapts to varying input vector sizes, preserving effective attention allocation regardless of length.

Figure 1: Comparison of Softmax and SSMax, illustrating the issue of attention fading and the effectiveness of SSMax in preventing it.
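The definition above can be sketched directly in code. Since $n^{s z} = e^{s \ln(n)\, z}$, SSMax is equivalent to a standard softmax applied to logits scaled by $s \log n$. A minimal sketch (the default value of $s$ is illustrative; in the paper $s$ is a learnable parameter):

```python
import math

def ssmax(z, s=1.0):
    """Scalable-Softmax: z_i -> n^(s*z_i) / sum_j n^(s*z_j), with n = len(z).

    Because n^(s*z) = exp(s * ln(n) * z), this is a standard softmax over
    logits multiplied by s * log(n). The default s is illustrative only.
    """
    n = len(z)
    scale = s * math.log(n)
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(scale * (v - m)) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Unlike Softmax, the top probability does not fade as n grows:
for n in [10, 100, 10000]:
    z = [2.0] + [0.0] * (n - 1)
    print(n, round(max(ssmax(z)), 4))
```

With the same logits used in the Softmax sketch earlier, the maximum probability now stays close to one as the vector length increases.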

Experimental Evidence and Analysis

To motivate the SSMax design, the authors analyzed how attention scores should ideally depend on input vector size. The analysis revealed an approximately logarithmic relationship between input size and the scaling needed to keep the attention distribution focused, which is precisely the scaling SSMax applies by using $n$ as its exponential base.

Theoretical analysis quantifies the extent of attention fading under Softmax and clarifies how SSMax maintains attention on significant tokens: whenever $z_\mathrm{max} - z_\mathrm{2nd} > \frac{1}{s}$, the attention remains focused on the top-scoring token regardless of input size. This property allows dynamic adjustment of attention allocation based on the input values.

Figure 2: Relationship between $p_n$ and the input vector size $n$.
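The focus condition can be checked numerically. In this sketch (the gap of 1.5 and $s = 1$ are illustrative choices, not the paper's settings), the gap $z_\mathrm{max} - z_\mathrm{2nd} = 1.5$ exceeds $1/s = 1$, so the top probability should not fade as $n$ grows:

```python
import math

def ssmax(z, s):
    """SSMax as softmax over logits scaled by s * log(n), n = len(z)."""
    n = len(z)
    m = max(z)
    exps = [math.exp(s * math.log(n) * (v - m)) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Gap z_max - z_2nd = 1.5 > 1/s = 1.0: the top probability actually
# increases toward 1 as the input vector grows.
s = 1.0
for n in [100, 10000]:
    z = [1.5] + [0.0] * (n - 1)
    print(n, round(max(ssmax(z, s)), 4))
```

Here the top probability is $n^{1.5} / (n^{1.5} + n - 1)$, which tends to 1 as $n$ grows; with a gap below $1/s$ the same expression would instead decay.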

Implementation and Evaluation

SSMax integrates seamlessly into existing Transformer architectures: the only change required is to the attention-score computation, so compatibility with standard implementations and efficiency are preserved.
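A minimal single-query sketch of what the swap might look like in attention code (plain Python for clarity; `ssmax_attention` and its defaults are illustrative, not the paper's implementation):

```python
import math

def ssmax(z, s):
    """SSMax as softmax over logits scaled by s * log(n), n = len(z)."""
    n = len(z)
    m = max(z)
    exps = [math.exp(s * math.log(n) * (v - m)) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def ssmax_attention(q, keys, values, s=1.0):
    """Single-query scaled dot-product attention with SSMax in place of
    Softmax. q is a list of floats; keys and values are lists of rows.
    A minimal sketch: only the normalization step differs from standard
    attention, illustrating why the change to existing code is small."""
    d = len(q)
    logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = ssmax(logits, s)  # the one-line swap: ssmax instead of softmax
    dim_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]

# The query matches the first key, so the output leans toward the first value row.
out = ssmax_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
print([round(x, 3) for x in out])
```

In a real framework implementation, the same effect is obtained by multiplying the attention logits by $s \log n$ before the existing softmax call, leaving the rest of the attention layer untouched.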

Evaluations show that SSMax consistently lowers training loss across multiple model variants compared to standard Transformers. It also significantly improves long-context generalization and key information retrieval, as demonstrated by per-position test loss on sequences far longer than those seen during training.

Figure 3: Learning curves comparing the standard Transformer and SSMax variants.

Figure 4: Per-position test loss across context sizes up to 20,000.

Figure 5: Needle-In-A-Haystack test results showcasing retrieval accuracy.

Attention score analysis further confirms SSMax's advantage in focusing attention on key tokens, particularly in extended contexts, which underpins its improved key information retrieval.

Figure 6: Needle score distribution across attention layers and heads.

Figure 7: Top needle score distribution across models.

Conclusion

SSMax presents a compelling alternative to Softmax within Transformer attention layers, addressing attention fading and enhancing length generalization. Through rigorous experimentation, SSMax is shown to improve training efficiency, maintain lower test loss across extended contexts, and excel at key information retrieval in long contexts. While benefits are greatest when SSMax is incorporated from the beginning of training, switching from Softmax to SSMax during or after pretraining also yields noticeable improvements. Its adaptability positions SSMax as an attractive candidate for broader adoption. Future work may extend SSMax's application across diverse Transformer architectures and explore its integration into existing pretrained models.
