Theory, Analysis, and Best Practices for Sigmoid Self-Attention
(2409.04431v2)
Published 6 Sep 2024 in cs.LG
Abstract: Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between keys and queries. Recent work has explored alternatives to softmax attention in transformers, such as ReLU and sigmoid activations. In this work, we revisit sigmoid attention and conduct an in-depth theoretical and empirical analysis. Theoretically, we prove that transformers with sigmoid attention are universal function approximators and benefit from improved regularity compared to softmax attention. Through detailed empirical analysis, we identify stabilization of large initial attention norms during the early stages of training as a crucial factor for the successful training of models with sigmoid attention, outperforming prior attempts. We also introduce FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention yielding a 17% inference kernel speed-up over FLASHATTENTION2 on H100 GPUs. Experiments across language, vision, and speech show that properly normalized sigmoid attention matches the strong performance of softmax attention on a wide range of domains and scales, which previous attempts at sigmoid attention were unable to fully achieve. Our work unifies prior art and establishes best practices for sigmoid attention as a drop-in softmax replacement in transformers.
The paper establishes that using α = 1 prevents outputs from diverging or collapsing in sigmoid self-attention mechanisms.
It employs a sequence doubling argument to demonstrate that scaling with α = 1 maintains consistent token outputs across varying sequence lengths.
The study provides theoretical insights and practical guidance for designing robust neural network architectures with scalable sigmoid attention.
Linear Scaling for Sigmoid Attention?
Introduction
The paper "Linear scaling for sigmoid attention?" investigates a crucial aspect of attention mechanisms in neural networks, specifically focusing on the sigmoid attention variant. This analysis probes into the scaling parameter α and its optimal value to ensure the stability and convergence of the attention mechanism outputs as the sequence length n increases indefinitely.
Formulation and Objective
Given a sequence $X = (x_1, \ldots, x_n) \in \mathbb{R}^{n \times d}$, the sigmoid attention mechanism generates a new sequence $(y_1, \ldots, y_n)$. Each output token $y_i$ is defined as:

$$y_i = \frac{1}{n^{\alpha}} \sum_{j=1}^{n} \sigma\!\left(x_i^\top A x_j\right) W_v x_j$$
where:
$\alpha \ge 0$ is the scaling parameter,
$A = \frac{W_q^\top W_k}{\sqrt{d}}$ is a $d \times d$ matrix,
$W_v$ is the value matrix,
$\sigma$ is the sigmoid activation function.
The primary objective is to identify the optimal value of $\alpha$ such that the outputs $y_i$ neither diverge to infinity nor collapse to zero as $n$ approaches infinity.
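To make the formulation concrete, here is a minimal NumPy sketch of the computation above. The function name, shapes, and random inputs are illustrative choices, not the paper's implementation.

```python
import numpy as np

def sigmoid_attention(X, W_q, W_k, W_v, alpha=1.0):
    """Sigmoid attention with a 1/n^alpha normalization (illustrative sketch).

    X:             (n, d) sequence of tokens.
    W_q, W_k, W_v: (d, d) projection matrices.
    Returns the (n, d) output sequence y_1, ..., y_n.
    """
    n, d = X.shape
    A = W_q.T @ W_k / np.sqrt(d)                 # A = W_q^T W_k / sqrt(d)
    logits = X @ A @ X.T                         # logits[i, j] = x_i^T A x_j
    weights = 1.0 / (1.0 + np.exp(-logits))      # elementwise sigmoid, no row normalization
    return (weights @ (X @ W_v.T)) / n**alpha    # y_i = n^{-alpha} * sum_j sigma(.) W_v x_j

# Example usage with random inputs
rng = np.random.default_rng(0)
n, d = 16, 8
X = rng.standard_normal((n, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
Y = sigmoid_attention(X, W_q, W_k, W_v, alpha=1.0)
print(Y.shape)  # (16, 8)
```

Unlike softmax attention, the sigmoid weights are not normalized across the row, which is why the explicit $1/n^{\alpha}$ factor is needed to control the scale of the output.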
Analysis of α
The expression for $y_i$ is recast as a scaled expectation over the empirical distribution of the tokens:

$$y_i = n^{1-\alpha}\, \mathbb{E}_{z \sim p_n}\!\left[\sigma\!\left(x_i^\top A z\right) W_v z\right]$$

Here, $p_n$ denotes the uniform distribution over the sequence $x_1, \ldots, x_n$. By the law of large numbers, the expectation over $p_n$ converges to the expectation over the true distribution $p^*$ of the input tokens as $n$ tends to infinity, so that:

$$y_i \simeq \frac{1}{n^{\alpha - 1}} \int \sigma\!\left(x_i^\top A z\right) W_v z \; dp^*(z)$$
Three pivotal cases emerge based on the value of α:
If $\alpha > 1$: the prefactor $n^{1-\alpha} \to 0$, so the tokens collapse to zero.
If $\alpha < 1$: the prefactor $n^{1-\alpha} \to \infty$, so the tokens diverge to infinity.
If $\alpha = 1$: the prefactor equals 1, so the tokens converge to a non-trivial limit.
Thus, it is inferred that α=1 is the appropriate scaling value to ensure balanced and stable outputs.
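The three regimes can be illustrated with a small numerical sketch under the same assumptions as above (random Gaussian tokens and projections; this is not an experiment from the paper): the norm of $y_1$ grows with $n$ when $\alpha < 1$, shrinks toward zero when $\alpha > 1$, and stays roughly constant when $\alpha = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
A = W_q.T @ W_k / np.sqrt(d)

def output_norm(n, alpha):
    """Return ||y_1|| for a random length-n sequence and a given alpha."""
    X = rng.standard_normal((n, d))
    weights = 1.0 / (1.0 + np.exp(-(X[:1] @ A @ X.T)))  # sigmoid(x_1^T A x_j), shape (1, n)
    y1 = (weights @ (X @ W_v.T)) / n**alpha
    return np.linalg.norm(y1)

for alpha in (0.5, 1.0, 1.5):
    norms = [output_norm(n, alpha) for n in (10**2, 10**3, 10**4, 10**5)]
    print(alpha, [f"{v:.3g}" for v in norms])
# Expected trend: norms grow with n for alpha=0.5, shrink toward zero for alpha=1.5,
# and remain roughly constant for alpha=1.0.
```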
Sequence Doubling Argument
To further substantiate this conclusion, a sequence doubling argument is employed. Consider duplicating every token of the sequence:

$$\hat{X} = (x_1, x_1, \ldots, x_n, x_n) \in \mathbb{R}^{2n \times d}$$

Since each token is merely repeated, the sigmoid attention mechanism should yield the correspondingly duplicated output sequence $(y_1, y_1, \ldots, y_n, y_n)$: duplicating the inputs should not change the output associated with any token. Writing $\hat{y}_i$ for the output of either copy of $x_i$ in the doubled sequence,

$$\hat{y}_i = \frac{1}{(2n)^{\alpha}} \sum_{j=1}^{2n} \sigma\!\left(x_i^\top A \hat{x}_j\right) W_v \hat{x}_j = \frac{2}{(2n)^{\alpha}} \sum_{j=1}^{n} \sigma\!\left(x_i^\top A x_j\right) W_v x_j = 2^{1-\alpha}\, y_i,$$

so the desired property

$$\hat{y}_i = y_i$$

holds only when $\alpha = 1$, thereby affirming that $\alpha = 1$ is the appropriate scaling parameter.
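The doubling identity $\hat{y}_i = 2^{1-\alpha} y_i$ can also be checked numerically. The sketch below (again with arbitrary random inputs, chosen only for illustration) confirms that the ratio of output norms matches $2^{1-\alpha}$ and equals 1 only at $\alpha = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 8
X = rng.standard_normal((n, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
A = W_q.T @ W_k / np.sqrt(d)

def sigmoid_attention(X, alpha):
    m = X.shape[0]
    weights = 1.0 / (1.0 + np.exp(-(X @ A @ X.T)))
    return (weights @ (X @ W_v.T)) / m**alpha

X_doubled = np.repeat(X, 2, axis=0)  # (x_1, x_1, ..., x_n, x_n)
for alpha in (0.5, 1.0, 1.5):
    Y = sigmoid_attention(X, alpha)
    Y_hat = sigmoid_attention(X_doubled, alpha)[::2]  # output for the first copy of each token
    ratio = np.linalg.norm(Y_hat) / np.linalg.norm(Y)
    print(alpha, round(ratio, 3), round(2 ** (1 - alpha), 3))  # ratio matches 2^{1-alpha}
```

Only the α = 1 row yields a ratio of 1, i.e. the outputs are unchanged when the sequence is duplicated.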
Implications and Future Directions
The determination that α=1 governs the stable behavior of sigmoid attention has both theoretical and practical implications. Theoretically, this provides a clear guideline for the design of attention mechanisms. Practically, it facilitates the construction of more stable and efficient neural network architectures, especially for large-scale applications involving lengthy sequences.
Future research directions may include the exploration of scaling parameters within other variants of attention mechanisms and the investigation of how these findings can generalize across different neural network models. Furthermore, extensions to non-uniform and dynamic sequences may unveil additional insights into the robustness and versatility of attention mechanisms.