
Softplus Attention with Re-weighting (LSSAR)

Updated 15 December 2025
  • The paper introduces LSSAR, replacing Softmax with Softplus activation and integrating a dynamic, row-dependent scaling factor to maintain attention entropy.
  • It employs a novel re-weighting mechanism that sharpens attention distributions and mitigates numerical instability over exceedingly long sequence lengths.
  • Experimental results demonstrate nearly invariant validation loss up to 16× the training context, highlighting its effectiveness in long-context regimes.

Softplus Attention with Re-weighting (LSSAR) is an attention mechanism for LLMs that replaces the standard Softmax nonlinearity with the Softplus activation, introduces a dynamic, row-dependent scaling factor based on entropy invariance, and applies a parametric re-weighting scheme to improve length extrapolation and stability at large sequence lengths. LSSAR is designed to address the numerical instability and degraded performance observed in standard Softmax attention when inference sequence lengths markedly exceed those seen during training. It delivers substantially flatter validation loss up to 16× the training context size and maintains effective attention distributions even in challenging long-context regimes (Gao et al., 23 Jan 2025).

1. Decomposition of Softmax Attention

Standard self-attention employs Q (query), K (key), and V (value) projections, producing pairwise scores S = QK^\top (optionally masked). The classic Softmax is then applied row-wise, defined for x \in \mathbb{R}^L as

\mathrm{Softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^L e^{x_j}}

This can be recast as a two-step operation:

  1. A pointwise nonlinear transformation \phi(x) = e^x
  2. Row-wise \ell_1 normalization: x \mapsto x / \|x\|_1

Explicitly, the attention weights for a (masked) attention matrix S are

A_{i,j} = \frac{\phi(S_{i,j})}{\sum_{k} \phi(S_{i,k})}

The \ell_1 normalization is essential to maintain valid attention distributions, while the exponential map \phi is responsible for producing sparse, sharply peaked attention. Empirically, however, as sequence lengths increase during inference, this exponential can lead to numerical instability and attention vanishing or saturating at extremes, diminishing extrapolative capacity (Gao et al., 23 Jan 2025).
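This two-step decomposition is easy to verify numerically. A minimal NumPy sketch (illustrative only, not from the paper):

```python
import numpy as np

def softmax(x):
    # Standard softmax (max subtracted for numerical stability).
    z = np.exp(x - x.max())
    return z / z.sum()

def phi_then_l1(x):
    # Step 1: pointwise exponential map phi(x) = e^x.
    # Step 2: l1 normalization x -> x / ||x||_1.
    z = np.exp(x)
    return z / np.abs(z).sum()

scores = np.array([1.2, -0.3, 0.7, 2.5])
print(np.allclose(softmax(scores), phi_then_l1(scores)))  # True: same operation
```

The max-subtraction in `softmax` cancels in the ratio, so both routes yield identical weights.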

2. Softplus Activation as Nonlinear Substitute

The Softplus activation function is defined as

\mathrm{Softplus}(x) = \log(1 + e^x)

Notable characteristics include:

  • Strictly increasing, smooth (C^\infty)
  • Linear growth for large x (\mathrm{Softplus}(x) \sim x as x \to \infty), moderating the runaway effects of exponentials
  • Non-negative outputs, preserving the role of Softmax’s exponentiation
  • The derivative is the sigmoid function, \sigma(x) = 1/(1+e^{-x}), bounded in (0, 1), which confers bounded sensitivity to input changes

By substituting \phi(x) with \mathrm{Softplus}(x) in the attention computation, the numerical instabilities associated with exponentiation are mitigated, and the flattening or oversharpening of distributions at extreme context lengths is reduced (Gao et al., 23 Jan 2025).
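The contrast in growth behavior can be seen directly: for large inputs e^x overflows in floating point, while a numerically stable Softplus stays finite and grows linearly. A small sketch (illustrative):

```python
import numpy as np

def softplus(x):
    # Numerically stable: log(1 + e^x) = max(x, 0) + log(1 + e^{-|x|}).
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

x = np.array([-50.0, 0.0, 50.0, 1000.0])
with np.errstate(over="ignore"):
    exp_vals = np.exp(x)      # overflows to inf at x = 1000
sp_vals = softplus(x)         # finite everywhere; ~x for large x

print(exp_vals[-1])           # inf
print(sp_vals[-1])            # 1000.0 (linear regime)
```

The stable form avoids evaluating e^x for large positive x, which is why `softplus(1000.0)` returns exactly the linear asymptote.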

3. Dynamic Entropy-Aware Scaling

While classically the scaled dot-product attention uses a fixed temperature 1/\sqrt{d}, recent findings indicate that scaling proportional to \log L / \sqrt{d} (with L the context length) better preserves the entropy of the attention as L varies (Su, 2021; Chiang & Cholak, 2022). LSSAR generalizes this insight by defining a scaling factor that varies with the row index i (i.e., the current context length):

\mathrm{scale}_i = \frac{\log d}{\log i}

where d is the projection dimension.

For each row i and column j of the attention matrix, the scaled dot product prior to the nonlinearity is:

S_{i,j} = \mathrm{scale}_i \cdot (Q_i \cdot K_j)

The adaptive scale preserves the effective entropy of the attention distribution as the context grows, preventing undue sharpening or flattening at atypical sequence lengths.

After Softplus activation:

A_{i,j} = \mathrm{Softplus}(S_{i,j}) \cdot M'_{i,j}

where M' denotes the (causal) attention mask; a final row-wise \ell_1 normalization then yields valid attention weights.
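The stages so far (dynamic scaling, Softplus, masking, row normalization) can be sketched in NumPy as follows. Clamping the row index at i = 2 to avoid dividing by \log 1 = 0 is our assumption, not something the text above specifies:

```python
import numpy as np

def softplus(x):
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def scaled_softplus_weights(Q, K, d):
    # Dynamic scaling + Softplus + causal mask + row l1 normalization.
    L = Q.shape[0]
    S = Q @ K.T                                    # raw dot products
    i = np.arange(1, L + 1)[:, None]               # 1-based row positions
    scale = np.log(d) / np.log(np.maximum(i, 2))   # scale_i = log d / log i (i = 1 clamped; assumption)
    mask = np.tril(np.ones((L, L)))                # causal mask M'
    B = softplus(scale * S) * mask
    return B / B.sum(axis=1, keepdims=True)        # row-wise l1 normalization

rng = np.random.default_rng(0)
L, d = 6, 8
Q, K = rng.normal(size=(L, d)), rng.normal(size=(L, d))
A = scaled_softplus_weights(Q, K, d)
print(np.allclose(A.sum(axis=1), 1.0))  # each row is a valid distribution
```

Because Softplus is strictly positive, every unmasked entry contributes, and each row sums to one after normalization.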

4. Re-weighting Mechanism for Attention Sharpening

For very long contexts, attention scores may become diffuse (“flattening”), impeding the ability to focus on salient tokens. LSSAR applies a two-stage re-weighting after the row normalization:

  1. Shift and scaling: A' = A \odot N - O, where N_{i,j} = i encodes the row position, O is the all-ones matrix, and \odot denotes the elementwise product. Because row i of A has i active entries summing to 1 (mean 1/i), multiplying by i and subtracting 1 centers each row at zero mean and mitigates position-dependent bias.
  2. Parametric ReLU-power sharpening: C_{i,j} = (\max(A'_{i,j}, 0))^p, where p is a positive exponent. Increasing p widens the separation between large and small entries, strongly favoring the maximal attention weights.

Finally, another \ell_1 normalization is applied to each row C_{i,:}, yielding the re-weighted attention:

A^{RW}_{i,j} = \frac{C_{i,j}}{\sum_k C_{i,k}}

For practical purposes, p is selected in \{3, \ldots, 15\} depending on sequence length, ensuring sharpness without inducing instabilities. Softplus’s bounded gradient ensures the procedure remains numerically robust even as p is increased (Gao et al., 23 Jan 2025).
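The two-stage re-weighting can be sketched as below. The fallback for rows where sharpening zeroes every entry (e.g. row i = 1, whose single entry equals its row mean) is our assumption rather than the paper's stated handling:

```python
import numpy as np

def reweight(A, p=3):
    # Two-stage re-weighting of a row-normalized causal attention matrix A.
    L = A.shape[0]
    i = np.arange(1, L + 1)[:, None]     # N[i, j] = i (row position)
    A_shift = A * i - 1.0                # shift: centers each active row at zero mean
    C = np.maximum(A_shift, 0.0) ** p    # ReLU, then power-p sharpening
    C *= np.tril(np.ones((L, L)))        # keep causal support
    s = C.sum(axis=1, keepdims=True)
    # Fallback (assumption): keep the original row where sharpening zeroed it.
    return np.where(s > 0, C / np.where(s > 0, s, 1.0), A)

rng = np.random.default_rng(0)
A = np.tril(rng.random((5, 5)))
A /= A.sum(axis=1, keepdims=True)
A_rw = reweight(A, p=3)
print(np.allclose(A_rw.sum(axis=1), 1.0))  # rows remain valid distributions
```

Note that entries below the row mean are discarded by the ReLU, so the re-weighted rows concentrate all mass on the above-average positions.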

5. Unified LSSAR Attention Computation

The full LSSAR procedure for the attention weights A^{\mathrm{LSSAR}} comprises:

  1. Raw dot-products: S = QK^\top
  2. Dynamic scaling: \hat{S} = S \odot (\log d / \log N)
  3. Softplus nonlinearity with mask: B = \mathrm{Softplus}(\hat{S}) \odot M'
  4. Row-wise \ell_1 normalization: A = B / \lVert B \rVert_1
  5. Shift & ReLU-power: C = (\max(A \odot N - 1,\,0))^p
  6. Final \ell_1 normalization: A^{\mathrm{LSSAR}} = C / \lVert C \rVert_1
  7. Weighted value aggregation: O = A^{\mathrm{LSSAR}} V

This construction maintains numerical stability, length-adaptive entropy, and peakedness of attention simultaneously (Gao et al., 23 Jan 2025).
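Putting the seven steps together, a compact end-to-end NumPy sketch (same assumptions as flagged earlier: 1-based row index clamped at i = 2 in the scale, and a fallback for rows zeroed by sharpening; function and parameter names are ours):

```python
import numpy as np

def softplus(x):
    # Stable softplus: log(1 + e^x) = max(x, 0) + log(1 + e^{-|x|}).
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def lssar_attention(Q, K, V, p=3):
    # Illustrative sketch of steps 1-7 of the LSSAR pipeline.
    L, d = Q.shape
    i = np.arange(1, L + 1)[:, None]                      # 1-based row positions
    mask = np.tril(np.ones((L, L)))                       # causal mask M'
    S = Q @ K.T                                           # 1. raw dot products
    S_hat = (np.log(d) / np.log(np.maximum(i, 2))) * S    # 2. dynamic scaling (i = 1 clamped; assumption)
    B = softplus(S_hat) * mask                            # 3. Softplus + mask
    A = B / B.sum(axis=1, keepdims=True)                  # 4. row l1 normalization
    C = np.maximum(A * i - 1.0, 0.0) ** p * mask          # 5. shift & ReLU-power
    s = C.sum(axis=1, keepdims=True)
    A_rw = np.where(s > 0, C / np.where(s > 0, s, 1.0), A)  # 6. final l1 norm (fallback for zeroed rows)
    return A_rw @ V                                       # 7. weighted value aggregation

rng = np.random.default_rng(0)
L, d = 8, 16
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
O = lssar_attention(Q, K, V, p=3)
print(O.shape)  # (8, 16)
```

Every intermediate stays bounded: Softplus grows linearly, the ReLU-power operates on row-normalized values in [0, 1] scaled by the row index, which is consistent with the stability claims above.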

6. Experimental Methodology

The LSSAR mechanism was evaluated using a GPT-2-small model (124M parameters), with Rotary Positional Embeddings (RoPE) and NTK scaling for extrapolation. Training was conducted on eight A100 80GB GPUs for 18,865 steps with FineWeb-10B (10.2B tokens), and L_\mathrm{train} = 1024. Specific hyperparameters:

  • Softplus replaces e^{(\cdot)} in the attention weights
  • Dynamic scale: \log d / \log N with d = 64
  • Re-weighting exponent p tuned per inference length (p \in \{3, \ldots, 15\})

Performance was evaluated on validation loss at inference context lengths of up to 16× the training length.

7. Quantitative Analysis: Length Extrapolation and Stability

In comparative evaluation, LSSAR demonstrated nearly constant or very slowly increasing validation loss as inference context grew:

  • Softmax baseline: Validation loss increased rapidly (3.19 at 1K, 4.17 at 2K, 5.45 at 4K, 6.28 at 8K tokens, and failure at 16K).
  • LSSAR attention: Validation loss was nearly invariant (3.18 at 1K, 4.23 at 2K, 5.40 at 4K, 6.30 at 8K, and re-centered 3.31 at 16K).

Ablation studies indicated that increasing the re-weight exponent pp in the Softmax baseline quickly induced gradient explosion, while LSSA (Softplus, no re-weight) remained stable for very large pp. LSSAR thus enables sharpening attention distributions at large scale without the risk of instability inherent to exponentiation-based formulations. The findings implicate Softplus’s bounded derivatives and entropy-aware dynamic scaling as critical for robust long-context extrapolation (Gao et al., 23 Jan 2025).

