Softplus Attention with Re-weighting (LSSAR)
- The paper introduces LSSAR, replacing Softmax with Softplus activation and integrating a dynamic, row-dependent scaling factor to maintain attention entropy.
- It employs a novel re-weighting mechanism that sharpens attention distributions and mitigates numerical instability over exceedingly long sequence lengths.
- Experimental results demonstrate nearly invariant validation loss up to 16× the training context, highlighting its effectiveness in long-context regimes.
Softplus Attention with Re-weighting (LSSAR) is an attention mechanism for LLMs that replaces the standard Softmax nonlinearity with the Softplus activation, introduces a dynamic, row-dependent scaling factor motivated by entropy invariance, and applies a parametric re-weighting scheme to enhance the model’s length extrapolation and stability at large sequence lengths. LSSAR is designed to address the numerical instability and degraded performance observed in standard Softmax attention as inference sequence lengths markedly exceed those seen during training, delivering substantially flatter validation loss up to 16× the training context size and enabling effective attention distributions even in challenging long-context regimes (Gao et al., 23 Jan 2025).
1. Decomposition of Softmax Attention
Standard self-attention employs Q (query), K (key), and V (value) projections, producing pairwise scores $c_{ij} = \mathbf{q}_i \cdot \mathbf{k}_j$ (optionally masked). The classic Softmax is then applied row-wise, defined for a row of scores $z_i \in \mathbb{R}^n$ as

$$\mathrm{Softmax}(z_i)_j = \frac{e^{z_{ij}}}{\sum_{k} e^{z_{ik}}}.$$
This can be recast as a two-step operation:
- A pointwise nonlinear transformation: $u_{ij} = e^{z_{ij}}$
- Row-wise normalization: $a_{ij} = u_{ij} / \sum_{k} u_{ik}$

Explicitly, the attention weights for a (masked) attention matrix are

$$a_{ij} = \frac{e^{z_{ij}}\, m_{ij}}{\sum_{k} e^{z_{ik}}\, m_{ik}},$$

where $m_{ij} \in \{0, 1\}$ is the (causal) mask.
The normalization is essential to maintain valid attention distributions, while the exponential map is responsible for producing sparse, sharply peaked attention. Empirically, however, as sequence lengths increase during inference, this exponential can lead to numerical instability and attention vanishing or saturating at extremes, diminishing extrapolative capacity (Gao et al., 23 Jan 2025).
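The decomposition is directly executable. The following minimal sketch (our illustration, not code from the paper) implements masked Softmax as exactly these two steps and checks the result against PyTorch's built-in `softmax`:

```python
import torch

def softmax_decomposed(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Masked Softmax written as pointwise exp followed by row normalization."""
    u = torch.exp(scores) * mask            # step 1: pointwise nonlinearity (masked)
    return u / u.sum(dim=-1, keepdim=True)  # step 2: row-wise normalization

n = 8
scores = torch.randn(n, n)
mask = torch.tril(torch.ones(n, n))         # causal 0/1 mask
a = softmax_decomposed(scores, mask)
ref = torch.softmax(scores.masked_fill(mask == 0, float("-inf")), dim=-1)
assert torch.allclose(a, ref, atol=1e-6)    # the two formulations agree
```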
2. Softplus Activation as Nonlinear Substitute
The Softplus activation function is defined as

$$\mathrm{Softplus}(x) = \log\left(1 + e^{x}\right).$$
Notable characteristics include:
- Strictly increasing and smooth ($C^\infty$)
- Linear growth for large inputs ($\mathrm{Softplus}(x) \to x$ as $x \to \infty$), moderating the runaway effects of exponentials
- Non-negative outputs, preserving the role of Softmax’s exponentiation
- The derivative is the sigmoid function, $\mathrm{Softplus}'(x) = \sigma(x) = \frac{1}{1+e^{-x}}$, bounded in $(0,1)$, which confers bounded sensitivity to input changes
By substituting $e^{x}$ with $\mathrm{Softplus}(x)$ in the attention computation, the numerical instabilities associated with exponentiation are mitigated, and the flattening or oversharpening of distributions at extreme context lengths is reduced (Gao et al., 23 Jan 2025).
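A quick numerical check (illustrative, not from the paper) makes the contrast concrete: exp overflows float32 at moderately large scores, while Softplus transitions smoothly to linear growth and its derivative, the sigmoid, remains bounded:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-20.0, 0.0, 20.0, 100.0])
print(torch.exp(x))      # ~[2.06e-09, 1.0000, 4.85e+08, inf]  -- overflows for large x
print(F.softplus(x))     # ~[2.06e-09, 0.6931, 20.0000, 100.0] -- linear growth, no overflow
print(torch.sigmoid(x))  # ~[2.06e-09, 0.5000, 1.0000, 1.0000] -- Softplus' derivative, in (0, 1)
```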
3. Dynamic Entropy-Aware Scaling
While classically the scaled dot-product attention uses a fixed temperature $\sqrt{d}$, recent findings indicate that scaling proportional to $\log n$ (with $n$ the context length) better preserves the entropy of the attention distribution as $n$ varies (Su, 2021; Chiang & Cholak, 2022). LSSAR generalizes this insight by defining a scaling factor that varies with the row index $i$ (i.e., the current context length $i$):

$$s_i = \frac{\log i}{\sqrt{d}},$$

where $d$ is the projection dimension.
For each row $i$ and column $j$ of the attention matrix, the scaled dot product prior to the nonlinearity is

$$z_{ij} = s_i\,(\mathbf{q}_i \cdot \mathbf{k}_j).$$

The adaptive scale preserves the effective entropy of the attention distribution as the context grows, preventing undue sharpening or flattening at atypical sequence lengths.

After Softplus activation,

$$u_{ij} = \mathrm{Softplus}(z_{ij})\, m_{ij},$$

the final row normalization $a_{ij} = u_{ij} / \sum_{k} u_{ik}$ yields valid attention weights.
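Steps one through four can be sketched as follows, using the reconstruction $s_i = \log i / \sqrt{d}$ above; this is our reading of the construction, not reference code from the paper:

```python
import math
import torch
import torch.nn.functional as F

def softplus_attention_weights(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Length-adaptive Softplus attention weights for one head (our sketch).

    q, k: (n, d) query and key matrices.
    """
    n, d = q.shape
    mask = torch.tril(torch.ones(n, n, dtype=q.dtype))  # causal 0/1 mask
    c = q @ k.T                                         # raw dot products c_ij
    i = torch.arange(1, n + 1, dtype=q.dtype)           # row index = current context length
    s = torch.log(i.clamp(min=2.0)) / math.sqrt(d)      # dynamic scale s_i (clamp keeps s_1 > 0)
    u = F.softplus(s[:, None] * c) * mask               # masked Softplus scores
    return u / u.sum(dim=-1, keepdim=True)              # row normalization -> valid weights
```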
4. Re-weighting Mechanism for Attention Sharpening
For very long contexts, attention scores may become diffuse (“flattening”), impeding the ability to focus on salient tokens. LSSAR applies a two-stage re-weighting after the row normalization:
- Shift and scaling: $\hat{A} = A - D\,(\mathbf{1}\mathbf{1}^{\top} \odot M)$, where $D = \mathrm{diag}(1, \tfrac{1}{2}, \dots, \tfrac{1}{n})$ encodes row position, $\mathbf{1}\mathbf{1}^{\top}$ is the all-ones matrix, $M$ is the causal mask, and $\odot$ denotes the elementwise product. Subtracting the row mean $\tfrac{1}{i}$ ensures each row has zero mean and mitigates bias as a function of position.
- Parametric ReLU-power sharpening: $\tilde{A} = \mathrm{ReLU}(\hat{A})^{p}$ (elementwise), where $p > 0$ is a positive exponent. Increasing $p$ increases the separation between large and small entries, strongly favoring maximal attention weights.
Finally, another normalization is applied to each row, yielding the re-weighted attention

$$a^{\mathrm{rw}}_{ij} = \frac{\tilde{a}_{ij}}{\sum_{k} \tilde{a}_{ik}}.$$

For practical purposes, $p$ is selected according to the sequence length, ensuring sharpness without inducing instabilities. Softplus’s bounded gradient ensures the procedure is numerically robust even as $p$ is increased (Gao et al., 23 Jan 2025).
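A minimal sketch of the two-stage re-weighting, assuming a causal 0/1 mask and an illustrative exponent $p = 2$ (the paper tunes $p$ per inference length):

```python
import torch

def reweight(a: torch.Tensor, mask: torch.Tensor, p: float = 2.0) -> torch.Tensor:
    """Center each row at its mean 1/i, ReLU-power sharpen, then renormalize.

    a: (n, n) row-normalized attention; mask: (n, n) causal 0/1 mask.
    """
    counts = mask.sum(dim=-1, keepdim=True)   # row i has i unmasked entries
    centered = a - mask / counts              # subtract the row mean 1/i (masked entries stay 0)
    sharpened = torch.relu(centered) ** p     # keep only above-mean entries, sharpen by p
    z = sharpened.sum(dim=-1, keepdim=True)
    # Degenerate rows (e.g. row 1, whose single entry equals its mean) are kept as-is.
    return torch.where(z > 0, sharpened / z.clamp_min(1e-12), a)
```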
5. Unified LSSAR Attention Computation
The full LSSAR procedure for attention weights comprises:
| Step | Operation | Notation |
|---|---|---|
| 1 | Raw dot-products | $c_{ij} = \mathbf{q}_i \cdot \mathbf{k}_j$ |
| 2 | Dynamic scaling | $z_{ij} = s_i\, c_{ij}$, with $s_i = \log i / \sqrt{d}$ |
| 3 | Softplus nonlinearity with mask | $u_{ij} = \mathrm{Softplus}(z_{ij})\, m_{ij}$ |
| 4 | Row normalization | $a_{ij} = u_{ij} / \sum_{k} u_{ik}$ |
| 5 | Shift & ReLU-power | $\tilde{a}_{ij} = \mathrm{ReLU}\left(a_{ij} - \tfrac{1}{i} m_{ij}\right)^{p}$ |
| 6 | Final normalization | $a^{\mathrm{rw}}_{ij} = \tilde{a}_{ij} / \sum_{k} \tilde{a}_{ik}$ |
| 7 | Weighted value aggregation | $\mathbf{o}_i = \sum_{j} a^{\mathrm{rw}}_{ij}\, \mathbf{v}_j$ |
This construction maintains numerical stability, length-adaptive entropy, and peakedness of attention simultaneously (Gao et al., 23 Jan 2025).
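Assembling the table into one routine gives the following end-to-end sketch of a single LSSAR head; the scale $s_i = \log i / \sqrt{d}$ and $p = 2$ follow the reconstructions above and are assumptions, not the paper's reference implementation:

```python
import math
import torch
import torch.nn.functional as F

def lssar_attention(q, k, v, p: float = 2.0) -> torch.Tensor:
    """Single-head LSSAR attention over (n, d) query/key/value matrices."""
    n, d = q.shape
    mask = torch.tril(torch.ones(n, n, dtype=q.dtype))

    c = q @ k.T                                                    # 1. raw dot products
    i = torch.arange(1, n + 1, dtype=q.dtype)
    z = (torch.log(i.clamp(min=2.0)) / math.sqrt(d))[:, None] * c  # 2. dynamic scaling
    u = F.softplus(z) * mask                                       # 3. Softplus with mask
    a = u / u.sum(dim=-1, keepdim=True)                            # 4. row normalization
    centered = a - mask / mask.sum(dim=-1, keepdim=True)           # 5. shift by row mean ...
    t = torch.relu(centered) ** p                                  #    ... then ReLU-power
    zs = t.sum(dim=-1, keepdim=True)
    a_rw = torch.where(zs > 0, t / zs.clamp_min(1e-12), a)         # 6. final normalization
    return a_rw @ v                                                # 7. weighted value aggregation

out = lssar_attention(torch.randn(16, 64), torch.randn(16, 64), torch.randn(16, 64))
```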
6. Experimental Methodology
The LSSAR mechanism was evaluated using a GPT-2-small model (124M parameters) with Rotary Positional Embeddings (RoPE) and NTK scaling for extrapolation. Training was conducted on eight A100 80GB GPUs for 18,865 steps on FineWeb-10B (10.2B tokens), with a training context length of 1K tokens. Specific hyperparameters:
- Softplus replaces the exponential $e^{x}$ in the attention weights
- Dynamic scale: $s_i = \log i / \sqrt{d}$, with $d$ the per-head projection dimension
- Re-weight exponent $p$ tuned per inference length
Performance was evaluated on validation loss at inference context lengths of up to 16× the training length.
7. Quantitative Analysis: Length Extrapolation and Stability
In comparative evaluation, LSSAR remained stable and effective as the inference context grew, whereas the Softmax baseline degraded and ultimately broke down:
- Softmax baseline: validation loss increased rapidly beyond the training context (3.19 at 1K, 4.17 at 2K, 5.45 at 4K, 6.28 at 8K tokens) and failed outright at 16K.
- LSSAR attention: validation loss was comparable at moderate extrapolation (3.18 at 1K, 4.23 at 2K, 5.40 at 4K, 6.30 at 8K) and, after re-centering, reached 3.31 at 16K, where the baseline fails entirely.
Ablation studies indicated that increasing the re-weight exponent $p$ in the Softmax baseline quickly induced gradient explosion, while LSSA (Softplus, no re-weighting) remained stable even for very large $p$. LSSAR thus enables sharpened attention distributions at large scale without the risk of instability inherent to exponentiation-based formulations. The findings implicate Softplus’s bounded derivative and entropy-aware dynamic scaling as critical for robust long-context extrapolation (Gao et al., 23 Jan 2025).