Softplus Attention with Re-weighting (LSSAR)
- The paper introduces LSSAR, replacing Softmax with Softplus activation and integrating a dynamic, row-dependent scaling factor to maintain attention entropy.
- It employs a novel re-weighting mechanism that sharpens attention distributions and mitigates numerical instability over exceedingly long sequence lengths.
- Experimental results demonstrate nearly invariant validation loss up to 16× the training context, highlighting its effectiveness in long-context regimes.
Softplus Attention with Re-weighting (LSSAR) is an attention mechanism for LLMs that replaces the standard Softmax nonlinearity with the Softplus activation, introduces a dynamic, row-dependent scaling factor motivated by entropy invariance, and applies a parametric re-weighting scheme to enhance the model’s length extrapolation and stability at large sequence lengths. LSSAR is designed to address the numerical instability and degraded performance observed in standard Softmax attention as inference sequence lengths markedly exceed those seen during training, delivering substantially flatter validation loss up to 16× the training context size and enabling effective attention distributions even in challenging long-context regimes (Gao et al., 23 Jan 2025).
1. Decomposition of Softmax Attention
Standard self-attention employs Q (query), K (key), and V (value) projections, producing pairwise scores $c_{ij} = \mathbf{q}_i \cdot \mathbf{k}_j$ (optionally masked). The classic Softmax is then applied row-wise, defined for a row of scores $z_i \in \mathbb{R}^n$ as

$$\mathrm{Softmax}(z_i)_j = \frac{e^{z_{ij}}}{\sum_{k} e^{z_{ik}}}.$$
This can be recast as a two-step operation:
- A pointwise nonlinear transformation: $u_{ij} = e^{z_{ij}}$
- Row-wise normalization: $a_{ij} = u_{ij} / \sum_{k} u_{ik}$

Explicitly, the attention weights for a (masked) attention matrix are

$$a_{ij} = \frac{e^{z_{ij}}\, m_{ij}}{\sum_{k} e^{z_{ik}}\, m_{ik}},$$

where $m_{ij} \in \{0, 1\}$ is the (causal) mask.
The normalization is essential to maintain valid attention distributions, while the exponential map is responsible for producing sparse, sharply peaked attention. Empirically, however, as sequence lengths increase during inference, this exponential can lead to numerical instability and attention vanishing or saturating at extremes, diminishing extrapolative capacity (Gao et al., 23 Jan 2025).
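The decomposition is directly executable. The following minimal sketch (our illustration, not code from the paper) implements masked Softmax as exactly these two steps and checks the result against PyTorch's built-in `softmax`:

```python
import torch

def softmax_decomposed(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Masked Softmax written as pointwise exp followed by row normalization."""
    u = torch.exp(scores) * mask            # step 1: pointwise nonlinearity (masked)
    return u / u.sum(dim=-1, keepdim=True)  # step 2: row-wise normalization

n = 8
scores = torch.randn(n, n)
mask = torch.tril(torch.ones(n, n))         # causal 0/1 mask
a = softmax_decomposed(scores, mask)
ref = torch.softmax(scores.masked_fill(mask == 0, float("-inf")), dim=-1)
assert torch.allclose(a, ref, atol=1e-6)    # the two formulations agree
```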
2. Softplus Activation as Nonlinear Substitute
The Softplus activation function is defined as

$$\mathrm{Softplus}(x) = \log\left(1 + e^{x}\right).$$
Notable characteristics include:
- Strictly increasing and smooth ($C^\infty$)
- Linear growth for large inputs ($\mathrm{Softplus}(x) \to x$ as $x \to \infty$), moderating the runaway effects of exponentials
- Non-negative outputs, preserving the role of Softmax’s exponentiation
- The derivative is the sigmoid function, $\mathrm{Softplus}'(x) = \sigma(x) = \frac{1}{1+e^{-x}}$, bounded in $(0,1)$, which confers bounded sensitivity to input changes
By substituting $e^{x}$ with $\mathrm{Softplus}(x)$ in the attention computation, the numerical instabilities associated with exponentiation are mitigated, and the flattening or oversharpening of distributions at extreme context lengths is reduced (Gao et al., 23 Jan 2025).
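A quick numerical check (illustrative, not from the paper) makes the contrast concrete: exp overflows float32 at moderately large scores, while Softplus transitions smoothly to linear growth and its derivative, the sigmoid, remains bounded:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-20.0, 0.0, 20.0, 100.0])
print(torch.exp(x))      # ~[2.06e-09, 1.0000, 4.85e+08, inf]  -- overflows for large x
print(F.softplus(x))     # ~[2.06e-09, 0.6931, 20.0000, 100.0] -- linear growth, no overflow
print(torch.sigmoid(x))  # ~[2.06e-09, 0.5000, 1.0000, 1.0000] -- Softplus' derivative, in (0, 1)
```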
3. Dynamic Entropy-Aware Scaling
While classically the scaled dot-product attention uses a fixed temperature $\sqrt{d}$, recent findings indicate that scaling proportional to $\log n$ (with $n$ the context length) better preserves the entropy of the attention distribution as $n$ varies (Su, 2021; Chiang & Cholak, 2022). LSSAR generalizes this insight by defining a scaling factor that varies with the row index $i$ (i.e., the current context length $i$):

$$s_i = \frac{\log i}{\sqrt{d}},$$

where $d$ is the projection dimension.
For each row $i$ and column $j$ of the attention matrix, the scaled dot product prior to the nonlinearity is

$$z_{ij} = s_i\,(\mathbf{q}_i \cdot \mathbf{k}_j).$$

The adaptive scale preserves the effective entropy of the attention distribution as the context grows, preventing undue sharpening or flattening at atypical sequence lengths.

After Softplus activation,

$$u_{ij} = \mathrm{Softplus}(z_{ij})\, m_{ij},$$

the final row normalization $a_{ij} = u_{ij} / \sum_{k} u_{ik}$ yields valid attention weights.
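Steps one through four can be sketched as follows, using the reconstruction $s_i = \log i / \sqrt{d}$ above; this is our reading of the construction, not reference code from the paper:

```python
import math
import torch
import torch.nn.functional as F

def softplus_attention_weights(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Length-adaptive Softplus attention weights for one head (our sketch).

    q, k: (n, d) query and key matrices.
    """
    n, d = q.shape
    mask = torch.tril(torch.ones(n, n, dtype=q.dtype))  # causal 0/1 mask
    c = q @ k.T                                         # raw dot products c_ij
    i = torch.arange(1, n + 1, dtype=q.dtype)           # row index = current context length
    s = torch.log(i.clamp(min=2.0)) / math.sqrt(d)      # dynamic scale s_i (clamp keeps s_1 > 0)
    u = F.softplus(s[:, None] * c) * mask               # masked Softplus scores
    return u / u.sum(dim=-1, keepdim=True)              # row normalization -> valid weights
```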
4. Re-weighting Mechanism for Attention Sharpening
For very long contexts, attention scores may become diffuse (“flattening”), impeding the ability to focus on salient tokens. LSSAR applies a two-stage re-weighting after the row normalization:
- Shift and scaling: $\hat{A} = A - D\,(\mathbf{1}\mathbf{1}^{\top} \odot M)$, where $D = \mathrm{diag}(1, \tfrac{1}{2}, \dots, \tfrac{1}{n})$ encodes row position, $\mathbf{1}\mathbf{1}^{\top}$ is the all-ones matrix, $M$ is the causal mask, and $\odot$ denotes the elementwise product. Subtracting the row mean $\tfrac{1}{i}$ ensures each row has zero mean and mitigates bias as a function of position.
- Parametric ReLU-power sharpening: $\tilde{A} = \mathrm{ReLU}(\hat{A})^{p}$ (elementwise), where $p > 0$ is a positive exponent. Increasing $p$ increases the separation between large and small entries, strongly favoring maximal attention weights.
Finally, another normalization is applied to each row, yielding the re-weighted attention

$$a^{\mathrm{rw}}_{ij} = \frac{\tilde{a}_{ij}}{\sum_{k} \tilde{a}_{ik}}.$$

For practical purposes, $p$ is selected according to the sequence length, ensuring sharpness without inducing instabilities. Softplus’s bounded gradient ensures the procedure is numerically robust even as $p$ is increased (Gao et al., 23 Jan 2025).
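A minimal sketch of the two-stage re-weighting, assuming a causal 0/1 mask and an illustrative exponent $p = 2$ (the paper tunes $p$ per inference length):

```python
import torch

def reweight(a: torch.Tensor, mask: torch.Tensor, p: float = 2.0) -> torch.Tensor:
    """Center each row at its mean 1/i, ReLU-power sharpen, then renormalize.

    a: (n, n) row-normalized attention; mask: (n, n) causal 0/1 mask.
    """
    counts = mask.sum(dim=-1, keepdim=True)   # row i has i unmasked entries
    centered = a - mask / counts              # subtract the row mean 1/i (masked entries stay 0)
    sharpened = torch.relu(centered) ** p     # keep only above-mean entries, sharpen by p
    z = sharpened.sum(dim=-1, keepdim=True)
    # Degenerate rows (e.g. row 1, whose single entry equals its mean) are kept as-is.
    return torch.where(z > 0, sharpened / z.clamp_min(1e-12), a)
```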
5. Unified LSSAR Attention Computation
The full LSSAR procedure for attention weights comprises:
| Step | Operation | Notation |
|---|---|---|
| 1 | Raw dot-products | $c_{ij} = \mathbf{q}_i \cdot \mathbf{k}_j$ |
| 2 | Dynamic scaling | $z_{ij} = s_i\, c_{ij}$, with $s_i = \log i / \sqrt{d}$ |
| 3 | Softplus nonlinearity with mask | $u_{ij} = \mathrm{Softplus}(z_{ij})\, m_{ij}$ |
| 4 | Row normalization | $a_{ij} = u_{ij} / \sum_{k} u_{ik}$ |
| 5 | Shift & ReLU-power | $\tilde{a}_{ij} = \mathrm{ReLU}\left(a_{ij} - \tfrac{1}{i} m_{ij}\right)^{p}$ |
| 6 | Final normalization | $a^{\mathrm{rw}}_{ij} = \tilde{a}_{ij} / \sum_{k} \tilde{a}_{ik}$ |
| 7 | Weighted value aggregation | $\mathbf{o}_i = \sum_{j} a^{\mathrm{rw}}_{ij}\, \mathbf{v}_j$ |
This construction maintains numerical stability, length-adaptive entropy, and peakedness of attention simultaneously (Gao et al., 23 Jan 2025).
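Assembling the table into one routine gives the following end-to-end sketch of a single LSSAR head; the scale $s_i = \log i / \sqrt{d}$ and $p = 2$ follow the reconstructions above and are assumptions, not the paper's reference implementation:

```python
import math
import torch
import torch.nn.functional as F

def lssar_attention(q, k, v, p: float = 2.0) -> torch.Tensor:
    """Single-head LSSAR attention over (n, d) query/key/value matrices."""
    n, d = q.shape
    mask = torch.tril(torch.ones(n, n, dtype=q.dtype))

    c = q @ k.T                                                    # 1. raw dot products
    i = torch.arange(1, n + 1, dtype=q.dtype)
    z = (torch.log(i.clamp(min=2.0)) / math.sqrt(d))[:, None] * c  # 2. dynamic scaling
    u = F.softplus(z) * mask                                       # 3. Softplus with mask
    a = u / u.sum(dim=-1, keepdim=True)                            # 4. row normalization
    centered = a - mask / mask.sum(dim=-1, keepdim=True)           # 5. shift by row mean ...
    t = torch.relu(centered) ** p                                  #    ... then ReLU-power
    zs = t.sum(dim=-1, keepdim=True)
    a_rw = torch.where(zs > 0, t / zs.clamp_min(1e-12), a)         # 6. final normalization
    return a_rw @ v                                                # 7. weighted value aggregation

out = lssar_attention(torch.randn(16, 64), torch.randn(16, 64), torch.randn(16, 64))
```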
6. Experimental Methodology
The LSSAR mechanism was evaluated using a GPT-2-small model (124M parameters) with Rotary Positional Embeddings (RoPE) and NTK scaling for extrapolation. Training was conducted on eight A100 80GB GPUs for 18,865 steps on FineWeb-10B (10.2B tokens), with a training context length of 1K tokens. Specific hyperparameters:
- Softplus replaces the exponential $e^{x}$ in the attention weights
- Dynamic scale: $s_i = \log i / \sqrt{d}$, with $d$ the per-head projection dimension
- Re-weight exponent $p$ tuned per inference length
Performance was evaluated on validation loss at inference context lengths of up to 16× the training length.
7. Quantitative Analysis: Length Extrapolation and Stability
In comparative evaluation, LSSAR remained stable and effective as the inference context grew, whereas the Softmax baseline degraded and ultimately broke down:
- Softmax baseline: validation loss increased rapidly beyond the training context (3.19 at 1K, 4.17 at 2K, 5.45 at 4K, 6.28 at 8K tokens) and failed outright at 16K.
- LSSAR attention: validation loss was comparable at moderate extrapolation (3.18 at 1K, 4.23 at 2K, 5.40 at 4K, 6.30 at 8K) and, after re-centering, reached 3.31 at 16K, where the baseline fails entirely.
Ablation studies indicated that increasing the re-weight exponent $p$ in the Softmax baseline quickly induced gradient explosion, while LSSA (Softplus, no re-weighting) remained stable even for very large $p$. LSSAR thus enables sharpened attention distributions at large scale without the risk of instability inherent to exponentiation-based formulations. The findings implicate Softplus’s bounded derivative and entropy-aware dynamic scaling as critical for robust long-context extrapolation (Gao et al., 23 Jan 2025).