Softpick: No Attention Sink, No Massive Activations with Rectified Softmax (2504.20966v2)
Abstract: We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M and 1.8B parameter models demonstrate that softpick achieves a 0% sink rate consistently. The softpick transformers produce hidden states with significantly lower kurtosis and create sparse attention maps. Quantized models using softpick outperform softmax on standard benchmarks, with a particularly pronounced advantage at lower bit precisions. Our analysis and discussion show how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Our code is available at https://github.com/zaydzuhri/softpick-attention
Summary
- The paper presents Softpick, a modified attention function that overcomes attention sink by enabling zero outputs and stabilizes gradient behavior.
- It replaces Softmax's strict normalization with a rectified formulation, yielding sparser, more interpretable attention maps and enhanced quantization.
- Experimental results on Llama-style Transformers show training loss comparable to Softmax, alongside improved quantization behavior and reduced activation outliers.
This paper introduces Softpick, a proposed replacement for the standard Softmax function in the attention mechanism of Transformer models. The primary motivation is to address two issues commonly observed in Transformer LLMs that use Softmax attention: attention sink and massive activations.
Problem:
- Attention Sink: Softmax's sum-to-one nature forces attention heads to allocate significant weight to specific, often semantically irrelevant, tokens (like the beginning-of-sequence token), even when they should ideally pay zero attention. While often not detrimental to downstream performance, this behavior is seen as an artifact of Softmax.
- Massive Activations: Attention sink is linked to the appearance of extremely large values in the hidden states of the transformer, which grow with model scale. These "massive activations" pose significant challenges for model quantization and low-precision training techniques.
Proposed Solution: Softpick Function
The Softpick function is designed as a drop-in replacement for Softmax in the attention mechanism. It modifies the exponential and normalization steps of Softmax to avoid the strict sum-to-one constraint and allow for zero-valued outputs (rectification).
The numerically safe definition of Softpick for a vector $x \in \mathbb{R}^N$ is:
$\mathrm{Softpick}(x)_i = \dfrac{\mathrm{ReLU}\left(e^{x_i - m} - e^{-m}\right)}{\sum_{j=1}^{N} \left|e^{x_j - m} - e^{-m}\right| + \epsilon}$
where $m = \max(x)$ and $0 < \epsilon \ll 1$.
The design rationale is to retain the characteristics of Softmax that are favorable for training (a well-behaved Jacobian and bounded gradient norm) while enabling "null attention" (zero outputs) and decoupling the numerator from the denominator to break the sum-to-one constraint, thus mitigating attention sink. The use of $\left|e^{x_j - m} - e^{-m}\right|$ in the denominator lets negative inputs influence the normalization sum, ensuring gradient flow even for tokens that receive negative scores.
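The numerically safe form maps directly to a few lines of PyTorch. Below is a minimal sketch under assumptions of my own (the `eps` value and tensor signature are illustrative); it is not the authors' fused FlashAttention kernel from the linked repository.

```python
import torch

def softpick(x: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """Numerically safe softpick: rectified numerator, absolute-value denominator."""
    m = x.amax(dim=dim, keepdim=True)        # m = max(x), subtracted for stability
    e_x = torch.exp(x - m)                   # e^{x_i - m}
    e_m = torch.exp(-m)                      # e^{-m}
    num = torch.relu(e_x - e_m)              # ReLU(e^{x_i - m} - e^{-m}); can be exactly zero
    den = (e_x - e_m).abs().sum(dim=dim, keepdim=True) + eps
    return num / den                         # rows need not sum to one

# Example: a row with only non-positive scores maps to all-zero attention ("null attention")
scores = torch.tensor([[2.0, -1.0, 0.0], [-3.0, -2.0, -1.0]])
print(softpick(scores, dim=-1))
```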
The paper derives the derivative of Softpick to show its compatibility with backpropagation; it is slightly more complex than Softmax's but remains manageable in implementations that recompute inputs, such as FlashAttention.
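As a quick differentiability sanity check (not a reproduction of the paper's derivation), the sketch above can be verified against PyTorch's numerical gradient checker:

```python
import torch
from torch.autograd import gradcheck

# Double-precision inputs; gradcheck compares autograd gradients to finite differences.
x = torch.randn(4, 8, dtype=torch.double, requires_grad=True)
print(gradcheck(lambda t: softpick(t, dim=-1), (x,), eps=1e-6, atol=1e-4))  # True if they agree
```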
Implementation Details:
Softpick is implemented as a direct replacement for Softmax in the attention formula:
$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Softpick}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$
The paper provides algorithms (Algorithm 1 and 2 in Appendix A) for integrating Softpick into the FlashAttention-2 framework, confirming its compatibility with memory-efficient attention computation.
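For intuition, here is a non-fused reference version of the drop-in above. The causal-mask handling (zeroing masked positions in both numerator and denominator) and tensor shapes are assumptions on my part; the paper's actual implementation lives inside FlashAttention-2 style kernels.

```python
import math
import torch

def softpick_attention(q, k, v, eps: float = 1e-8):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # QK^T / sqrt(d_k)
    n = scores.size(-1)
    keep = torch.tril(torch.ones(n, n, dtype=torch.bool, device=scores.device))
    scores = scores.masked_fill(~keep, float("-inf"))
    m = scores.amax(dim=-1, keepdim=True)
    z = torch.exp(scores - m) - torch.exp(-m)                  # e^{x - m} - e^{-m}
    z = z.masked_fill(~keep, 0.0)                              # masked positions contribute nothing
    weights = torch.relu(z) / (z.abs().sum(dim=-1, keepdim=True) + eps)
    return weights @ v                                         # Softpick(QK^T / sqrt(d_k)) V
```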
Experimental Setup:
To validate Softpick, the authors trained two 340M parameter Llama-style Transformer models from scratch for 100,000 steps (52 billion tokens) on the fineweb-edu dataset: one with Softmax attention and one with Softpick attention. They used the flash-linear-attention repository and its Flame training framework built on torchtitan.
The training configuration (also mirrored in a sketch after this list) included:
- Hidden size: 1024
- Num. layers: 24
- Num. heads: 16
- Sequence length: 4096
- Optimizer: AdamW
- Learning rate: 3e-4 (cosine schedule)
- Global batch size: 128
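The same hyperparameters as a plain config sketch, for readability; the key names are illustrative and do not follow the Flame/torchtitan schema.

```python
train_config = {
    "hidden_size": 1024,
    "num_layers": 24,
    "num_heads": 16,
    "seq_len": 4096,
    "optimizer": "AdamW",
    "learning_rate": 3e-4,
    "lr_schedule": "cosine",
    "global_batch_size": 128,
    "train_steps": 100_000,   # ~52B tokens of fineweb-edu
}
```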
Results:
- Training: The Softpick model achieved a training loss very close to the Softmax model (0.004 difference in final loss). The gradient norms also showed comparable behavior, supporting Softpick's training stability.
- Benchmarks: Softpick maintained performance parity with Softmax on standard downstream benchmarks (ARC-e, Lambada, PIQA, SciQ) and achieved better perplexity on Wikitext and Lambada:

| Task | Metric | Softmax | Softpick | Δ (Softpick − Softmax) |
| :--- | :--- | ---: | ---: | ---: |
| ARC-Easy | Acc Norm ↑ | 56.31 | 56.61 | +0.30 |
| | Acc ↑ | 60.61 | 60.82 | +0.21 |
| Lambada | Acc ↑ | 36.35 | 36.21 | −0.14 |
| | Perplexity ↓ | 30.33 | 28.67 | −1.66 |
| PIQA | Acc Norm ↑ | 66.43 | 66.27 | −0.16 |
| | Acc ↑ | 66.87 | 66.43 | −0.44 |
| SciQ | Acc Norm ↑ | 75.10 | 77.20 | +2.10 |
| | Acc ↑ | 83.30 | 83.60 | +0.30 |
| Wikitext | Perplexity ↓ | 204.94 | 141.56 | −63.38 |

- Quantization: Softpick models consistently outperformed Softmax models when quantized (using HQQ, bitsandbytes, and GPTQ), especially at lower bit precisions (2-bit, 3-bit). This suggests Softpick makes models more amenable to quantization. Full results are in Appendix C.
- Attention Maps: Softpick produced sparser and more legible attention maps than Softmax, with scores concentrated in specific regions and many exactly zero entries (visualizations in Figure 1 and Appendix D; a diagnostic sketch follows this list). Softpick achieved 46.97% attention map sparsity on average, compared to 0% for Softmax.
- Attention Sink: Softpick successfully eliminated attention sink, achieving a 0% sink rate at both thresholds $\epsilon_s = 0.3$ and $\epsilon_s = 0.6$ (Table 2).
- Massive Activations: Softpick dramatically reduced massive activations in hidden states. Kurtosis of hidden state activations dropped from 33510 to 340. Minimum and maximum activation values were also significantly reduced (e.g., Max from 240.27 to 36.21), except for the last layer (Table 2, Figure 2).
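A rough sketch of how these three diagnostics (sparsity, sink rate, kurtosis) can be computed from captured attention maps and hidden states. The exact metric definitions in the paper are not reproduced here; in particular, the sink-rate convention below (mean attention to the first token exceeding $\epsilon_s$) is an assumption.

```python
import torch

def attention_sparsity(attn: torch.Tensor) -> float:
    """Fraction of exactly-zero attention weights (softpick can output true zeros)."""
    return (attn == 0).float().mean().item()

def sink_rate(attn: torch.Tensor, eps_s: float = 0.3) -> float:
    """Share of heads whose mean attention to the first token exceeds eps_s.
    attn: (layers, heads, q_len, k_len). Threshold convention assumed."""
    first_tok = attn[..., 0].mean(dim=-1)        # average over query positions -> (layers, heads)
    return (first_tok > eps_s).float().mean().item()

def kurtosis(hidden: torch.Tensor) -> float:
    """Simple (non-excess) kurtosis of hidden-state activations; massive activations inflate it."""
    h = hidden.flatten().float()
    z = (h - h.mean()) / h.std()
    return (z ** 4).mean().item()
```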
Analysis and Implications:
- Quantization: By removing massive activations, Softpick alleviates a major challenge for quantization, leading to better performance at low bit precisions. This could enable simpler and more efficient quantization algorithms.
- Low-Precision Training: Massive activations are a key obstacle for stable low-precision training. Softpick's elimination of these outliers is hypothesized to allow simpler low-bit training schemes without divergence issues.
- Sparsity: Softpick induces sparsity in attention maps. This inherent sparsity could be exploited for inference acceleration with sparse matrix multiplication kernels or by skipping computation on zero scores (see the sketch after this list).
- Active-Dormant Heads and Pruning: Softpick heads exhibit clearer "active" or "dormant" behavior. Dormant heads can be easily identified as those with all-zero attention scores, potentially simplifying attention head pruning techniques.
- Interpretability: The sparser, more concentrated attention maps produced by Softpick are argued to be more visually interpretable than dense Softmax maps, which could aid in understanding model behavior.
- Other Modalities: Attention sink and outliers are observed in other modalities like vision transformers and ASR models. Softpick could potentially be applied to these architectures to mitigate similar issues without requiring dedicated tokens or architectural modifications.
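A toy sketch of the sparsity and dormant-head ideas above, assuming attention maps with exact zeros have already been captured. The per-head loop and COO conversion are illustrative, not an optimized kernel.

```python
import torch

def dormant_heads(attn: torch.Tensor) -> torch.Tensor:
    """Boolean mask over heads whose attention map is entirely zero (pruning candidates).
    attn: (heads, q_len, k_len)."""
    return attn.abs().sum(dim=(-2, -1)) == 0

def sparse_attention_output(attn: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Exploit zero scores via sparse matmul, skipping dormant heads entirely.
    attn: (heads, q_len, k_len); v: (heads, k_len, head_dim)."""
    out = torch.zeros(attn.size(0), attn.size(1), v.size(-1), device=v.device, dtype=v.dtype)
    active = ~dormant_heads(attn)
    for h in active.nonzero(as_tuple=True)[0]:
        out[h] = torch.sparse.mm(attn[h].to_sparse(), v[h])   # sparse (q_len x k_len) @ dense
    return out
```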
Limitations and Future Work:
- Long Context and Underscoring: A noted issue is that Softpick can assign smaller scores in long contexts with many negatively scored tokens, which can dilute the signal from value matrices and potentially hurt performance on tasks like passkey retrieval (Table 5). Softpick did not outperform Softmax on this specific task.
- Attempts to mitigate this with a "scalable-softpick" approach (multiplying $\mathbf{Q}\mathbf{K}^T$ by a learned scaling factor and log(sequence length) before Softpick, sketched below) did not yield significant improvements and sometimes hurt other metrics (Appendix B). Solving this underscoring issue remains an open problem.
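For reference, the attempted variant as described above can be sketched as follows; the exact placement of the learned scale `s` relative to the $\sqrt{d_k}$ division in the authors' code is an assumption based on this description.

```python
import math
import torch

def scalable_softpick_weights(q, k, s: torch.Tensor):
    """Scale QK^T by a learned factor s and log(sequence length) before applying
    the softpick sketch defined earlier, per the 'scalable-softpick' description."""
    n = k.size(-2)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return softpick(scores * s * math.log(n), dim=-1)
```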
Future work includes scaling up experiments to larger models (2B and 7B parameters), conducting more detailed analysis on the dynamics of attention sinks and massive activations during training, and finding a solution for the underscoring issue to improve long-context performance.
Overall, the paper presents Softpick as a promising alternative to Softmax attention, demonstrating its ability to eliminate attention sink and massive activations while maintaining core performance, significantly improving quantization robustness, and opening new avenues for sparsity, low-precision training, and interpretability.
Follow-up Questions
- How does Softpick's elimination of massive activations affect the robustness of Transformer models to adversarial inputs or noisy data?
- What are the computational tradeoffs, if any, of using Softpick instead of Softmax in terms of memory usage, training speed, and inference efficiency—especially at large scale?
- How does the increased sparsity in attention maps with Softpick influence the model's ability to capture long-range dependencies compared to Softmax, particularly on tasks requiring global context?
- What are the implications of Softpick's improved quantization performance for deploying language models on edge devices or in resource-constrained environments?
- Find recent papers about alternative attention normalization techniques beyond Softmax and Softpick.
Related Papers
- The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry (2024)
- When Attention Sink Emerges in Language Models: An Empirical View (2024)
- Rethinking Attention: Polynomial Alternatives to Softmax in Transformers (2024)
- Scalable-Softmax Is Superior for Attention (2025)
- Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free (2025)