Resolve underscoring of softpick attention scores in long-context settings

Determine how to modify or augment the softpick attention normalization (which uses ReLU(exp(x_i) − 1) in the numerator and sums |exp(x_j) − 1| over tokens in the denominator) to prevent underscaled attention weights when sequence length is large and the attention pattern is sparse, a regime where many negative-scored tokens in the denominator suppress the few positive attention scores and degrade long-context retrieval performance (e.g., passkey retrieval).
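Written out, the normalization in question is the following (a transcription of the prose description above; the small ε stabilizer in the denominator is an assumption for numerical stability):

```latex
\[
\operatorname{softpick}(x)_i
  = \frac{\operatorname{ReLU}\!\left(e^{x_i} - 1\right)}
         {\varepsilon + \sum_{j} \left| e^{x_j} - 1 \right|}
\]
```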

Background

The paper introduces softpick as a rectified, non–sum-to-one replacement for softmax in transformer attention that eliminates attention sink and massive activations while matching softmax performance at the 340M-parameter scale and improving quantization robustness. Softpick computes attention weights using ReLU(exp(x) − 1) in the numerator and normalizes by the sum of |exp(x) − 1| across tokens, which allows exact zeros and removes the sum-to-one constraint.
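As a concrete reference point, a minimal NumPy sketch of this weighting might look as follows (the ε stabilizer and the reduction over the last axis are illustrative assumptions, not the paper's kernel implementation):

```python
import numpy as np

def softpick(scores, eps=1e-6):
    """Rectified, non-sum-to-one replacement for softmax along the last axis.

    Numerator:   ReLU(exp(x_i) - 1), so non-positive scores get exactly zero weight.
    Denominator: sum_j |exp(x_j) - 1| (plus a small eps), so weights need not sum to one.
    """
    e = np.exp(scores) - 1.0
    numer = np.maximum(e, 0.0)                     # ReLU(exp(x) - 1)
    denom = np.abs(e).sum(axis=-1, keepdims=True)  # sum of |exp(x_j) - 1|
    return numer / (denom + eps)
```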

In long-context scenarios, the authors observe an "underscoring" issue: when most tokens receive negative pre-activation scores, the absolute-value denominator is dominated by those negative contributions, scaling down the weights of the few positive-scored tokens. This attenuates the contribution of the attended value vectors and harms retrieval, as illustrated by passkey retrieval experiments in which softpick falls short of softmax at longer sequence lengths.
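Reusing the softpick sketch above, the suppression can be reproduced with a toy score vector containing one positive score amid many negative ones (illustrative numbers, not the paper's experiments):

```python
# One "relevant" token scored +2.0; all other tokens scored -1.0 (sparse pattern).
for n in (16, 1024, 65536):
    scores = np.full(n, -1.0)
    scores[0] = 2.0
    print(n, softpick(scores)[0])
# The weight on the positive token shrinks as n grows (~0.40 -> ~0.01 -> ~0.0002),
# because each negative token still adds |exp(-1) - 1| ~= 0.63 to the denominator.
```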

The authors explored a Scalable-Softmax–style fix that multiplies the query–key dot product by s·log(n) (with s trainable per head and n the sequence length) before applying softpick, but this did not yield improvements on passkey retrieval and worsened training loss and downstream performance. Consequently, finding an effective remedy for underscoring with long contexts remains unresolved.
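For concreteness, the attempted variant corresponds to rescaling the logits before normalization, roughly as sketched below (reusing softpick and numpy from the sketch above; the function name and the exact placement of the scaling are assumptions for illustration, and the paper reports this did not help):

```python
def softpick_with_length_scaling(qk_scores, s, n):
    # Scalable-Softmax-style rescaling: multiply the query-key logits by s * log(n)
    # (s: trainable per-head scalar, n: sequence length) before applying softpick.
    return softpick(qk_scores * s * np.log(n))
```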

References

"It remains an open problem to solve this issue, and we will continue to look for solutions."

Softpick: No Attention Sink, No Massive Activations with Rectified Softmax (2504.20966 - Zuhri et al., 29 Apr 2025) in Discussion and Implications — Long Context and The Issue of Underscoring