Resolve underscoring of softpick attention scores in long-context settings
Determine how to modify or augment the softpick attention normalization, which computes each attention weight as ReLU(exp(x_i) − 1) divided by the sum over all tokens of |exp(x_j) − 1|, so that attention weights are not underscaled when the sequence is long and the attention pattern is sparse. In that regime, the many negative-scored tokens inflate the denominator and suppress the few positive attention scores, which degrades long-context retrieval performance (e.g., passkey retrieval).
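
The sketch below is a minimal NumPy illustration of this effect, not code from the paper; the score values, the epsilon in the denominator, and the single-relevant-token setup are illustrative assumptions. With one strongly positive score among many mildly negative ones, the softpick denominator grows with sequence length while the numerator of the relevant token stays fixed, so its weight (and the total attention mass) shrinks toward zero, whereas softmax always renormalizes its weights to sum to 1.

```python
import numpy as np

def softpick(x, eps=1e-6):
    # Numerator: ReLU(exp(x_i) - 1); denominator: sum_j |exp(x_j) - 1| + eps (eps assumed).
    e = np.exp(x)
    return np.maximum(e - 1.0, 0.0) / (np.sum(np.abs(e - 1.0)) + eps)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Sparse long-context pattern: one relevant token among many negative-scored tokens.
for n in (128, 4096, 65536):
    scores = np.full(n, -2.0)   # many negative scores (illustrative value)
    scores[0] = 5.0             # the single relevant token (illustrative value)
    sp, sm = softpick(scores), softmax(scores)
    print(f"n={n:6d}  softpick weight on token 0: {sp[0]:.4f} "
          f"(total softpick mass: {sp.sum():.4f})  "
          f"softmax weight on token 0: {sm[0]:.4f}")
```

In this toy setup the softpick weight on the relevant token falls roughly as 1/n, and because the weights no longer sum to 1, the entire attention output is scaled down with it; this is consistent with the long-context degradation (e.g., on passkey retrieval) described above.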
References
It remains an open problem to solve this issue, and we will continue to look for solutions.
— Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
(arXiv:2504.20966, Zuhri et al., 29 Apr 2025), in Discussion and Implications — Long Context and The Issue of Underscoring