AdaSplash: GPU-Efficient Adaptive Sparse Attention
- AdaSplash is a family of GPU-efficient adaptive sparse attention algorithms that employs the α-entmax transformation to enhance scalability and accuracy in transformers.
- It integrates hardware-tailored kernels, specialized root-finding solvers, and bitpacked block masking to achieve higher throughput than prior α-entmax implementations.
- AdaSplash-2 introduces a histogram-based initialization scheme, effectively addressing both algorithmic and system challenges posed by adaptive, input-dependent sparsity.
AdaSplash is a family of GPU-efficient adaptive sparse attention algorithms for transformers, built around high-performance implementations of the α-entmax family of attention mechanisms. The methods address both the algorithmic and the systems challenges posed by adaptive, input-dependent sparsity in attention, and improve on prior α-entmax implementations in efficiency, scale, and integration with end-to-end transformer training. The approach combines specialized root-finding solvers, hardware-tailored kernels, bitpacked block masking, and, in AdaSplash-2, a histogram-based initialization scheme that accelerates the computation of the entmax normalizer. AdaSplash methods achieve throughput competitive with or superior to FlashAttention-2 in moderate-to-high sparsity regimes while remaining on par in accuracy with softmax baselines on both short- and long-context benchmarks (Gonçalves et al., 17 Feb 2025; Gonçalves et al., 16 Apr 2026).
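To make the "root-finding for the entmax normalizer" idea concrete, the following is a minimal CPU sketch in plain Python. It assumes the standard α-entmax form from Peters et al. (2019), where each weight is `max((α − 1)·z_i − τ, 0)^(1/(α − 1))` and the threshold τ is chosen so the weights sum to one. It uses ordinary bisection as a stand-in: it is not the AdaSplash solver, and it omits the GPU kernels, block masking, and the histogram-based initialization of AdaSplash-2. The function name `entmax_bisect` and its parameters are hypothetical, chosen for illustration only.

```python
def entmax_bisect(scores, alpha=1.5, n_iter=50):
    """Illustrative alpha-entmax via plain bisection (assumes alpha > 1)."""
    if alpha <= 1.0:
        raise ValueError("this sketch assumes alpha > 1")

    # Work with scaled scores x_i = (alpha - 1) * z_i.
    x = [(alpha - 1.0) * s for s in scores]
    x_max = max(x)
    power = 1.0 / (alpha - 1.0)

    def total(tau):
        # Sum of unnormalized weights for a candidate threshold tau.
        return sum(max(xi - tau, 0.0) ** power for xi in x)

    # The threshold is bracketed by [x_max - 1, x_max]:
    # at tau = x_max - 1 the largest entry alone contributes 1 (sum >= 1),
    # at tau = x_max every entry is clipped to 0 (sum = 0).
    lo, hi = x_max - 1.0, x_max
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if total(mid) >= 1.0:
            lo = mid  # sum too large: raise the threshold
        else:
            hi = mid  # sum too small: lower the threshold

    tau = 0.5 * (lo + hi)
    p = [max(xi - tau, 0.0) ** power for xi in x]
    norm = sum(p)
    # Renormalize to absorb the remaining bisection error.
    return [pi / norm for pi in p]
```

For example, `entmax_bisect([0.5, 0.4, -2.0], alpha=2.0)` returns approximately `[0.55, 0.45, 0.0]`: the low-scoring entry receives exactly zero weight, which is the input-dependent sparsity that block-sparse kernels can then exploit. Each bisection step halves the bracket, so a solver that starts from a tighter initial guess needs fewer passes over the scores.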
1. Mathematical Foundations: α-entmax and Sparse Attention
The α-entmax transformation is a parametric family of differentiable, input-adaptive sparse alternatives to softmax [Peters et al. 2019]. For a score vector , the softmax attention weights are given by
$$
\