AdaSplash-2: Hardware-Aware Sparse Attention

Updated 23 April 2026

The paper introduces AdaSplash-2, a hardware-aware implementation of differentiable sparse attention using the α-entmax transformation to eliminate quadratic bottlenecks.
It employs a novel histogram-based initialization for rapid normalization, reducing root-finding iterations and enhancing on-chip efficiency.
The sparsity-aware GPU pipeline exploits block sparsity to achieve up to 2× faster performance than FlashAttention-2 in long-context, high-sparsity scenarios.

AdaSplash-2 is a hardware-aware implementation of differentiable sparse attention based on the $\alpha$-entmax transformation, targeting the elimination of the quadratic computational bottleneck in long-context transformer models. By introducing a novel histogram-based initialization for the entmax normalization root and a GPU kernel that efficiently exploits block sparsity, AdaSplash-2 achieves competitive or superior runtimes compared to FlashAttention-2 in settings where attention is highly sparse. The method is evaluated both in synthetic benchmarks and in large-scale language modeling tasks, where it not only matches softmax-based baselines on short contexts but also realizes significant gains as input lengths and sparsity increase (Gonçalves et al., 16 Apr 2026).

1. $\alpha$-entmax Attention and Motivation

Standard softmax-based attention, defined for scores $s\in\mathbb{R}^n$ by

$\text{softmax}(s) = \exp(s-\tau\mathbf{1}), \qquad \tau = \log\sum_j \exp(s_j),$

assigns nonzero mass to all tokens, inducing $O(n^2)$ work per layer and encouraging distributed, often diffuse, attention which can impede learning in long-context settings.

$\alpha$-entmax attention [Peters et al. 2019] generalizes softmax and sparsemax by allowing a tunable entropic regularization through the Tsallis entropy $H_\alpha$:

$\alpha\text{-entmax}(s) := \arg\max_{p\in\Delta_n}~p^\top s + H_\alpha(p), \qquad \Delta_n = \{p \geq 0,\ \mathbf{1}^\top p = 1\},$

leading to the closed-form solution

$\mathrm{entmax}_\alpha(s) = \left[(\alpha-1)s - \tau\mathbf{1}\right]_+^{1/(\alpha-1)},$

where $[x]_+ = \max(0, x)$ and the result must satisfy $\sum_i \mathrm{entmax}_\alpha(s)_i = 1$. The normalizer $\tau$ is found by solving for the root of

$f(\tau) = \sum_i \left[(\alpha-1)s_i - \tau\right]_+^{1/(\alpha-1)} - 1 = 0.$

A key property is input-dependent sparsity: for each $\alpha > 1$, entmax assigns exact zeros wherever $(\alpha-1)s_i \le \tau$, generating probability vectors with adaptive support. This behavior allows attention computation and memory usage to scale with the true, contextual support size rather than the full $n \times n$ space, addressing both computational and representational inefficiencies in long-context transformers.
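The adaptive support can be seen on a small example. The following is a minimal reference sketch in NumPy, assuming a plain bisection on $\tau$ (not the AdaSplash-2 kernel); the function name `entmax_bisect` and the sample scores are illustrative choices, not taken from the paper.

```python
import numpy as np

def entmax_bisect(s, alpha=1.5, n_iter=60):
    """Reference alpha-entmax (alpha > 1) for a 1-D score vector, via bisection on tau."""
    z = (alpha - 1.0) * np.asarray(s, dtype=np.float64)
    expo = 1.0 / (alpha - 1.0)
    # The root lies in [max(z) - 1, max(z)]: at the lower end the largest
    # entry alone already carries mass 1, at the upper end the mass is 0.
    lo, hi = z.max() - 1.0, z.max()
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        mass = np.sum(np.clip(z - tau, 0.0, None) ** expo)
        if mass >= 1.0:
            lo = tau  # too much mass: tau must increase
        else:
            hi = tau
    tau = 0.5 * (lo + hi)
    p = np.clip(z - tau, 0.0, None) ** expo
    return p / p.sum(), tau

scores = np.array([2.0, 1.5, 0.1, -1.0, -3.0])
p, tau = entmax_bisect(scores, alpha=1.5)
print(p)    # entries with (alpha - 1) * s_i <= tau are exactly zero
print(tau)
```

For these scores only the two largest entries stay in the support; the remaining three receive exactly zero probability, which is the property a sparse kernel can exploit.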

2. Histogram-Based Normalizer Initialization

A practical challenge for $\alpha$-entmax layers is the efficient computation of the normalizer $\tau$ per row. Traditional root-finding methods such as bisection are robust but converge slowly, whereas Halley or Newton methods are fast but require a good starting point.

AdaSplash-2 introduces a hardware-friendly histogram-based initialization that stores a binned summary of transformed scores in on-chip SRAM. The method comprises:

Centering the scores of each row by the row maximum and rescaling them, so that the transformed scores $z_i$ (which absorb the factor $\alpha-1$) and the normalizer $\tau$ both fall within a bounded interval.
Discretizing that interval into $B$ bins of equal width $\Delta$ and assigning each $z_i$ to its appropriate bin.
Constructing a histogram $h$ where $h_b$ counts the number of $z_i$ falling into bin $b$.
Approximating the normalizer by replacing each $z_i$ with its bin's left edge $\ell_b$ in the normalizer equation, yielding a reduced monotone root-finding problem:

$\sum_{b=1}^{B} h_b \left[\ell_b - \tau\right]_+^{1/(\alpha-1)} = 1.$

Because every left edge lies at most one bin width below the score it replaces, the root $\hat\tau$ of this reduced problem is a lower bound within $\Delta$ of the exact normalizer $\tau^\star$: $\tau^\star - \Delta \le \hat\tau \le \tau^\star$.
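The steps above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the paper's kernel: scores are centered as `z = (alpha - 1) * (s - max(s))`, so that the normalizer lies in $[-1, 0)$, the bins cover $[-1, 0]$, and the reduced problem is solved by bisection. The names `histogram_tau_init` and `_root` are hypothetical.

```python
import numpy as np

def _root(values, weights, expo, lo, hi, n_iter=80):
    """Bisection for sum_k weights[k] * max(values[k] - tau, 0)**expo = 1."""
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        mass = np.sum(weights * np.clip(values - tau, 0.0, None) ** expo)
        lo, hi = (tau, hi) if mass >= 1.0 else (lo, tau)
    return 0.5 * (lo + hi)

def histogram_tau_init(s, alpha=1.5, n_bins=64):
    """Return (histogram estimate, exact normalizer, bin width) for one row."""
    s = np.asarray(s, dtype=np.float64)
    expo = 1.0 / (alpha - 1.0)
    # Assumed centering: z <= 0, and the exact normalizer lies in [-1, 0).
    z = (alpha - 1.0) * (s - s.max())
    width = 1.0 / n_bins
    keep = z > -1.0  # entries with z <= -1 can never be in the support
    idx = np.minimum(((z[keep] + 1.0) / width).astype(int), n_bins - 1)
    hist = np.bincount(idx, minlength=n_bins)       # counts per bin
    left = -1.0 + width * np.arange(n_bins)         # left edge of each bin
    # Reduced problem: every score is replaced by the left edge of its bin.
    tau_hat = _root(left, hist, expo, -1.0 - width, 0.0)
    # Exact root on the full scores, for comparison only.
    tau_exact = _root(z, np.ones_like(z), expo, -1.0, 0.0)
    return tau_hat, tau_exact, width

rng = np.random.default_rng(0)
s = rng.normal(size=1024) * 3.0
tau_hat, tau_exact, width = histogram_tau_init(s)
print(tau_hat, tau_exact, tau_exact - tau_hat, width)
```

The reduced problem only touches `n_bins` values instead of all `n` scores, and the printed gap between the two roots stays within one bin width, in line with the bound stated above.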

A single safeguarded hybrid root-finding step (Halley, Newton, or secant updates selected by case, with a fallback to bisection if needed) refines the histogram estimate to the true root, typically converging within 1–2 passes over the data. The histogram method requires only an amount of fast on-chip memory proportional to the number of bins and substantially accelerates normalization compared to standard techniques.
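As a simplified stand-in for the hybrid step, the sketch below refines a starting estimate with Newton updates and falls back to bisection whenever an update leaves the current bracket. It does not reproduce the paper's rule for choosing between Halley, Newton, and secant updates; the function name `refine_tau`, the bracket $[-1, 0]$, and the centering of the scores are assumptions made for this example.

```python
import numpy as np

def refine_tau(z, tau0, alpha=1.5, lo=-1.0, hi=0.0, tol=1e-12, max_iter=50):
    """Safeguarded Newton refinement of the entmax normalizer.

    z holds centered scores (alpha - 1) * (s - max(s)); returns (tau, steps).
    """
    expo = 1.0 / (alpha - 1.0)
    tau = min(max(tau0, lo), hi)
    for step in range(max_iter):
        d = z - tau
        active = d[d > 0.0]                      # current support
        f = np.sum(active ** expo) - 1.0
        if abs(f) < tol:
            return tau, step
        # f is decreasing in tau, so its sign tells which side the root is on.
        if f > 0.0:
            lo = tau
        else:
            hi = tau
        fprime = -expo * np.sum(active ** (expo - 1.0))
        newton = tau - f / fprime if fprime < 0.0 else np.nan
        # Accept the Newton update only inside the bracket, else bisect.
        tau = newton if lo < newton < hi else 0.5 * (lo + hi)
    return tau, max_iter

rng = np.random.default_rng(0)
alpha = 1.5
s = rng.normal(size=512) * 3.0
z = (alpha - 1.0) * (s - s.max())
tau, n_steps = refine_tau(z, tau0=-1.0, alpha=alpha)
mass = np.sum(np.clip(z - tau, 0.0, None) ** (1.0 / (alpha - 1.0)))
print(tau, n_steps, mass)
```

Here the starting point is the loosest valid lower bound, `tau0=-1.0`; a tighter starting point, such as a histogram estimate, leaves fewer updates to perform.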

3. Sparsity-Aware GPU Pipeline

AdaSplash-2 is implemented as a Triton GPU kernel organized into four key phases, executed for each block of queries against the blocks of keys:

Row Maximum Computation: Compute the row-wise maximum score for each query block.
Histogram Construction: For each tile formed by a query block and a key block, scale the score tile to the histogram range, compute bin indices, and build bit-packed local histograms in SRAM.
$\tau$ Refinement and Block Masking: Solve for the initial estimate of $\tau$ using special-case or general histogram solvers; refine it to the final $\tau$ with a hybrid step; simultaneously, build a bit-packed mask per query block, indicating which key blocks contain nonzero attention.
Sparse MatMul: Using the mask, load only nonzero key and value blocks; native GPU population-count instructions enable efficient traversal, accumulating the attention output over the nonzero attention blocks only.
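The masking and traversal idea in the last two phases can be illustrated on the CPU. The sketch below assumes the sparse attention probabilities `P` are already available and uses one Python integer per query block as the bit-packed mask; the function name `block_sparse_av`, the block size, and the test matrices are hypothetical, and the real kernel works on tiles in on-chip memory rather than on a materialized `P`.

```python
import numpy as np

def block_sparse_av(P, V, block=4):
    """Accumulate P @ V while visiting only tiles that contain nonzero attention."""
    n = P.shape[0]
    n_blocks = n // block
    out = np.zeros((n, V.shape[1]))
    masks, visited = [], 0
    for i in range(n_blocks):
        rows = slice(i * block, (i + 1) * block)
        # Bit j of the mask is set when tile (i, j) holds any nonzero probability.
        mask = 0
        for j in range(n_blocks):
            if np.any(P[rows, j * block:(j + 1) * block] > 0.0):
                mask |= 1 << j
        masks.append(mask)
        # Walk the set bits only, lowest first.
        m = mask
        while m:
            j = (m & -m).bit_length() - 1
            cols = slice(j * block, (j + 1) * block)
            out[rows] += P[rows, cols] @ V[cols]
            visited += 1
            m &= m - 1  # clear the lowest set bit
    return out, masks, visited

rng = np.random.default_rng(0)
n, d = 16, 8
P = rng.random((n, n))
P[P < 0.8] = 0.0                      # make the matrix sparse
P[np.arange(n), np.arange(n)] = 1.0   # keep every row non-empty
P /= P.sum(axis=1, keepdims=True)
V = rng.normal(size=(n, d))
out, masks, visited = block_sparse_av(P, V)
print(visited, "of", (n // 4) ** 2, "tiles visited")
print(np.allclose(out, P @ V))
```

The number of visited tiles equals the total population count of the masks, so the matmul cost follows the number of nonzero blocks rather than the full grid.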

The computational cost scales with the fraction of nonzero blocks: in the worst case it is $O(n^2)$ (as for dense attention), but the actual work is proportional to $(1-\rho)\,n^2$, where $\rho$ is the block sparsity. Histogram and tile management add a further overhead term that is negligible by comparison in the intended regime. At high sparsity, especially for long contexts, backward passes are up to $2\times$ faster than FlashAttention-2.

4. Empirical Results

AdaSplash-2 was evaluated on NVIDIA A6000 and H100 GPUs using Triton-based kernels, with baselines including CUDA/Triton FlashAttention-2 ("FA2"). Synthetic and language modeling experiments were conducted:

Root-Finder Evaluation: On sampled score vectors, histogram initialization drastically reduces the normalizer error after only one iteration.
Sparsity-Sensitivity: For causal attention, AdaSplash-2 outperforms FA2 once block sparsity is sufficiently high, and the speedup is larger at higher sparsity levels.
Context Scaling: Using block sparsity patterns extracted from a 1B $\alpha$-entmax-NAPE LM, backward speedups emerge even at comparatively short contexts, and the full step time becomes faster than FA2 beyond a longer context length.
Large-Scale Language Modeling: LLaMA-3 models (350M, 1B) were trained on DCLM-Edu tokens in bf16 precision. At short context, entmax+NAPE obtains the best average scores for the 350M model: 48.1 vs. 47.3 (softmax+RoPE) and 47.1 (softmax+NAPE). For the 1B model, it reaches a perplexity of 11.42 vs. 11.97 (softmax+NAPE) and an average accuracy of 53.1 vs. 53.0. On long-context tasks (RULER at up to 32K), entmax+NAPE outperforms the softmax variants by 2 to 6 points on average, and by 2.2 points on average at 32K for HELMET ICL.

5. Limitations, Trade-offs, and Practical Considerations

While AdaSplash-2 achieves significant speedups for backward propagation in high-sparsity regimes, its forward pass is slower than FA2 for dense attention due to histogram management overhead. However, this gap narrows as block sparsity increases above 30%. Notably, while $\alpha$-entmax enables dynamic, differentiable sparsity, current kernels still require scanning all keys at inference time; highly efficient inference kernels remain an open engineering challenge.

The histogram initialization scheme requires that the histogram for each row fit in SRAM, which becomes a constraint for extremely long sequences. To address this, AdaSplash-2 incorporates an overflow handling scheme for periodic histogram flushing. The hybrid solver’s refinement still necessitates a secondary pass over scores, although this could potentially be fused with the sparse matmul to improve efficiency.

6. Scenarios of Maximal Benefit and Future Directions

AdaSplash-2 is particularly advantageous in:

Long-context transformer training where sparsity emerges organically (e.g., document-level QA, generative modeling at scale).
Context lengths of 8K–32K, where block sparsity greater than 60% is commonly observed early in training.
Tasks where static patterns or top-$k$ sparsity baselines are surpassed by differentiable, dynamic sparsity.

Future research and engineering directions include: (i) fused inference kernels aligning entmax computation with key-value retrieval, (ii) mixed-precision and hardware-specific optimizations (e.g., NVIDIA Hopper TMA/TMAMMA), (iii) adapting the $\alpha$ parameter per head or per layer, and (iv) extending techniques to encoder–decoder and cross-attention modules.

By integrating rapid, provably bounded initialization and carefully tuned GPU kernels, AdaSplash-2 delivers expressive differentiable sparse attention for large-scale models, matching or exceeding FlashAttention-2 speed in moderate and high sparsity settings and generalizing well on both short and long-context tasks (Gonçalves et al., 16 Apr 2026).


References (1)

AdaSplash-2: Faster Differentiable Sparse Attention (2026)
