Entmax Attention: Sparse, Scalable Mapping

Updated 23 April 2026

Entmax attention is a parametric family that transforms logits into sparse probability distributions using Tsallis entropy regularization.
It replaces dense softmax by employing adaptive sparsity that improves interpretability and generalization in attention mechanisms.
It integrates into transformer models efficiently, leveraging modern GPU techniques to boost performance on long-context and variable-size tasks.

Entmax attention is a parametric family of sparse, differentiable transformations that map vector-valued scores (logits) to probability distributions. It generalizes softmax and sparsemax, enabling learned or adaptive sparsity in attention weights. Compared with exclusively dense mechanisms, this offers better interpretability, improved generalization (particularly for long contexts or variable-size tasks), and a computational cost that can decrease as sparsity increases. The most studied instantiations are α-entmax with α ∈ [1, 2]: the limit α→1 recovers softmax, α=2 yields sparsemax, and intermediate values 1<α<2 interpolate between the two regimes (Peters et al., 2019).

1. Mathematical Formulation and Key Properties

Given a score vector $z\in\mathbb{R}^d$ and a parameter α>1, the α-entmax mapping is formally defined as the solution to a Tsallis-entropy-regularized maximization: $\mathrm{entmax}_\alpha(z) := \arg\max_{p\in\Delta^d}\ p^\top z + H_\alpha(p)$, where the simplex is $\Delta^d = \{p\in\mathbb{R}_+^d : \sum_i p_i = 1\}$ and the Tsallis α-entropy is

$H_\alpha(p) = \frac{1}{\alpha(\alpha-1)} \sum_{i=1}^d (p_i - p_i^\alpha), \,\, \alpha > 1$

This yields a solution with the componentwise thresholded form $[\mathrm{entmax}_\alpha(z)]_i = [(\alpha-1)z_i - \tau(z)]_+^{1/(\alpha-1)}$, where $[x]_+ = \max(x, 0)$ and τ(z) is a data-dependent threshold chosen so that the output is normalized ($\sum_{i=1}^d p_i = 1$). For general α, τ(z) has no explicit formula and must be computed numerically (Peters et al., 2019, Correia et al., 2019).
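As a small worked instance of the componentwise solution above (the numbers are illustrative and not taken from the cited papers), take α=2 and z = (1, 0.5, −1). The exponent 1/(α−1) equals 1, so normalization requires

$[1-\tau]_+ + [0.5-\tau]_+ + [-1-\tau]_+ = 1$

Assuming the support consists of the first two coordinates gives $1.5 - 2\tau = 1$, i.e. $\tau = 0.25$, and therefore

$\mathrm{entmax}_2(z) = (0.75,\ 0.25,\ 0)$

The assumption is self-consistent because $-1 - \tau < 0$, so the third coordinate receives exactly zero mass, whereas softmax would assign it a small positive probability.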

Key properties include:

Sparsity: For α>1, entmax assigns exact zeros to low-scoring coordinates, with support size adaptively determined by z and α. Softmax (α=1) is always dense; sparsemax (α=2) is the Euclidean simplex projection; 1.5-entmax is a widely used intermediary (Peters et al., 2019).
Differentiability: The mapping is differentiable except on kinks where entries touch zero. The Jacobian for α-entmax is

$\frac{\partial p}{\partial z} = \mathrm{diag}(s) - \frac{ss^\top}{\sum_i s_i}, \quad s_i = \begin{cases} p_i^{2-\alpha}, & p_i > 0 \\ 0, & p_i = 0 \end{cases}$

for $p = \mathrm{entmax}_\alpha(z)$ (Peters et al., 2019).

Parameter control: Larger α induces sparser solutions (more zero entries); α→∞ tends to the argmax.
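The Jacobian formula above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `entmax_jacobian` is hypothetical, and it assumes `p` is already a valid α-entmax output, with $s_i$ taken to be zero for coordinates outside the support.

```python
def entmax_jacobian(p, alpha):
    """Dense Jacobian dp/dz of alpha-entmax, given its output p (a list of floats).

    Assumes p = entmax_alpha(z) for some z; coordinates with p_i == 0 get s_i = 0,
    so their rows and columns in the Jacobian are exactly zero.
    """
    s = [pi ** (2.0 - alpha) if pi > 0.0 else 0.0 for pi in p]
    total = sum(s)
    d = len(p)
    return [
        [(s[i] if i == j else 0.0) - s[i] * s[j] / total for j in range(d)]
        for i in range(d)
    ]


# Illustrative sparsemax output (alpha = 2) with one zero entry.
J = entmax_jacobian([0.75, 0.25, 0.0], alpha=2.0)
for row in J:
    print(row)
```

Only the k×k block indexed by the support is nonzero, which is why the backward cost scales with the support size rather than with d.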

2. Algorithmic and Computational Aspects

Calculating the entmax mapping involves identifying the threshold τ(z) such that the output sums to 1. For α=2 (sparsemax), this reduces to a projection onto the simplex via sorting; for general α>1, it requires finding a root of a piecewise-smooth function.
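The root-finding step can be illustrated with a plain bisection on τ. The sketch below is a simple pure-Python version under stated assumptions: the name `entmax_bisect`, the fixed iteration count, and the final renormalization are choices made here for illustration, not details taken from the cited papers. It relies on the fact that the sum of thresholded terms decreases monotonically in τ and that the root lies between max((α−1)z) − 1 and max((α−1)z).

```python
def entmax_bisect(z, alpha=1.5, n_iter=60):
    """alpha-entmax of a list of floats via bisection on the threshold tau (alpha > 1)."""
    if alpha <= 1.0:
        raise ValueError("alpha must be > 1; use softmax for alpha = 1")
    zs = [(alpha - 1.0) * v for v in z]
    expo = 1.0 / (alpha - 1.0)

    def mass(tau):
        # Total (unnormalized) probability mass for a candidate threshold.
        return sum(max(v - tau, 0.0) ** expo for v in zs)

    # mass(lo) >= 1 and mass(hi) = 0, so the root is bracketed.
    lo, hi = max(zs) - 1.0, max(zs)
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        if mass(tau) >= 1.0:
            lo = tau
        else:
            hi = tau
    tau = 0.5 * (lo + hi)
    p = [max(v - tau, 0.0) ** expo for v in zs]
    total = sum(p)
    return [v / total for v in p]  # absorb the remaining bisection error


print(entmax_bisect([1.0, 0.5, -1.0], alpha=2.0))
print(entmax_bisect([2.0, 1.0, -3.0], alpha=1.5))
```

Each iteration costs O(d), so the total cost is O(d) times the number of iterations needed for the desired precision.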

Sorting-based Algorithm: Sort z in descending order and search for the support size k such that a corresponding τ is valid (Peters et al., 2019, Bdeir et al., 2022).
Bisection or Newton/Halley Iteration: Solve for τ via root-finding. Bisection is robust but can be slow; Halley’s method (third-order) accelerates convergence (Gonçalves et al., 17 Feb 2025).
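For the special case α=2, the sorting-based approach listed above can be sketched as follows. This is the standard sort-and-scan simplex projection written in plain Python; the function name `sparsemax_sort` is chosen here for illustration.

```python
def sparsemax_sort(z):
    """Sparsemax (alpha = 2) of a list of floats via sorting.

    Scans the scores in descending order and keeps the largest support size k
    for which the k-th score still exceeds the threshold implied by the top k.
    """
    zs = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, v in enumerate(zs, start=1):
        cumsum += v
        candidate = (cumsum - 1.0) / j  # threshold if the support were the top j
        if v > candidate:               # j-th coordinate stays in the support
            tau = candidate
    return [max(v - tau, 0.0) for v in z]


print(sparsemax_sort([1.0, 0.5, -1.0]))
```

The sort dominates the cost at O(d log d); the scan itself is linear.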

Complexity:

| Operation | Softmax | Sparsemax / Entmax |
|---|---|---|
| Forward pass | O(d) | O(d log d) (bisection/sort); O(d) for α-ReLU (Tezekbayev et al., 2021) |
| Backward (Jacobian) | O(d²) | O(k²), k = support size |
| GPU-optimized variants | Highly optimized, e.g., FlashAttention-2 | AdaSplash-2 matches or exceeds softmax in high-sparsity regimes (Gonçalves et al., 16 Apr 2026) |

Recent accelerations, such as AdaSplash and AdaSplash-2, introduce histogram-based initialization and bitpacked block masks to bring entmax’s step time and memory usage in line with FlashAttention-2, especially at moderate/high sparsity or long context (Gonçalves et al., 17 Feb 2025, Gonçalves et al., 16 Apr 2026). α-ReLU replaces τ(z) with a fixed offset for O(d) complexity, trading adaptive normalization for fixed sparsity (Tezekbayev et al., 2021).
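The trade-off made by a fixed offset can be seen in a short sketch. The function below is an assumption-laden illustration, not the formulation from Tezekbayev et al.: it simply reuses the entmax thresholded form with a constant `tau` supplied by the caller, and the name `alpha_relu` and its default values are chosen here.

```python
def alpha_relu(z, alpha=1.5, tau=0.0):
    """Thresholded power activation with a FIXED offset tau (no root-finding).

    Runs in O(d), but the outputs are not guaranteed to sum to 1.
    """
    expo = 1.0 / (alpha - 1.0)
    return [max((alpha - 1.0) * v - tau, 0.0) ** expo for v in z]


out = alpha_relu([2.0, 1.0, -3.0], alpha=1.5, tau=0.0)
print(out, sum(out))
```

Low scores are still zeroed out, but the total mass depends on the input, which is the loss of adaptive normalization described above.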

3. Interpretations: Kernel-theoretic and Statistical Viewpoints

Entmax admits both an optimization-based and a kernel-regression interpretation:

Optimization View: Entmax solves a Fenchel–Young maximization balancing linear scoring with Tsallis-α regularization, leading to sparsity as an emergent property (Peters et al., 2019, Correia et al., 2019).
Kernel Regression View: Entmax attention can be interpreted as Nadaraya–Watson regression using rectified polynomial kernels. Specifically, α-entmax with α=1+1/r corresponds to the r-th order polynomial kernel, e.g. Epanechnikov (r=1, α=2), biweight (r=2, α=1.5), and triweight (r=3, α=4/3), while softmax recovers the Gaussian kernel as r→∞ (α→1) (Santos et al., 30 Jan 2026). Because these kernels have compact support, entmax attention can focus on meaningful, fixed-size patterns and avoid dispersing mass over irrelevant positions.

The kernel-theoretic perspective provides a principled alternative to heuristic top-k truncation and clarifies the relationship between regularization-induced sparsity and the geometry of attention (Santos et al., 30 Jan 2026).
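The correspondence can be made explicit with a short derivation. It is a sketch under the assumption, made here for illustration, that scores are negative squared distances, $z_i = -d_i^2$ with $d_i = \|q - k_i\|$; the cited work may use a different parameterization. Setting α = 1 + 1/r in the componentwise solution from Section 1 gives exponent r:

$p_i = \left[\tfrac{1}{r} z_i - \tau\right]_+^{r} = \left[-\tau - \tfrac{d_i^2}{r}\right]_+^{r} = (-\tau)^r \left[1 - \tfrac{d_i^2}{h^2}\right]_+^{r}, \quad h^2 = -r\,\tau$

Here τ<0 because all scores are non-positive, so h acts as a data-dependent bandwidth. Since the weights sum to one, the constant cancels:

$p_i = \frac{K_r(d_i/h)}{\sum_j K_r(d_j/h)}, \quad K_r(u) = [1 - u^2]_+^{r}$

These are Nadaraya–Watson weights with a rectified polynomial kernel: r=1 gives the Epanechnikov shape, r=2 the biweight, and r=3 the triweight. Keys farther than h from the query receive exactly zero weight.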

4. Integration in Transformer and Sequence Models

Entmax is a drop-in replacement for softmax in multi-head scaled dot-product attention. The only change is to normalize each query’s score vector with entmax_α rather than softmax.

Plugging Entmax: Replace $\text{probs} = \mathrm{softmax}(\mathrm{attn\_scores})$ with $\text{probs} = \mathrm{entmax}_\alpha(\mathrm{attn\_scores})$ (Peters et al., 2019, Correia et al., 2019).
Gradient Propagation: No additional modifications are required; the chain rule naturally incorporates entmax’s sparse, block-diagonal Jacobian (Peters et al., 2019, Martins et al., 2020).
Adaptive α: α can be learned per head using a parameterized sigmoid function, facilitating heterogeneous head sparsity and task-specific specialization (Correia et al., 2019).
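The substitution described above can be sketched for a single attention head in plain Python. Everything here is illustrative: `entmax_bisect`, `head_alpha`, and `entmax_attention` are hypothetical names, the bisection solver is one simple way to compute entmax, and `head_alpha` assumes the per-head parameterization α = 1 + sigmoid(a), which keeps α inside (1, 2).

```python
import math


def entmax_bisect(z, alpha, n_iter=60):
    """alpha-entmax via bisection on the threshold (alpha > 1)."""
    zs = [(alpha - 1.0) * v for v in z]
    expo = 1.0 / (alpha - 1.0)
    lo, hi = max(zs) - 1.0, max(zs)
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        if sum(max(v - tau, 0.0) ** expo for v in zs) >= 1.0:
            lo = tau
        else:
            hi = tau
    tau = 0.5 * (lo + hi)
    p = [max(v - tau, 0.0) ** expo for v in zs]
    total = sum(p)
    return [v / total for v in p]


def head_alpha(a):
    """Map an unconstrained scalar a (learned per head) to alpha in (1, 2)."""
    return 1.0 + 1.0 / (1.0 + math.exp(-a))


def entmax_attention(Q, K, V, alpha):
    """Scaled dot-product attention with entmax replacing softmax.

    Q, K, V are lists of row vectors; returns one output row per query.
    """
    scale = math.sqrt(len(K[0]))
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        probs = entmax_bisect(scores, alpha)  # the only change vs. softmax
        out.append([sum(p * v[c] for p, v in zip(probs, V)) for c in range(len(V[0]))])
    return out


Q = [[1.0, 0.0]]
K = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(entmax_attention(Q, K, V, alpha=head_alpha(0.0)))
```

With these toy inputs the first key dominates strongly enough that the other two receive exactly zero weight, so the output equals the first value row.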

Empirical results show that replacing softmax with α-entmax yields sparser alignments, higher head diversity, improved interpretability, and more pronounced specialization for different attention heads, without degrading task-level accuracy (e.g., BLEU scores in machine translation). Adaptive α allows each head to interpolate between dense and sparse regimes as needed (Correia et al., 2019, Vasylenko et al., 19 Jun 2025).

5. Empirical Results and Applications

Entmax attention has demonstrated benefits across a range of neural sequence learning and reasoning tasks:

Sequence-to-sequence models: Improved accuracy in both high- and medium-resource settings for morphological inflection and translation (gains of roughly 1 BLEU, with much sparser outputs than softmax) (Peters et al., 2019).
Transformer-based architectures: Substantially improved generalization on long-context reasoning, reasoning with fixed-size patterns, and variable-length problems—mitigating representational collapse observed with softmax-based dense attention (Vasylenko et al., 19 Jun 2025).
Vision, audio, and continuous-domain tasks: Entmax generalizes to continuous attention densities, yielding compactly supported, interpretable focus regions in CNN grids and audio sequences (Martins et al., 2020, Martins et al., 2021).
Combinatorial optimization: Sparse attention via α-entmax improves generalization in vehicle routing, especially for mixed-size or distribution-shifted inputs (Bdeir et al., 2022).
Efficiency and scalability: Efficient entmax kernels (AdaSplash, AdaSplash-2) match or surpass the throughput of FlashAttention-2 at moderate and high sparsity, enabling long-context settings without loss of accuracy (Gonçalves et al., 17 Feb 2025, Gonçalves et al., 16 Apr 2026).
Sparsity prediction: Methods such as Sparsefinder allow predicting entmax’s support mask, conferring further efficiency advantages (Treviso et al., 2021).

6. Practical Implementations and Trade-Offs

Entmax’s utility depends on both modeling and computational trade-offs:

Hyperparameters: α≈1.5 offers a strong tradeoff between sparsity and smoothness; adaptive α per head yields maximal head diversity and task specialization (Correia et al., 2019, Peters et al., 2019). For maximal speed, α-ReLU enables O(d) implementations with fixed τ (Tezekbayev et al., 2021).
Loss functions: The Fenchel–Young loss paired with entmax yields stable gradients and is reported to be important for best performance (Peters et al., 2019, Tezekbayev et al., 2021).
Integration: Entmax can be introduced mid-training (e.g., converting a softmax-pretrained model to entmax with 2–5B tokens of continual pretraining), with no degradation or even improvements in downstream task performance (Gonçalves et al., 16 Apr 2026).
Attention mask efficiency: With block-mask strategies, the computational cost is proportional to the number of nonzero entries, yielding substantial runtime/memory improvements at >60% sparsity (Gonçalves et al., 17 Feb 2025, Gonçalves et al., 16 Apr 2026).
Interpretability: Entmax produces exact zeros, enabling exact certificates of optimality in beam search and explainable attention profiles (Peters et al., 2019).
Limitations: Computing the normalization τ remains the core computational bottleneck. While mitigated by modern GPU kernels (histogram-based, Halley-accelerated, block-masked), dense contexts or very low sparsity offer less speedup over softmax, and certain model/hardware couplings are not yet fully exploited (Gonçalves et al., 17 Feb 2025, Gonçalves et al., 16 Apr 2026). Outputs do not maintain strict probabilistic interpretation when using fixed τ (α-ReLU) in intermediate layers (Tezekbayev et al., 2021).
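The Fenchel–Young loss mentioned in the list above can be sketched for a one-hot target. For a target class y it takes the form $L_\alpha(z; y) = p^\top z - z_y + H_\alpha(p)$ with $p = \mathrm{entmax}_\alpha(z)$, and its gradient with respect to z is $p - e_y$. The code is a minimal pure-Python illustration; the names `entmax_bisect`, `tsallis_entropy`, and `entmax_loss` are chosen here, and the bisection solver is just one convenient way to obtain p.

```python
def entmax_bisect(z, alpha, n_iter=60):
    """alpha-entmax via bisection on the threshold (alpha > 1)."""
    zs = [(alpha - 1.0) * v for v in z]
    expo = 1.0 / (alpha - 1.0)
    lo, hi = max(zs) - 1.0, max(zs)
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        if sum(max(v - tau, 0.0) ** expo for v in zs) >= 1.0:
            lo = tau
        else:
            hi = tau
    tau = 0.5 * (lo + hi)
    p = [max(v - tau, 0.0) ** expo for v in zs]
    total = sum(p)
    return [v / total for v in p]


def tsallis_entropy(p, alpha):
    """Tsallis alpha-entropy H_alpha(p) as defined in Section 1."""
    return sum(pi - pi ** alpha for pi in p) / (alpha * (alpha - 1.0))


def entmax_loss(z, target, alpha=1.5):
    """Fenchel-Young loss for a one-hot target index, plus its gradient w.r.t. z."""
    p = entmax_bisect(z, alpha)
    loss = sum(pi * zi for pi, zi in zip(p, z)) - z[target] + tsallis_entropy(p, alpha)
    grad = [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]
    return loss, grad


print(entmax_loss([2.0, 1.0, -3.0], target=0))
print(entmax_loss([10.0, 0.0, 0.0], target=0))
```

Unlike cross-entropy with softmax, the loss reaches exactly zero once the model puts all mass on the target, at which point the gradient vanishes.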

7. Extensions, Variants, and Current Frontiers

Several variants of the entmax principle address practical challenges and open research directions:

α-ReLU: Replaces τ(z) with a fixed offset, yielding O(d) runtime and hard thresholding, at the expense of dynamic probability normalization (Tezekbayev et al., 2021).
ASEntmax: Adaptive-Scalable Entmax introduces query- and head-dependent temperature for context-length scaling, enabling heads to interpolate between highly sparse and dense patterns as a function of sequence length (Vasylenko et al., 19 Jun 2025).
Continuous-domain entmax: Attention over infinite/continuous domains (e.g., time intervals, image regions) is supported by measures induced by Tsallis statistics, bridging to generalized exponential families (Martins et al., 2020, Martins et al., 2021).
Block-masked entmax kernels: AdaSplash and AdaSplash-2 realize practical, high-throughput sparse attention at scale on modern hardware, enabling long-context language modeling and in-context learning (Gonçalves et al., 17 Feb 2025, Gonçalves et al., 16 Apr 2026).
Kernel regression viewpoint: Linking entmax attention to polynomial kernels with compact support suggests a principled design of sparse architectures based on statistical properties (Santos et al., 30 Jan 2026).
Sparsity prediction: Sparsefinder predicts the support of entmax without materializing the dense matrix, providing further runtime and Pareto efficiency relative to alternative sparse attention approximations (Treviso et al., 2021).
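The idea behind the bitpacked block masks used by the block-masked kernels above can be illustrated with a toy example. This sketch is not the AdaSplash implementation: the function names, the tiny block size, and the use of Python integers as bit fields are all assumptions made here to show how a mask lets a kernel skip blocks that contain only zeros.

```python
def block_mask(P, block=2):
    """Return one integer per row-block; bit j is set if column-block j has any nonzero.

    P is a dense attention matrix given as a list of rows (illustration only;
    a real kernel would never materialize it).
    """
    n_rows, n_cols = len(P), len(P[0])
    n_row_blocks = (n_rows + block - 1) // block
    n_col_blocks = (n_cols + block - 1) // block
    masks = []
    for bi in range(n_row_blocks):
        rows = P[bi * block:(bi + 1) * block]
        bits = 0
        for bj in range(n_col_blocks):
            if any(v != 0.0 for row in rows for v in row[bj * block:(bj + 1) * block]):
                bits |= 1 << bj
        masks.append(bits)
    return masks


def active_fraction(masks, n_col_blocks):
    """Fraction of blocks that must actually be computed."""
    active = sum(bin(m).count("1") for m in masks)
    return active / (len(masks) * n_col_blocks)


P = [
    [0.75, 0.25, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]
masks = block_mask(P, block=2)
print(masks, active_fraction(masks, n_col_blocks=2))
```

In this toy matrix half of the blocks are entirely zero, so a block-skipping kernel would do roughly half the work of a dense one.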

Notable frontiers include dynamic α learning, further hardware–kernel co-design, long-range generalization, efficient attention for structured and multimodal data types, and broader application to continuous and unstructured data.

Entmax attention constitutes a theoretically principled and practically robust alternative to conventional softmax-based attention, offering adaptive, interpretable sparsity with efficient differentiation and substantial empirical gains in data efficiency, generalization, and scalability across multiple modalities and model classes (Peters et al., 2019, Correia et al., 2019, Gonçalves et al., 17 Feb 2025, Gonçalves et al., 16 Apr 2026, Vasylenko et al., 19 Jun 2025, Santos et al., 30 Jan 2026, Tezekbayev et al., 2021, Bdeir et al., 2022, Treviso et al., 2021, Martins et al., 2020, Martins et al., 2021).