
Sparsemax Pointwise Attention in Neural Networks

Updated 2 April 2026
  • Sparsemax pointwise attention projects raw scores onto the probability simplex, producing sparse outputs with exact zeros for effective hard selection.
  • It can reduce computational load by propagating gradients only through active elements, offering potential efficiency gains compared to softmax.
  • Empirical studies in translation, classification, and speaker verification report improved interpretability and, in several settings, performance gains over traditional dense attention.

Sparsemax pointwise attention refers to a family of attention transformations in neural networks in which each application maps an unnormalized score vector to a sparse probability distribution. Unlike the canonical softmax, which assigns nonzero weight to every position, sparsemax outputs contain exact zeros, performing hard selection of the most salient elements. This property leads to improved interpretability, potential efficiency gains, and, when combined with additional constraints or structure, coverage control and better model behavior in various applications.

1. Definition and Mathematical Formulation

Sparsemax is the Euclidean projection of a real-valued vector onto the probability simplex. Given an attention score vector $z \in \mathbb{R}^K$, the sparsemax mapping is

$$\text{sparsemax}(z) = \arg\min_{p \in \Delta^{K-1}} \|p - z\|^2, \quad \text{where } \Delta^{K-1} = \{p \in \mathbb{R}^K: p_i \geq 0,\, \sum_{i=1}^K p_i = 1\}.$$

The closed-form for each coordinate is

$$[\text{sparsemax}(z)]_i = \max\{0,\, z_i - \tau(z)\},$$

where $\tau(z)$ is a threshold chosen so that $\sum_i \max\{0,\, z_i - \tau(z)\} = 1$. Computing $\tau(z)$ requires sorting $z$ and identifying the largest $k$ for which $z_{(k)} - \frac{1}{k}\left(\sum_{j=1}^k z_{(j)} - 1\right) > 0$, with $z_{(1)} \ge \dots \ge z_{(K)}$; then $\tau(z) = \frac{1}{k}\left(\sum_{j=1}^k z_{(j)} - 1\right)$ (Martins et al., 2016, Martins et al., 2020).

This hard-projection property yields sparse outputs: only the most significant entries of $z$ contribute nonzero mass, with the number of nonzeros determined adaptively by the support threshold.
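The sort-and-threshold computation described above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation rather than the reference code of the cited papers; the function name and the example scores are arbitrary.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of a score vector z onto the probability simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                    # z_(1) >= ... >= z_(K)
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, z.size + 1)
    # largest k with z_(k) - (sum_{j<=k} z_(j) - 1)/k > 0
    support = z_sorted - (cumsum - 1.0) / ks > 0
    k = np.nonzero(support)[0].max() + 1
    tau = (cumsum[k - 1] - 1.0) / k                # tau(z)
    return np.maximum(z - tau, 0.0)

# Example: only the two largest scores survive the projection.
print(sparsemax([2.0, 1.2, 0.1, -1.0]))           # -> [0.9 0.1 0.  0. ]
```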

2. Forward, Backward, and Computational Properties

The typical workflow for sparsemax pointwise attention is:

  1. Compute raw attention scores (e.g., via a dot product $q^\top k_i$ or any other scoring function).
  2. Apply sparsemax to obtain normalized weights.
  3. Form the weighted sum of values (attended representation).

Forward Pass: Sorting dominates the computational complexity at $O(K \log K)$, but selection algorithms can achieve $O(K)$ expected time, which is especially practical for moderate $K$.

Backward Pass: The gradient (Jacobian) with respect to the input scores is sparse. For support $S(z) = \{ j : [\text{sparsemax}(z)]_j > 0 \}$:

$$\frac{\partial\, \text{sparsemax}(z)}{\partial z} = \mathrm{Diag}(s) - \frac{1}{|S(z)|}\, s\, s^\top, \qquad s_j = \begin{cases} 1 & j \in S(z), \\ 0 & \text{otherwise.} \end{cases}$$

Consequently, during backpropagation, nonzero gradients are propagated only along the non-pruned positions, yielding potential reductions in computational and memory requirements (Martins et al., 2016, Martins et al., 2020).
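In code, this backward rule reduces to subtracting the mean upstream gradient over the support and zeroing everything else. A minimal sketch follows; `sparsemax_backward` is a placeholder name, `p` is assumed to be the sparsemax output, and `grad_out` the gradient of the loss with respect to it.

```python
import numpy as np

def sparsemax_backward(p, grad_out):
    """Multiply an upstream gradient by the sparsemax Jacobian.

    Implements J @ g = s * (g - mean_{j in S} g_j), where s indicates the
    support S(z) = {j : p_j > 0}; positions outside the support get zero gradient.
    """
    support = p > 0
    g_mean = grad_out[support].mean()
    return np.where(support, grad_out - g_mean, 0.0)
```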

Comparison with Softmax: Softmax is $O(K)$ (no sort), always dense, and differentiable everywhere. Sparsemax is piecewise linear, 1-Lipschitz, and its backward cost scales with the support size rather than with $K$ (Niculae et al., 2017, Martins et al., 2020).
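A quick side-by-side on a single score vector makes the contrast concrete (reusing the `sparsemax` sketch above; the softmax helper below is the usual max-shifted implementation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.2, 0.1, -1.0])
print(softmax(z))                    # every entry strictly positive (dense)
print(sparsemax(z))                  # exact zeros outside the support: [0.9 0.1 0. 0.]
```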

3. Extensions: Constrained and Structured Sparsemax

Sparsemax can be extended to include upper bounds on the probability mass available per position, addressing coverage and fertility constraints. The constrained sparsemax (csparsemax) maps a score vector $z$ and an upper-bound vector $u$ to

$$\text{csparsemax}(z; u) = \arg\min_{p \in \Delta^{K-1},\ p \leq u} \|p - z\|^2.$$

The solution remains unique and admits a thresholded form

$$[\text{csparsemax}(z; u)]_i = \min\{u_i,\ \max\{0,\, z_i - \tau\}\},$$

with the scalar $\tau$ chosen such that $\sum_i \min\{u_i,\ \max\{0,\, z_i - \tau\}\} = 1$. Rigorous optimization procedures, e.g., median-finding over the breakpoints, enable $O(K)$-time implementations (Malaviya et al., 2018).
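A simple way to prototype csparsemax is bisection on the shared threshold $\tau$, clipping each coordinate at its bound. This is a sketch for clarity rather than the $O(K)$ median-finding algorithm of Malaviya et al. (2018), and it assumes the bounds are feasible ($\sum_i u_i \geq 1$).

```python
import numpy as np

def csparsemax(z, u, n_iter=50):
    """Constrained sparsemax: find tau with sum_i min(u_i, max(0, z_i - tau)) = 1."""
    z, u = np.asarray(z, float), np.asarray(u, float)
    lo, hi = z.min() - 1.0, z.max()                # mass(lo) >= 1, mass(hi) = 0
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        mass = np.minimum(u, np.maximum(z - tau, 0.0)).sum()
        lo, hi = (lo, tau) if mass < 1.0 else (tau, hi)
    return np.minimum(u, np.maximum(z - tau, 0.0))

# Example: cap every position at 0.4 of the attention mass.
print(csparsemax([2.0, 1.2, 0.1, -1.0], [0.4] * 4))   # -> approx. [0.4 0.4 0.2 0. ]
```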

For structured attention, total-variation penalties (e.g., TVmax) are added to promote group or segment sparsity. The solution combines a proximal step for structure (e.g., spatial contiguity) and projection onto the simplex, composing sparsemax with structured sparsity (Martins et al., 2020, Niculae et al., 2017).

4. Implementation in Neural Attention Models

Sparsemax is a drop-in replacement for softmax in pointwise (dot-product) attention. Formally, for queries $Q$, keys $K$, and values $V$:

$$\text{Attention}(Q, K, V) = \text{sparsemax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V,$$

with sparsemax applied row-wise to the score matrix. In hierarchical, multi-head, or convolutional attention, sparsemax can be applied at word level, channel level, or over spatial regions. In constrained variants, fertility vectors are updated online based on previous attention steps, with a sink node absorbing excess attention to manage normalization (Malaviya et al., 2018, Liang et al., 2021, Ribeiro et al., 2020).
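As a concrete drop-in sketch (reusing the `sparsemax` function from Section 1; `sparsemax_attention` is an illustrative name, and scaled dot-product scoring is assumed):

```python
import numpy as np

def sparsemax_attention(Q, K, V):
    """Dot-product attention with sparsemax applied independently to each query's scores.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).  Returns the attended values
    and the sparse weight matrix; many weights come out exactly zero.
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.stack([sparsemax(row) for row in scores])   # row-wise projection
    return weights @ V, weights

rng = np.random.default_rng(0)
ctx, w = sparsemax_attention(rng.normal(size=(3, 8)),
                             rng.normal(size=(5, 8)),
                             rng.normal(size=(5, 16)))
print((w == 0).sum(), "of", w.size, "attention weights are exactly zero")
```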

Several deep learning ecosystems provide sparsemax implementations; otherwise, implementation requires careful sort/threshold routines and an $O(|S(z)|)$ backward pass.

5. Empirical Applications and Behavior

Sparsemax-based pointwise attention has been applied in several settings:

  • Machine Translation: Constrained sparsemax incorporated into NMT mitigates repetition and under-translation. On De–En, csparsemax attains BLEU ≈ 29.85 (vs. softmax ≈ 29.51), lower REP (2.67 vs. 3.37), and lower DROP (5.23% vs. 5.89%) (Malaviya et al., 2018).
  • Text Classification: In sentiment classification, sparsemax-attention models induce ∼40% word-level sparsity, achieving slightly lower accuracy than softmax but with increased interpretability (Ribeiro et al., 2020).
  • Speaker Verification: In ad-hoc arrays, sparsemax excels at removing weights from noisy microphone channels, leading to 3–6% relative EER gains over already strong softmax-attention baselines (Liang et al., 2021).
  • Visual Question Answering: Sparsemax attention on spatial grids selects only the most relevant regions, improving interpretability and matching human-like attention patterns (Martins et al., 2020).
  • Kernel Regression Perspective: There is a formal correspondence between sparsemax and Epanechnikov kernel regression with adaptive normalization. Sparse attention mechanisms thereby offer principled alternatives to heuristic top-$k$ selection (Santos et al., 30 Jan 2026).

A notable property is that sparsemax produces "hard" sparsity: only the top-$k$ entries per query receive nonzero mass, with $k$ determined contextually from the scores. This contrasts with softmax, where all entries are nonzero and attention maps are often difficult to interpret.

6. Theoretical Foundations and Generalizations

Sparsemax is a special case of the $\alpha$-entmax family (specifically, $\alpha = 2$), which interpolates between softmax ($\alpha = 1$) and polynomially sparse activations (e.g., the biweight case at $\alpha = 1.5$). The general $\alpha$-entmax has the closed form

$$[\alpha\text{-entmax}(z)]_i = \left[(\alpha - 1)\, z_i - \tau(z)\right]_+^{1/(\alpha - 1)},$$

where $\tau(z)$ again normalizes the output to the simplex. Sparsemax realizes the $\ell_2$-projection; its support adapts to the input nonlinearly, producing hard zeros (Correia et al., 2019, Martins et al., 2020). This yields a bridge between convex duality, Tsallis entropy regularization, and deformed exponential families.
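For reference, a generic $\alpha$-entmax can be prototyped by bisection on the threshold $\tau(z)$ (valid for $\alpha > 1$). This is an illustrative sketch with an arbitrary function name; published implementations use exact sorting-based or more careful bisection algorithms and are faster.

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=50):
    """alpha-entmax via bisection: solve sum_i [(alpha-1) z_i - tau]_+^(1/(alpha-1)) = 1."""
    zs = np.asarray(z, float) * (alpha - 1.0)
    lo, hi = zs.max() - 1.0, zs.max()              # mass(lo) >= 1, mass(hi) = 0
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        p = np.maximum(zs - tau, 0.0) ** (1.0 / (alpha - 1.0))
        lo, hi = (tau, hi) if p.sum() >= 1.0 else (lo, tau)
    return p / p.sum()                             # clean up residual bisection error

z = [2.0, 1.2, 0.1, -1.0]
print(entmax_bisect(z, alpha=2.0))   # recovers sparsemax: approx. [0.9 0.1 0. 0.]
print(entmax_bisect(z, alpha=1.5))   # intermediate behavior between softmax and sparsemax
```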

Furthermore, from a variational viewpoint, sparsemax is the gradient of the maximum of a linear form minus a squared $\ell_2$ norm over the simplex, a smoothed approximation of the max operator distinct from softmax, whose smoothing is through negative entropy (Niculae et al., 2017).
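Written out, this smoothed-max view (a standard identity, included here for concreteness) reads

$$\text{sparsemax}(z) = \nabla_z \max_{p \in \Delta^{K-1}} \left( p^\top z - \tfrac{1}{2}\,\|p\|_2^2 \right), \qquad \text{softmax}(z) = \nabla_z \max_{p \in \Delta^{K-1}} \left( p^\top z + \mathrm{H}(p) \right) = \nabla_z \log \sum_{i=1}^{K} e^{z_i},$$

where $\mathrm{H}(p) = -\sum_i p_i \log p_i$ is the Shannon entropy.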

Sparsemax enables direct inspection of which tokens, channels, or regions the model "attended to," since zeros in the output map correspond to pruned, ignored locations. This effect improves transparency especially in structured or explainable settings.

However, sparsemax can at times collapse multimodal attention into a single mode, depending on the input. Other variants, such as MultiMax (Zhou et al., 2024), introduce piecewise-linear modulators to trade off sparsity and multimodality, and $\alpha$-entmax allows for soft interpolation between the two behaviors.

Sparsemax can be further modified for additional constraints (e.g., upper bounds, structured penalties) or generalized to continuous domains, in which case density functions respect support constraints analogous to their discrete counterparts (Martins et al., 2020).

