
Sparsemax Pointwise Attention in Neural Networks

Updated 2 April 2026
  • Sparsemax pointwise attention projects raw scores onto the probability simplex, producing sparse outputs with exact zeros for effective hard selection.
  • It can reduce computational load by propagating gradients only through active elements, offering potential efficiency gains compared to softmax.
  • Empirical studies in translation, classification, and speaker verification report improved interpretability and, in several settings, performance gains over traditional dense attention.

Sparsemax pointwise attention refers to a family of attention transformations in neural networks in which each application maps an unnormalized score vector to a sparse probability distribution. Unlike the canonical softmax, which assigns nonzero weight to every position, sparsemax outputs contain exact zeros, performing hard selection of the most salient elements. This property leads to improved interpretability, potential efficiency gains, and, when combined with additional constraints or structure, coverage control and better model behavior in various applications.

1. Definition and Mathematical Formulation

Sparsemax is the Euclidean projection of a real-valued vector onto the probability simplex. Given an attention score vector $z \in \mathbb{R}^K$, the sparsemax mapping is

$$\text{sparsemax}(z) = \arg\min_{p \in \Delta^{K-1}} \|p - z\|^2, \quad \text{where } \Delta^{K-1} = \{p \in \mathbb{R}^K: p_i \geq 0,\, \sum_{i=1}^K p_i = 1\}.$$

The closed-form for each coordinate is

$$[\text{sparsemax}(z)]_i = \max\{0,\, z_i - \tau(z)\},$$

where $\tau(z)$ is a threshold chosen so that $\sum_i \max\{0,\, z_i - \tau(z)\} = 1$. Computing $\tau(z)$ requires sorting $z$ and identifying the largest $k$ for which $z_{(k)} - \frac{1}{k}\left(\sum_{j=1}^k z_{(j)} - 1\right) > 0$, with $z_{(1)} \ge \dots \ge z_{(K)}$; then $\tau(z) = \frac{1}{k}\left(\sum_{j=1}^k z_{(j)} - 1\right)$ (Martins et al., 2016, Martins et al., 2020).

This hard-projection property yields sparse outputs: only the most significant entries of $z$ contribute nonzero mass, with the number of nonzeros determined adaptively by the support threshold.
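The sort-and-threshold computation described above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation rather than the reference code of the cited papers; the function name and the example scores are arbitrary.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of a score vector z onto the probability simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                    # z_(1) >= ... >= z_(K)
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, z.size + 1)
    # largest k with z_(k) - (sum_{j<=k} z_(j) - 1)/k > 0
    support = z_sorted - (cumsum - 1.0) / ks > 0
    k = np.nonzero(support)[0].max() + 1
    tau = (cumsum[k - 1] - 1.0) / k                # tau(z)
    return np.maximum(z - tau, 0.0)

# Example: only the two largest scores survive the projection.
print(sparsemax([2.0, 1.2, 0.1, -1.0]))           # -> [0.9 0.1 0.  0. ]
```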

2. Forward, Backward, and Computational Properties

The typical workflow for sparsemax pointwise attention is:

  1. Compute raw attention scores (e.g., via a dot product $q^\top k_i$ or any other scoring function).
  2. Apply sparsemax to obtain normalized weights.
  3. Form the weighted sum of values (attended representation).

Forward Pass: Sorting dominates the computational complexity at $O(K \log K)$, but selection algorithms can achieve $O(K)$ expected time, which is especially practical for moderate $K$.

Backward Pass: The gradient (Jacobian) with respect to the input scores is sparse. For support $S(z) = \{ j : [\text{sparsemax}(z)]_j > 0 \}$:

$$\frac{\partial\, \text{sparsemax}(z)}{\partial z} = \mathrm{Diag}(s) - \frac{1}{|S(z)|}\, s\, s^\top, \qquad s_j = \begin{cases} 1 & j \in S(z), \\ 0 & \text{otherwise.} \end{cases}$$

Consequently, during backpropagation, nonzero gradients are propagated only along the non-pruned positions, yielding potential reductions in computational and memory requirements (Martins et al., 2016, Martins et al., 2020).
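In code, this backward rule reduces to subtracting the mean upstream gradient over the support and zeroing everything else. A minimal sketch follows; `sparsemax_backward` is a placeholder name, `p` is assumed to be the sparsemax output, and `grad_out` the gradient of the loss with respect to it.

```python
import numpy as np

def sparsemax_backward(p, grad_out):
    """Multiply an upstream gradient by the sparsemax Jacobian.

    Implements J @ g = s * (g - mean_{j in S} g_j), where s indicates the
    support S(z) = {j : p_j > 0}; positions outside the support get zero gradient.
    """
    support = p > 0
    g_mean = grad_out[support].mean()
    return np.where(support, grad_out - g_mean, 0.0)
```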

Comparison with Softmax: Softmax is $O(K)$ (no sort), always dense, and differentiable everywhere. Sparsemax is piecewise linear, 1-Lipschitz, and its backward cost scales with the support size rather than with $K$ (Niculae et al., 2017, Martins et al., 2020).
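A quick side-by-side on a single score vector makes the contrast concrete (reusing the `sparsemax` sketch above; the softmax helper below is the usual max-shifted implementation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.2, 0.1, -1.0])
print(softmax(z))                    # every entry strictly positive (dense)
print(sparsemax(z))                  # exact zeros outside the support: [0.9 0.1 0. 0.]
```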

3. Extensions: Constrained and Structured Sparsemax

Sparsemax can be extended to include upper bounds on the probability mass available per position, addressing coverage and fertility constraints. The constrained sparsemax (csparsemax) maps a score vector $z$ and an upper-bound vector $u$ to

$$\text{csparsemax}(z; u) = \arg\min_{p \in \Delta^{K-1},\ p \leq u} \|p - z\|^2.$$

The solution remains unique and admits a thresholded form

$$[\text{csparsemax}(z; u)]_i = \min\{u_i,\ \max\{0,\, z_i - \tau\}\},$$

with the scalar $\tau$ chosen such that $\sum_i \min\{u_i,\ \max\{0,\, z_i - \tau\}\} = 1$. Rigorous optimization procedures, e.g., median-finding over the breakpoints, enable $O(K)$-time implementations (Malaviya et al., 2018).
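A simple way to prototype csparsemax is bisection on the shared threshold $\tau$, clipping each coordinate at its bound. This is a sketch for clarity rather than the $O(K)$ median-finding algorithm of Malaviya et al. (2018), and it assumes the bounds are feasible ($\sum_i u_i \geq 1$).

```python
import numpy as np

def csparsemax(z, u, n_iter=50):
    """Constrained sparsemax: find tau with sum_i min(u_i, max(0, z_i - tau)) = 1."""
    z, u = np.asarray(z, float), np.asarray(u, float)
    lo, hi = z.min() - 1.0, z.max()                # mass(lo) >= 1, mass(hi) = 0
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        mass = np.minimum(u, np.maximum(z - tau, 0.0)).sum()
        lo, hi = (lo, tau) if mass < 1.0 else (tau, hi)
    return np.minimum(u, np.maximum(z - tau, 0.0))

# Example: cap every position at 0.4 of the attention mass.
print(csparsemax([2.0, 1.2, 0.1, -1.0], [0.4] * 4))   # -> approx. [0.4 0.4 0.2 0. ]
```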

For structured attention, total-variation penalties (e.g., TVmax) are added to promote group or segment sparsity. The solution combines a proximal step for structure (e.g., spatial contiguity) and projection onto the simplex, composing sparsemax with structured sparsity (Martins et al., 2020, Niculae et al., 2017).

4. Implementation in Neural Attention Models

Sparsemax is a drop-in replacement for softmax in pointwise (dot-product) attention. Formally, for queries $Q$, keys $K$, and values $V$:

$$\text{Attention}(Q, K, V) = \text{sparsemax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V,$$

with sparsemax applied row-wise to the score matrix. In hierarchical, multi-head, or convolutional attention, sparsemax can be applied at word level, channel level, or over spatial regions. In constrained variants, fertility vectors are updated online based on previous attention steps, with a sink node absorbing excess attention to manage normalization (Malaviya et al., 2018, Liang et al., 2021, Ribeiro et al., 2020).
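As a concrete drop-in sketch (reusing the `sparsemax` function from Section 1; `sparsemax_attention` is an illustrative name, and scaled dot-product scoring is assumed):

```python
import numpy as np

def sparsemax_attention(Q, K, V):
    """Dot-product attention with sparsemax applied independently to each query's scores.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).  Returns the attended values
    and the sparse weight matrix; many weights come out exactly zero.
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.stack([sparsemax(row) for row in scores])   # row-wise projection
    return weights @ V, weights

rng = np.random.default_rng(0)
ctx, w = sparsemax_attention(rng.normal(size=(3, 8)),
                             rng.normal(size=(5, 8)),
                             rng.normal(size=(5, 16)))
print((w == 0).sum(), "of", w.size, "attention weights are exactly zero")
```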

Several deep learning ecosystems provide sparsemax implementations; otherwise, implementation requires careful sort/threshold routines and an $O(|S(z)|)$ backward pass.

5. Empirical Applications and Behavior

Sparsemax-based pointwise attention has been applied in several settings:

  • Machine Translation: Constrained sparsemax incorporated into NMT mitigates repetition and under-translation. On De–En, csparsemax attains BLEU ≈ 29.85 (vs. softmax ≈ 29.51), lower REP (2.67 vs. 3.37), and lower DROP (5.23% vs. 5.89%) (Malaviya et al., 2018).
  • Text Classification: In sentiment classification, sparsemax-attention models induce ∼40% word-level sparsity, achieving slightly lower accuracy than softmax but with increased interpretability (Ribeiro et al., 2020).
  • Speaker Verification: In ad-hoc arrays, sparsemax excels at removing weights from noisy microphone channels, leading to 3–6% relative EER gains over already strong softmax-attention baselines (Liang et al., 2021).
  • Visual Question Answering: Sparsemax attention on spatial grids selects only the most relevant regions, improving interpretability and matching human-like attention patterns (Martins et al., 2020).
  • Kernel Regression Perspective: There is a formal correspondence between sparsemax and Epanechnikov kernel regression with adaptive normalization. Sparse attention mechanisms thereby offer principled alternatives to heuristic top-$k$ selection (Santos et al., 30 Jan 2026).

A notable property is that sparsemax produces "hard" sparsity: only the top-$k$ entries per query receive nonzero mass, with $k$ determined contextually from the scores. This contrasts with softmax, where all entries are nonzero and attention maps are often difficult to interpret.

6. Theoretical Foundations and Generalizations

Sparsemax is a special case of the $\alpha$-entmax family (specifically, $\alpha = 2$), which interpolates between softmax ($\alpha = 1$) and polynomially sparse activations (e.g., the biweight case at $\alpha = 1.5$). The general $\alpha$-entmax has the closed form

$$[\alpha\text{-entmax}(z)]_i = \left[(\alpha - 1)\, z_i - \tau(z)\right]_+^{1/(\alpha - 1)},$$

where $\tau(z)$ again normalizes the output to the simplex. Sparsemax realizes the $\ell_2$-projection; its support adapts to the input nonlinearly, producing hard zeros (Correia et al., 2019, Martins et al., 2020). This yields a bridge between convex duality, Tsallis entropy regularization, and deformed exponential families.
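For reference, a generic $\alpha$-entmax can be prototyped by bisection on the threshold $\tau(z)$ (valid for $\alpha > 1$). This is an illustrative sketch with an arbitrary function name; published implementations use exact sorting-based or more careful bisection algorithms and are faster.

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=50):
    """alpha-entmax via bisection: solve sum_i [(alpha-1) z_i - tau]_+^(1/(alpha-1)) = 1."""
    zs = np.asarray(z, float) * (alpha - 1.0)
    lo, hi = zs.max() - 1.0, zs.max()              # mass(lo) >= 1, mass(hi) = 0
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        p = np.maximum(zs - tau, 0.0) ** (1.0 / (alpha - 1.0))
        lo, hi = (tau, hi) if p.sum() >= 1.0 else (lo, tau)
    return p / p.sum()                             # clean up residual bisection error

z = [2.0, 1.2, 0.1, -1.0]
print(entmax_bisect(z, alpha=2.0))   # recovers sparsemax: approx. [0.9 0.1 0. 0.]
print(entmax_bisect(z, alpha=1.5))   # intermediate behavior between softmax and sparsemax
```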

Furthermore, from a variational viewpoint, sparsemax is the gradient of the maximum of a linear form minus a squared $\ell_2$ norm over the simplex, a smoothed approximation of the max operator distinct from softmax, whose smoothing is through negative entropy (Niculae et al., 2017).
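Written out, this smoothed-max view (a standard identity, included here for concreteness) reads

$$\text{sparsemax}(z) = \nabla_z \max_{p \in \Delta^{K-1}} \left( p^\top z - \tfrac{1}{2}\,\|p\|_2^2 \right), \qquad \text{softmax}(z) = \nabla_z \max_{p \in \Delta^{K-1}} \left( p^\top z + \mathrm{H}(p) \right) = \nabla_z \log \sum_{i=1}^{K} e^{z_i},$$

where $\mathrm{H}(p) = -\sum_i p_i \log p_i$ is the Shannon entropy.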

Sparsemax enables direct inspection of which tokens, channels, or regions the model "attended to," since zeros in the output map correspond to pruned, ignored locations. This effect improves transparency especially in structured or explainable settings.

However, sparsemax can at times collapse multimodal attention into a single mode, depending on the input. Other variants, such as MultiMax (Zhou et al., 2024), introduce piecewise-linear modulators to trade off sparsity and multimodality, and $\alpha$-entmax allows for soft interpolation between the two behaviors.

Sparsemax can be further modified for additional constraints (e.g., upper bounds, structured penalties) or generalized to continuous domains, in which case density functions respect support constraints analogous to their discrete counterparts (Martins et al., 2020).

