Sparsemax: Sparse Probability Mapping
- Sparsemax is a sparse probability mapping that projects a real-valued vector onto the probability simplex, producing outputs with exact zeros.
- It enhances model interpretability and control by yielding concise, selective outputs useful in attention mechanisms, multi-label classification, and structured prediction.
- Its efficient computation via sorting and thresholding, along with a differentiable Jacobian, supports practical integration into advanced machine learning models.
Sparsemax is a piecewise linear, sparse probability mapping that generalizes and contrasts with the ubiquitous softmax function. Defined as the Euclidean projection of a real-valued vector onto the probability simplex, sparsemax is able to produce probability distributions with exact zeros, leading to sparse solutions that are particularly appealing for attention mechanisms, multi-label classification, structured prediction, and uncertainty quantification. Sparsemax retains several key invariance and equivariance properties of softmax, but its non-dense outputs underpin better interpretability, potential efficiency, and more precise control over model behavior.
1. Mathematical Definition and Key Properties
Sparsemax is defined as the Euclidean projection of an input vector $z \in \mathbb{R}^K$ onto the probability simplex $\Delta^{K-1} = \{p \in \mathbb{R}^K : p \geq 0,\ \mathbf{1}^\top p = 1\}$, resulting in:

$$\operatorname{sparsemax}(z) = \operatorname*{arg\,min}_{p \in \Delta^{K-1}} \|p - z\|_2^2.$$

The unique closed-form solution is:

$$\operatorname{sparsemax}_i(z) = [z_i - \tau(z)]_+,$$

where $[t]_+ = \max(t, 0)$ and $\tau(z)$ is a threshold satisfying $\sum_j [z_j - \tau(z)]_+ = 1$. Equivalently, if $S(z) = \{j : \operatorname{sparsemax}_j(z) > 0\}$ denotes the support, then:

$$\tau(z) = \frac{\sum_{j \in S(z)} z_j - 1}{|S(z)|}.$$

The support is data-dependent, and sparsemax "hits the boundary of the simplex," setting many entries of the output to zero exactly.
The function is translation invariant ($\operatorname{sparsemax}(z + c\mathbf{1}) = \operatorname{sparsemax}(z)$ for all $c \in \mathbb{R}$) and permutation equivariant. In the low-temperature limit (large scaling of $z$), the distribution concentrates mass on the largest entries.
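The closed-form solution translates directly into a short routine. Below is a minimal NumPy sketch (function and variable names are illustrative, not tied to any particular library) that computes sparsemax by sorting and thresholding and checks the exact-zero and translation-invariance properties described above:

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of a score vector z onto the probability simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                  # scores in decreasing order
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    # Largest k such that 1 + k * z_(k) exceeds the cumulative sum of the top-k scores
    k_max = k[1 + k * z_sorted > cumsum][-1]
    tau = (cumsum[k_max - 1] - 1.0) / k_max      # threshold tau(z)
    return np.maximum(z - tau, 0.0)

z = np.array([1.2, 0.3, -0.8, 2.1])
p = sparsemax(z)
print(p, p.sum())                                # exact zeros, sums to 1
assert np.allclose(sparsemax(z + 5.0), p)        # translation invariance
```

On this example only two coordinates survive the threshold, so the output contains two exact zeros.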
2. Relationship to Softmax, Entmax, and Other Probability Mappings
Softmax can be seen as the unique solution to maximizing a linear term plus Shannon entropy over the simplex:

$$\operatorname{softmax}(z) = \operatorname*{arg\,max}_{p \in \Delta^{K-1}} \; p^\top z + H(p), \qquad \text{where } H(p) = -\sum_j p_j \log p_j.$$

In contrast, sparsemax solves a projection problem involving Euclidean distance. Sparsemax can also be understood as an instantiation of the more general family of $\alpha$-entmax transformations:

$$\alpha\text{-entmax}(z) = \operatorname*{arg\,max}_{p \in \Delta^{K-1}} \; p^\top z + H_\alpha^{\mathsf{T}}(p),$$

with Tsallis $\alpha$-entropy $H_\alpha^{\mathsf{T}}(p) = \frac{1}{\alpha(\alpha - 1)} \sum_j \left(p_j - p_j^\alpha\right)$ for $\alpha \neq 1$ (and $H_1^{\mathsf{T}} = H$). Softmax is recovered at $\alpha = 1$, sparsemax at $\alpha = 2$, and any $\alpha > 1$ introduces sparsity into the mapping (Peters et al., 2019).
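To make the entmax family concrete, here is a minimal NumPy sketch of a generic $\alpha$-entmax computed by bisection on the threshold, following the form $p_i = [(\alpha-1)z_i - \tau]_+^{1/(\alpha-1)}$ used by Peters et al. (2019); the function name, iteration count, and bounds are illustrative assumptions. Setting $\alpha = 2$ reproduces sparsemax, while $\alpha$ close to 1 approaches softmax:

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=60):
    """alpha-entmax (alpha > 1) via bisection on the threshold tau."""
    z = (alpha - 1.0) * np.asarray(z, dtype=float)
    # At tau = max(z) the sum of probabilities is 0; at tau = max(z) - 1 it is >= 1.
    tau_lo, tau_hi = z.max() - 1.0, z.max()
    for _ in range(n_iter):
        tau = 0.5 * (tau_lo + tau_hi)
        p = np.maximum(z - tau, 0.0) ** (1.0 / (alpha - 1.0))
        if p.sum() >= 1.0:
            tau_lo = tau          # sum too large: raise the threshold
        else:
            tau_hi = tau          # sum too small: lower the threshold
    return p / p.sum()            # normalize away residual bisection error

z = np.array([1.2, 0.3, -0.8, 2.1])
print(entmax_bisect(z, alpha=2.0))    # matches sparsemax(z): exact zeros
print(entmax_bisect(z, alpha=1.01))   # dense, close to softmax(z)
```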
Unified frameworks such as "sparsegen" further show that a family of mappings (softmax, sparsemax, sum-normalization, spherical softmax) can be recovered by varying the transformation and regularization terms within a convex optimization over the simplex, and that explicit parameterization allows the degree of sparsity to be controlled (Laha et al., 2018).
3. Computational Aspects and Differentiation
The forward sparsemax projection can be computed via $O(K \log K)$ sorting and thresholding. The Jacobian of sparsemax, required for backpropagation, is piecewise constant for a fixed support set $S(z)$ and is given by:

$$\frac{\partial\, \operatorname{sparsemax}(z)}{\partial z} = \operatorname{Diag}(s) - \frac{s s^\top}{|S(z)|},$$

where $s$ is the indicator vector of $S(z)$. If the support is already computed, the Jacobian-vector product (needed for backward computation) can be evaluated in $O(|S(z)|)$ time, which is favorable when the output is highly sparse (Martins et al., 2016).
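Because of this diagonal-minus-rank-one structure, the backward pass reduces to a centering operation over the support. A minimal NumPy sketch (names are illustrative; `p` is assumed to be a precomputed sparsemax output):

```python
import numpy as np

def sparsemax_jvp(p, v):
    """Multiply the sparsemax Jacobian at output p by a vector v.

    With support S = {j : p_j > 0} and indicator vector s, the Jacobian is
    Diag(s) - s s^T / |S|, so the product only touches the |S| nonzero entries.
    """
    support = p > 0
    v_hat = v[support].mean()            # (s^T v) / |S|
    out = np.zeros_like(p)
    out[support] = v[support] - v_hat    # (Diag(s) - s s^T / |S|) v
    return out

# Example backward pass for an upstream gradient v = dL/dp.
p = np.array([0.05, 0.0, 0.0, 0.95])     # sparsemax([1.2, 0.3, -0.8, 2.1]) from the earlier sketch
v = np.array([0.5, -1.0, 2.0, 0.1])
print(sparsemax_jvp(p, v))               # dL/dz; zero outside the support
```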
4. Loss Functions and Statistical Connections
Sparsemax does not admit a direct negative log-likelihood loss, as its outputs may be exactly zero (making $\log \operatorname{sparsemax}_k(z)$ undefined). Instead, it admits a canonical convex loss—the sparsemax loss:

$$L_{\operatorname{sparsemax}}(z; k) = -z_k + \tfrac{1}{2} \sum_{j \in S(z)} \left(z_j^2 - \tau^2(z)\right) + \tfrac{1}{2},$$

with gradient $\nabla_z L_{\operatorname{sparsemax}}(z; k) = -\delta_k + \operatorname{sparsemax}(z)$, where $\delta_k$ is the one-hot vector of the gold label $k$. In the binary case, writing $t$ for the score margin between the correct and incorrect class, this loss reduces to a "modified Huber loss":

$$L(t) = \begin{cases} 0 & \text{if } t \geq 1, \\ (1 - t)^2/4 & \text{if } -1 \leq t \leq 1, \\ -t & \text{if } t \leq -1. \end{cases}$$

This robust classification loss exhibits margin-based properties and connects to the literature on robust statistics.
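A small sketch of the loss and its gradient, reusing the `sparsemax` routine from the Section 1 example (all names are illustrative assumptions):

```python
import numpy as np

def sparsemax_loss(z, k):
    """Sparsemax loss for score vector z and gold label index k."""
    p = sparsemax(z)                                  # forward projection (Section 1 sketch)
    support = p > 0
    tau = (z[support].sum() - 1.0) / support.sum()    # threshold tau(z) on the support
    return -z[k] + 0.5 * np.sum(z[support] ** 2 - tau ** 2) + 0.5

def sparsemax_loss_grad(z, k):
    """Gradient of the sparsemax loss: -delta_k + sparsemax(z)."""
    g = sparsemax(z)
    g[k] -= 1.0
    return g

z = np.array([1.2, 0.3, -0.8, 2.1])
print(sparsemax_loss(z, k=3))         # small loss: the correct class dominates the support
print(sparsemax_loss_grad(z, k=3))    # sparsemax(z) minus the one-hot target
```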
5. Applications and Empirical Results
Sparsemax has been deployed across a range of fields:
- Attention mechanisms: Sparsemax in attention layers yields sparse attention maps, focusing weight on key elements of the input (a minimal sketch follows this list). Substituting sparsemax for softmax in natural language inference or VQA models leads to sparser, more interpretable attention without loss of predictive accuracy, and sometimes with small improvements (Martins et al., 2016, Martins et al., 2020).
- Multi-label classification: Used to model label sets or proportions, sparsemax improves estimation accuracy (mean squared error, Jensen–Shannon divergence) when the target distributions are sparse. Empirical results on multi-label datasets (Scene, Emotions, Birds, CAL500, Reuters) indicate comparable or better performance than softmax or independent binary logistic regressions, particularly for large label spaces (Martins et al., 2016).
- Structured and constrained attention: Extensions such as constrained sparsemax (which adds upper bounds, e.g., fertility constraints on attention weights in machine translation) provide sparser, bounded alignments, reducing over-translation and under-translation effects and lowering error metrics such as REP-score and DROP-score compared to softmax (Malaviya et al., 2018).
- Imitation learning: In maximum causal Tsallis entropy imitation learning, the optimal policy takes the sparsemax functional form, enabling sparse multi-modal policies that assign zero mass to suboptimal actions and improve learning of diverse expert behaviors (Lee et al., 2018).
- Topic modeling: Sparsemax is used to induce hard sparsity in neural topic models, enabling more interpretable document–topic and topic–word distributions, and leading to improved perplexity and topic coherence (PMI) compared to softmax-based constructions, especially for short texts (Lin et al., 2018).
- Feature selection and embedding transfer: By projecting mask vectors or similarity scores via sparsemax, it is possible to select features or tokens in a differentiable, controlled manner (e.g., SLM for feature selection (Dong et al., 2023), FOCUS for multilingual embedding transfer (Dobler et al., 2023)).
- Uncertainty quantification: Sparsemax and its entmax generalizations have been shown to provide efficient, interpretable set predictors in conformal prediction frameworks, yielding prediction sets with marginal coverage guarantees and competitive efficiency (Campos et al., 2025).
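As referenced in the attention bullet above, here is a minimal sketch of single-query dot-product attention with sparsemax in place of softmax; it reuses the `sparsemax` routine from the Section 1 example, and the scores are illustrative values rather than outputs of a trained model:

```python
import numpy as np

# Scaled dot-product scores of one query against 6 memory positions (illustrative values).
scores = np.array([2.0, 0.4, 0.1, -0.3, 1.1, -1.2])
values = np.arange(6 * 4, dtype=float).reshape(6, 4)   # toy value vectors

weights = sparsemax(scores)        # sparse attention distribution (Section 1 sketch)
context = weights @ values         # convex combination of only the selected values

print(weights)                     # [0.95, 0, 0, 0, 0.05, 0]: exact zeros on irrelevant positions
print(context)
```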
6. Extensions: Structure, Control, and Continuous Domains
Sparsemax has motivated numerous extensions:
- Structured Sparsemax: By incorporating structured penalties (e.g., total variation, fused lasso) into the regularization, variants like fusedmax and TVmax induce contiguous or grouped sparsity, enhancing interpretability in applications such as summarization, vision, or multi-channel selection (Niculae et al., 2017, Martins et al., 2020).
- Controllable Sparsity: Frameworks such as sparsegen-lin, sparsehourglass, and scaling sparsemax provide explicit control over the level of sparsity, allowing the number of nonzero outputs to be dynamically tuned (Laha et al., 2018, Chen et al., 2021, Dong et al., 2023); a toy illustration follows this list.
- Continuous Sparsemax: The Ω-regularized prediction map and Fenchel–Young loss framework generalize sparsemax to continuous inputs, enabling the construction of probability densities with compact support over continuous domains—a property useful in attention over time or space (Martins et al., 2020, Martins et al., 2021).
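As a toy illustration of the controllable-sparsity idea above (a generic scale/temperature knob applied before the projection, not the exact parameterization of sparsegen-lin or scaling sparsemax), rescaling the scores before sparsemax directly changes the support size; this reuses the `sparsemax` routine from the Section 1 example:

```python
import numpy as np

z = np.array([1.2, 0.3, -0.8, 2.1, 0.9])

# Larger scale -> sharper, sparser output; smaller scale -> denser output.
for scale in (0.25, 1.0, 4.0):
    p = sparsemax(scale * z)               # sparsemax from the Section 1 sketch
    print(f"scale={scale:>4}: support={int((p > 0).sum())}  p={np.round(p, 3)}")
```

On this input the support shrinks from four to two to one nonzero entry as the scale increases.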
7. Limitations and Empirical Trade-offs
While sparsemax offers clear interpretability and potential computational advantages, limitations are dataset- and task-dependent:
- On certain document classification tasks (e.g., IMDB sentiment analysis), replacing softmax with sparsemax in hierarchical attention networks did not result in significant gains in predictive performance, though sparsity may enhance interpretability (Ribeiro et al., 2020).
- Sparsemax typically matches softmax on standard metrics; statistically significant gains are most evident in tasks where the ground-truth or optimal outputs are themselves sparse.
- In safety-critical settings such as autonomous driving, even with piecewise linearity, sparsemax-based architectures may not automatically improve robustness properties and require careful, setting-dependent verification (Liao et al., 2022).
Sparsemax represents a foundational alternative to softmax, grounded in a geometrically motivated projection and producing sparse probability distributions. Its utility spans attention, structured prediction, feature selection, imitation learning, uncertainty quantification, and beyond. By enabling direct, differentiable, and interpretable sparsity, sparsemax and its generalizations continue to serve as a catalyst for advances in model efficiency, interpretability, and control across machine learning.