
Continuous Sparsemax Overview

Updated 21 December 2025
  • Continuous sparsemax is a probability mapping method that projects continuous score functions onto valid density spaces, inducing sparsity in the output distribution.
  • It employs a regularized variational formulation with closed-form solutions, enhancing computational efficiency and interpretability in applications like attention mechanisms and latent variable inference.
  • Empirical results show that continuous sparsemax yields compact, data-adaptive supports and competitive performance in tasks such as text classification, machine translation, and visual question answering.

Continuous sparsemax is a family of probability mapping functions that extend the sparsemax operator from the discrete (finite) to the continuous domain. As with discrete sparsemax, continuous sparsemax produces sparse probability distributions—support is restricted to a region in the continuous input space—by projecting a score function onto the set of valid probability densities. This approach brings computational, statistical, and interpretability advantages in machine learning models employing attention, latent variable inference, or multi-label prediction, especially in settings with continuous or large structured domains (Martins et al., 2020, Martins et al., 2021).

1. Theoretical Formulation of Continuous Sparsemax

Continuous sparsemax is obtained as the solution to a regularized variational problem over densities defined on a continuous domain $S \subseteq \mathbb{R}^n$. Let $u : S \to \mathbb{R}$ be a score function (potential); then continuous sparsemax solves:

$$p_2[u](x) = \arg\max_{p \ge 0,\ \int_S p(x)\, dx = 1} \left\{ \int_S p(x)\, u(x)\, dx - \frac{1}{2} \int_S p(x)^2\, dx \right\}$$

The Karush-Kuhn-Tucker conditions yield an explicit solution:

$$p_2(x) = [u(x) - \lambda]_+$$

where $[\,\cdot\,]_+$ denotes the positive part, and the normalization constant (threshold) $\lambda$ is chosen so that $p_2$ integrates to 1:

$$\int_S [u(x) - \lambda]_+\, dx = 1$$

This is a direct continuous analogue of the discrete sparsemax projection onto the simplex (Martins et al., 2016, Martins et al., 2020).
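As a concrete illustration (a minimal sketch, not taken from the cited papers), the mapping can be evaluated numerically on a discretized one-dimensional domain: bisection on the normalization constraint yields the threshold $\lambda$, and the density is then $[u(x) - \lambda]_+$ on the grid. The score function and grid size below are arbitrary choices.

```python
import numpy as np

# Minimal sketch: continuous sparsemax on a discretized domain S = [0, 1].
# The two-peaked score function and the grid resolution are illustrative choices.

def continuous_sparsemax(u, x, n_iter=100):
    """Return p(x) = [u(x) - lam]_+ with lam chosen so that p integrates to 1."""
    dx = x[1] - x[0]

    def mass(lam):
        return np.sum(np.maximum(u - lam, 0.0)) * dx

    # mass(lam) decreases in lam; bracket the root and bisect.
    lo = u.min() - 1.0 / (x[-1] - x[0])   # mass(lo) >= 1
    hi = u.max()                          # mass(hi) == 0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if mass(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return np.maximum(u - lam, 0.0), lam

x = np.linspace(0.0, 1.0, 4001)
u = 4.0 * np.exp(-30.0 * (x - 0.30) ** 2) + 3.0 * np.exp(-50.0 * (x - 0.75) ** 2)
p, lam = continuous_sparsemax(u, x)

print("threshold:", lam)
print("total mass:", np.sum(p) * (x[1] - x[0]))      # ~1.0
print("zero-density fraction:", np.mean(p == 0.0))   # support excludes low-score regions
```

The same bisection applies to any bounded score function; for quadratic scores the normalization can be carried out analytically (Section 3).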

2. Relation to Deformed Exponential Families and Fenchel-Young Losses

Continuous sparsemax is a special case of the general $\Omega$-regularized prediction map within the Fenchel-Young loss framework (Martins et al., 2021). When the negentropy $\Omega$ is taken as the Tsallis-2 (quadratic) negentropy, the unique maximizer yields continuous sparsemax. Equivalently, within the more general $\alpha$-entmax family, the choice $\alpha = 2$ recovers continuous sparsemax:

$$p_2(x) = \max(u(x) - \lambda,\, 0)$$

Within this framework, the Fenchel-Young loss for a model $f$ and an empirical density $p$ is:

$$L_{\Omega}(f; p) = \Omega^*(f) + \Omega(p) - \langle p, f \rangle$$

This construction yields the moment-matching property in models parameterized linearly in features, and includes as special cases the maximum-entropy principle (softmax, $\alpha = 1$) and sparsemax ($\alpha = 2$), both in finite and continuous settings (Martins et al., 2021).
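As a worked finite-dimensional instance (the continuous case replaces sums by integrals), take $\Omega(p) = \frac{1}{2}\|p\|_2^2$, which equals the Tsallis-2 negentropy up to an additive constant. The conjugate is attained at $p^\star = \mathrm{sparsemax}(f)$,

$$\Omega^*(f) = \max_{p \in \Delta^K} \left\{ \langle p, f \rangle - \tfrac{1}{2}\|p\|_2^2 \right\} = \langle p^\star, f \rangle - \tfrac{1}{2}\|p^\star\|_2^2,$$

so the Fenchel-Young loss above reduces to

$$L_{\Omega}(f; p) = \langle p^\star - p, f \rangle + \tfrac{1}{2}\|p\|_2^2 - \tfrac{1}{2}\|p^\star\|_2^2,$$

i.e., the sparsemax loss; its gradient with respect to $f$ is $p^\star - p$, which gives the moment-matching condition $p^\star = p$ at stationarity.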

3. Closed-Form Solutions: Quadratic Scores and β-Gaussians

For quadratic score functions $u(x) = -\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)$, the solution $p_2(x)$ is a truncated paraboloid (the Epanechnikov kernel in 1D and its multivariate analogue in higher dimensions; related kernels such as the biweight and triweight arise for other values of $\alpha$):

$$p_2(x) = \left[ -\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu) - \lambda \right]_+$$

The support is an ellipsoid centered at $\mu$. The threshold $\lambda$ and the support radius $r$ are obtained by solving the normalization constraint; in the $n$-dimensional case, this requires evaluating truncated moment integrals (Martins et al., 2021).

These densities constitute the β-Gaussian family (with $\beta = 0$), a special case of elliptical distributions with bounded support.
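A brief sketch of the one-dimensional case (illustrative parameters $\mu$ and $\sigma$; the closed-form threshold below simply carries out the normalization integral):

```python
import numpy as np

# Sketch of the 1D closed form: for u(x) = -(x - mu)^2 / (2 sigma^2), continuous
# sparsemax gives a truncated parabola (Epanechnikov-shaped density).
# mu and sigma are illustrative parameters.

mu, sigma = 0.0, 1.5

# Normalization: writing a = -lambda, the support is |x - mu| <= r with
# r = sigma * sqrt(2a), and the total mass is (4*sqrt(2)/3) * sigma * a**1.5 = 1.
a = (3.0 / (4.0 * np.sqrt(2.0) * sigma)) ** (2.0 / 3.0)
lam = -a
r = sigma * np.sqrt(2.0 * a)

def p2(x):
    return np.maximum(-(x - mu) ** 2 / (2.0 * sigma ** 2) - lam, 0.0)

x = np.linspace(mu - 2.0 * r, mu + 2.0 * r, 20001)
print("support: [%.3f, %.3f]" % (mu - r, mu + r))
print("total mass:", np.sum(p2(x)) * (x[1] - x[0]))   # ~1.0
```

Outside $[\mu - r, \mu + r]$ the density is exactly zero, in contrast to the Gaussian obtained from continuous softmax with the same quadratic score.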

4. Backpropagation and Efficient Implementation

The continuous sparsemax mapping is piecewise linear and almost everywhere differentiable. For backpropagation, the derivative with respect to the score function is:

$$\frac{\partial p_2(x)}{\partial u(x)} = \begin{cases} 1, & u(x) > \lambda \\ 0, & u(x) < \lambda \end{cases}$$

For parametric $u_\theta(x) = \theta^\top \phi(x)$, the derivative of the context vector $c = \int_S p_2(x)\, V(x)\, dx$ with respect to $\theta$ reduces to integration over the support $\{x \in S : u(x) > \lambda\}$ (Martins et al., 2020, Martins et al., 2020). In practice, normalization and expectation calculations involve root finding (for $\lambda$) and quadrature. With Gaussian basis expansions, closed-form or one-dimensional integrals are available (Martins et al., 2020, Martins et al., 2021).

All required steps are parallelizable and admit $O(N)$ complexity once the domain is discretized into $N$ integration points, or $O(1)$ per analytic RBF.
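A minimal sketch of this computation on a discretized 1D domain (the feature map $\phi$, value function $V$, and grid are illustrative assumptions): differentiating through the normalization constraint gives $\partial\lambda/\partial\theta = \bar\phi$, the average of $\phi$ over the support, so the Jacobian of the context vector is $\partial c/\partial\theta = \int_{\{u > \lambda\}} V(x)\,(\phi(x) - \bar\phi)^\top dx$, which the snippet checks against finite differences.

```python
import numpy as np

# Sketch: gradient of the attention context vector w.r.t. the score parameters theta,
# for u_theta(x) = theta^T phi(x) on a discretized 1D domain.

x = np.linspace(0.0, 1.0, 4001)
dx = x[1] - x[0]
phi = np.stack([x, x ** 2], axis=1)                                    # (N, 2) features
V = np.stack([np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)], axis=1)   # (N, 2) values

def attend(theta):
    u = phi @ theta
    # bisection for the threshold lam (same construction as in Section 1)
    lo, hi = u.min() - 1.0 / (x[-1] - x[0]), u.max()
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if np.sum(np.maximum(u - mid, 0.0)) * dx > 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return np.maximum(u - lam, 0.0), lam

theta = np.array([3.0, -1.0])
p, lam = attend(theta)
c = V.T @ p * dx                         # context vector, shape (2,)

# Analytic Jacobian: dc/dtheta = sum over the support of V(x) (phi(x) - phi_bar)^T dx,
# with phi_bar the average of phi over the support {u(x) > lam}.
supp = p > 0.0
phi_bar = phi[supp].mean(axis=0)
J = (V[supp].T @ (phi[supp] - phi_bar)) * dx

# Finite-difference check
eps = 1e-5
J_fd = np.zeros_like(J)
for j in range(2):
    t = theta.copy()
    t[j] += eps
    p_eps, _ = attend(t)
    J_fd[:, j] = (V.T @ p_eps * dx - c) / eps

print("max abs difference:", np.abs(J - J_fd).max())   # small, up to discretization error
```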

5. Connections to Structured Sparsity and Discrete Mappings

Continuous sparsemax generalizes discrete sparsemax (Martins et al., 2016, Correia et al., 2020), which itself is the Euclidean projection onto the probability simplex:

$$\mathrm{sparsemax}(\mathbf{z}) = \arg\min_{p \in \Delta^K} \|p - \mathbf{z}\|_2^2$$

In the continuous domain, the projection is instead onto the set of normalized, non-negative densities, which yields exactly zero density outside a data-adaptive support.
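For reference, the discrete projection above can be computed exactly with the standard sort-and-threshold routine (Martins et al., 2016); a minimal NumPy sketch:

```python
import numpy as np

# Euclidean projection of a score vector z onto the probability simplex
# via sorting (discrete sparsemax).

def sparsemax(z):
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    # largest k such that 1 + k * z_(k) > sum of the k largest scores
    k_max = k[1.0 + k * z_sorted > cumsum][-1]
    tau = (cumsum[k_max - 1] - 1.0) / k_max
    return np.maximum(z - tau, 0.0)

z = np.array([1.2, 0.9, -0.3, 0.1])
p = sparsemax(z)
print(p, p.sum())   # sparse probability vector summing to 1
```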

Other extensions include top-k sparsemax (restriction to the k largest entries) and SparseMAP for structured latent variable spaces, enabling efficient marginalization with sparse support (Correia et al., 2020).

6. Empirical Properties and Applications

Continuous sparsemax has been empirically shown to concentrate probability mass on compact subdomains, producing sparse, interpretable density maps. In attention mechanisms for text classification, machine translation, and visual question answering, continuous sparsemax yields competitive or improved accuracy over standard softmax:

  • On IMDB document classification ($L \approx 280$), a discrete+continuous sparsemax hybrid achieves 91.18% test accuracy, slightly above discrete softmax's 90.78% (Martins et al., 2020).
  • In IWSLT De→En machine translation, BLEU score rises from 23.92 (discrete softmax) to 24.25 (hybrid).
  • In VQA-v2 with 2D grids, continuous sparsemax yields compact attention ellipses and comparable accuracy (66.1%) to discrete softmax (66.13%) (Martins et al., 2020, Martins et al., 2021).

Additionally, in audio classification and vision, continuous sparsemax identifies sharp temporal/spatial intervals, enhancing interpretability (Martins et al., 2021).

7. Extensions, Complexity, and Control of Sparsity

Continuous sparsemax admits several extensions by varying the underlying regularizer. Fusedmax employs a total variation or Sobolev regularization to further encourage piecewise constant or smooth sparse densities. The unified sparsegen framework introduces mappings (sparsegen-lin, sparsehourglass) that generalize and interpolate between sparsemax, sum-normalization, and softmax, allowing explicit control over the degree of sparsity via additional hyperparameters (Laha et al., 2018). All such mappings preserve the projection-based construction and piecewise-linear gradients, enabling efficient implementation.

In summary, continuous sparsemax provides a principled, computationally efficient means of inducing sparsity in continuous probability distributions, extending the discrete sparsemax and $\alpha$-entmax frameworks to continuous domains, with direct utility for attention mechanisms, latent variable modeling, and interpretable probabilistic inference (Martins et al., 2020, Correia et al., 2020, Martins et al., 2020, Martins et al., 2016, Martins et al., 2021, Laha et al., 2018).
