Continuous Sparsemax Overview
- Continuous sparsemax is a probability mapping method that projects continuous score functions onto valid density spaces, inducing sparsity in the output distribution.
- It employs a regularized variational formulation with closed-form solutions, enhancing computational efficiency and interpretability in applications like attention mechanisms and latent variable inference.
- Empirical results show that continuous sparsemax yields compact, data-adaptive supports and competitive performance in tasks such as text classification, machine translation, and visual question answering.
Continuous sparsemax is a family of probability mapping functions that extend the sparsemax operator from the discrete (finite) to the continuous domain. As with discrete sparsemax, continuous sparsemax produces sparse probability distributions—support is restricted to a region in the continuous input space—by projecting a score function onto the set of valid probability densities. This approach brings computational, statistical, and interpretability advantages in machine learning models employing attention, latent variable inference, or multi-label prediction, especially in settings with continuous or large structured domains (Martins et al., 2020, Martins et al., 2021).
1. Theoretical Formulation of Continuous Sparsemax
Continuous sparsemax is obtained as the solution to a regularized variational problem over densities defined on a continuous domain $S \subseteq \mathbb{R}^N$. Let $f: S \to \mathbb{R}$ be a score function (potential); continuous sparsemax then solves:

$$\hat{p} = \operatorname*{arg\,max}_{p \in \Delta(S)} \int_S p(t)\, f(t)\, dt \;-\; \frac{1}{2}\int_S p(t)^2\, dt,$$

where $\Delta(S)$ denotes the set of probability densities supported on $S$. The Karush-Kuhn-Tucker conditions yield an explicit solution:

$$\hat{p}(t) = \big[f(t) - \tau\big]_+,$$

where $[\cdot]_+ = \max\{\cdot, 0\}$ denotes the positive part, and the normalization constant (threshold) $\tau$ is chosen so that $\hat{p}$ integrates to 1:

$$\int_S \big[f(t) - \tau\big]_+\, dt = 1.$$
This is a direct continuous analogue of the discrete sparsemax projection onto the simplex (Martins et al., 2016, Martins et al., 2020).
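The thresholding form above lends itself to a direct numerical implementation. The following sketch is illustrative only (it assumes a uniform 1D grid and standard NumPy/SciPy; the helper name `continuous_sparsemax_1d` is introduced here, not taken from the cited papers) and finds the threshold $\tau$ by root finding:

```python
import numpy as np
from scipy.optimize import brentq

def continuous_sparsemax_1d(score_fn, grid):
    """Evaluate continuous sparsemax of a score function on a uniform 1D grid:
    p(t) = [score(t) - tau]_+ with tau chosen so that p integrates to 1."""
    dt = grid[1] - grid[0]
    scores = score_fn(grid)
    length = dt * len(grid)                     # measure of the discretized domain

    def excess_mass(tau):
        # total mass of the truncated density minus 1 (strictly decreasing in tau)
        return np.sum(np.maximum(scores - tau, 0.0)) * dt - 1.0

    # Bracket the root: mass >= 1 at the lower end, mass = 0 at the upper end.
    tau = brentq(excess_mass, scores.min() - 1.0 / length, scores.max())
    return np.maximum(scores - tau, 0.0), tau

# Example: a quadratic (Gaussian-like) score yields a truncated parabola.
grid = np.linspace(-5.0, 5.0, 2001)
density, tau = continuous_sparsemax_1d(lambda t: -0.5 * t**2, grid)
print(tau, density.sum() * (grid[1] - grid[0]))  # threshold and total mass (~1)
```

The density is exactly zero wherever the score falls below the threshold, which is the source of the sparsity.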
2. Relation to Deformed Exponential Families and Fenchel-Young Losses
Continuous sparsemax is a special case of the general $\Omega$-regularized prediction map within the Fenchel-Young loss framework (Martins et al., 2021). When the negentropy $\Omega$ is taken as the Tsallis-2 (quadratic) negentropy, the unique maximizer yields continuous sparsemax. More generally, continuous $\alpha$-entmax with parameter $\alpha = 2$ specializes to continuous sparsemax:

$$\hat{p}_{\Omega}(f) = \operatorname*{arg\,max}_{p \in \Delta(S)} \int_S p(t)\, f(t)\, dt - \Omega(p), \qquad \Omega(p) = \frac{1}{2}\int_S p(t)^2\, dt.$$
Within this framework, the Fenchel-Young loss for a model score $f_\theta$ and an empirical density $q$ is:

$$L_{\Omega}(f_\theta; q) = \Omega^*(f_\theta) + \Omega(q) - \int_S f_\theta(t)\, q(t)\, dt,$$

where $\Omega^*$ denotes the convex conjugate of $\Omega$.
This construction yields the moment-matching property for models parameterized linearly in features, and includes as special cases the maximum-entropy principle (softmax, $\alpha = 1$) and sparsemax ($\alpha = 2$), in both finite and continuous settings (Martins et al., 2021).
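To make the loss concrete, here is a small grid-based sketch (again illustrative, under the same discretization assumptions, reusing `continuous_sparsemax_1d` from Section 1); the loss vanishes exactly when $q$ equals the continuous sparsemax of $f_\theta$:

```python
import numpy as np

def fenchel_young_loss_tsallis2(score_fn, q, grid):
    """Grid approximation of L(f; q) = Omega*(f) + Omega(q) - <f, q>,
    with Omega(p) = 0.5 * integral of p^2 (Tsallis-2 negentropy up to a constant),
    whose regularized prediction map is continuous sparsemax."""
    dt = grid[1] - grid[0]
    f = score_fn(grid)
    inner = lambda a, b: np.sum(a * b) * dt
    omega = lambda p: 0.5 * inner(p, p)
    p_hat, _ = continuous_sparsemax_1d(score_fn, grid)   # maximizer of <p, f> - Omega(p)
    omega_star = inner(p_hat, f) - omega(p_hat)           # conjugate Omega*(f)
    return omega_star + omega(q) - inner(f, q)

grid = np.linspace(-5.0, 5.0, 2001)
p_hat, _ = continuous_sparsemax_1d(lambda t: -0.5 * t**2, grid)
print(fenchel_young_loss_tsallis2(lambda t: -0.5 * t**2, p_hat, grid))  # ~0 at the optimum
```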
3. Closed-Form Solutions: Quadratic Scores and β-Gaussians
For quadratic score functions $f(t) = -\frac{1}{2}(t - \mu)^\top \Sigma^{-1}(t - \mu)$, the solution is a truncated paraboloid ("Epanechnikov kernel" in 1D; biweight, triweight, etc., in higher dimensions):

$$\hat{p}(t) = \Big[-\tfrac{1}{2}(t - \mu)^\top \Sigma^{-1}(t - \mu) - \tau\Big]_+.$$
The support is an ellipsoid centered at $\mu$. The normalization constant $\tau$ and the support radius are obtained by solving the normalization constraint; in the $N$-dimensional case, the normalization involves evaluating truncated moment integrals (Martins et al., 2021).
These densities constitute the "β-Gaussian" family (with $\beta = 2$), a special case of elliptical distributions with bounded support.
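In 1D the normalization constraint can be solved in closed form: for $f(t) = -(t-\mu)^2/(2\sigma^2)$, the support radius is $r = (3\sigma^2/2)^{1/3}$ and the threshold is $\tau = -r^2/(2\sigma^2)$. A short sketch of the resulting truncated-parabola (Epanechnikov-type) density, with the helper name `beta_gaussian_1d` chosen here for illustration:

```python
import numpy as np

def beta_gaussian_1d(t, mu=0.0, sigma2=1.0):
    """Closed-form continuous sparsemax of the quadratic score
    f(t) = -(t - mu)^2 / (2 * sigma2): a truncated parabola with compact support."""
    r = (1.5 * sigma2) ** (1.0 / 3.0)      # support radius from the normalization constraint
    tau = -r**2 / (2.0 * sigma2)           # threshold value attained at the support boundary
    return np.maximum(-(t - mu) ** 2 / (2.0 * sigma2) - tau, 0.0)

t = np.linspace(-3.0, 3.0, 1201)
p = beta_gaussian_1d(t)
print(p.sum() * (t[1] - t[0]))             # ~1; p vanishes outside [mu - r, mu + r]
```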
4. Backpropagation and Efficient Implementation
The continuous sparsemax mapping is piecewise linear in the score function and differentiable almost everywhere. For backpropagation, the directional derivative with respect to a perturbation $g$ of the score function is:

$$\mathrm{d}\hat{p}(t)[g] = \mathbb{1}[t \in S^*]\left(g(t) - \frac{1}{|S^*|}\int_{S^*} g(t')\, dt'\right),$$

where $S^* = \operatorname{supp}(\hat{p})$ and $|S^*|$ denotes its measure; this is the direct continuous analogue of the discrete sparsemax Jacobian.
For a parametric score $f_\theta$, the derivative of the context vector $c = \int_S \hat{p}(t)\, V(t)\, dt$ (for a value function $V$) with respect to $\theta$ reduces to an integration over the support (Martins et al., 2020). In practice, normalization and expectation calculations involve root finding (for the threshold $\tau$) and quadrature. With Gaussian basis expansions, closed-form or one-dimensional integrals are available (Martins et al., 2020, Martins et al., 2021).
All required steps are parallelizable and admit $O(1)$ complexity per integration point once the domain is discretized, or per analytic RBF basis function.
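The directional derivative above can be checked numerically. The sketch below is illustrative only (it reuses `continuous_sparsemax_1d` from Section 1 and an assumed feature map $V(t) = (t, t^2)$ playing the role of the values): it computes the Jacobian-vector product for a context vector and compares it against central finite differences.

```python
import numpy as np

def sparsemax_jvp(p, g):
    """Directional derivative of continuous sparsemax along a score perturbation g:
    dp(t) = 1[t in supp(p)] * (g(t) - mean of g over the support)."""
    support = p > 0.0
    g_mean = g[support].sum() / support.sum()     # mean of g over the support
    return support * (g - g_mean)

grid = np.linspace(-5.0, 5.0, 20001)
dt = grid[1] - grid[0]
f = lambda t: -0.5 * t**2                         # quadratic score
g = np.sin(grid)                                  # perturbation direction
V = np.stack([grid, grid**2])                     # assumed value/feature map

p, _ = continuous_sparsemax_1d(f, grid)
dc_analytic = (sparsemax_jvp(p, g) * V).sum(axis=1) * dt

eps = 1e-4
p_plus, _ = continuous_sparsemax_1d(lambda t: f(t) + eps * np.sin(t), grid)
p_minus, _ = continuous_sparsemax_1d(lambda t: f(t) - eps * np.sin(t), grid)
dc_numeric = ((p_plus - p_minus) / (2 * eps) * V).sum(axis=1) * dt
print(dc_analytic, dc_numeric)                    # agree up to grid/boundary discretization error
```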
5. Connections to Structured Sparsity and Discrete Mappings
Continuous sparsemax generalizes discrete sparsemax (Martins et al., 2016, Correia et al., 2020), which itself is the Euclidean projection onto the probability simplex:

$$\operatorname{sparsemax}(z) = \operatorname*{arg\,min}_{p \in \Delta^d} \|p - z\|_2^2.$$
In the continuous domain, the projection is instead onto the space of normalized, non-negative densities; this yields exactly zero density outside a data-adaptive support.
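For reference, the discrete projection admits an exact sort-and-threshold algorithm (Martins et al., 2016). A compact sketch, with the helper name `sparsemax` chosen here:

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of a score vector z onto the probability simplex,
    computed with the standard sort-and-threshold rule."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = k * z_sorted > cumsum - 1.0          # largest k with 1 + k*z_(k) > sum of top-k
    k_star = k[support][-1]
    tau = (cumsum[support][-1] - 1.0) / k_star     # threshold shared by the support
    return np.maximum(z - tau, 0.0)

print(sparsemax(np.array([2.0, 1.5, -0.3])))        # [0.75 0.25 0.  ]: mass only on the top two
```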
Other extensions include top-k sparsemax (restriction to the k largest entries) and SparseMAP for structured latent variable spaces, enabling efficient marginalization with sparse support (Correia et al., 2020).
6. Empirical Properties and Applications
Continuous sparsemax has been empirically shown to concentrate probability mass on compact subdomains, producing sparse, interpretable density maps. In attention mechanisms for text classification, machine translation, and visual question answering, continuous sparsemax yields competitive or improved accuracy over standard softmax:
- On IMDB document classification, a discrete + continuous sparsemax hybrid achieves 91.18% test accuracy, slightly above discrete softmax's 90.78% (Martins et al., 2020).
- In IWSLT De→En machine translation, BLEU score rises from 23.92 (discrete softmax) to 24.25 (hybrid).
- In VQA-v2 with 2D grids, continuous sparsemax yields compact attention ellipses and comparable accuracy (66.1%) to discrete softmax (66.13%) (Martins et al., 2020, Martins et al., 2021).
Additionally, in audio classification and vision, continuous sparsemax identifies sharp temporal/spatial intervals, enhancing interpretability (Martins et al., 2021).
7. Extensions, Complexity, and Control of Sparsity
Continuous sparsemax admits several extensions by varying the underlying regularizer. Fusedmax employs a total variation or Sobolev regularization to further encourage piecewise constant or smooth sparse densities. The unified sparsegen framework introduces mappings (sparsegen-lin, sparsehourglass) that generalize and interpolate between sparsemax, sum-normalization, and softmax, allowing explicit control over the degree of sparsity via additional hyperparameters (Laha et al., 2018). All such mappings preserve the projection-based construction and piecewise-linear gradients, enabling efficient implementation.
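As a concrete illustration of sparsity control, here is a sketch under the assumption that sparsegen-lin reduces to a rescaled sparsemax projection, sparsegen-lin(z; λ) = sparsemax(z / (1 − λ)) for λ < 1 (per Laha et al., 2018); it reuses the discrete `sparsemax` helper from Section 5.

```python
import numpy as np

def sparsegen_lin(z, lam):
    """Sparsegen with a linear generating function, assuming the reduction to a
    rescaled sparsemax projection; larger lam yields sparser, more peaked outputs."""
    assert lam < 1.0
    return sparsemax(z / (1.0 - lam))

z = np.array([2.0, 1.5, -0.3])
for lam in (-1.0, 0.0, 0.9):
    print(lam, sparsegen_lin(z, lam))   # output becomes sparser/more peaked as lam increases
```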
In summary, continuous sparsemax provides a principled, computationally efficient means of inducing sparsity in continuous probability distributions, extending the discrete sparsemax and $\alpha$-entmax frameworks to continuous domains, with direct utility for attention mechanisms, latent variable modeling, and interpretable probabilistic inference (Martins et al., 2016, Martins et al., 2020, Correia et al., 2020, Martins et al., 2021, Laha et al., 2018).