Evidential Softmax (ev-softmax)

Updated 11 June 2026

Ev-softmax is a normalization function that produces sparse, multimodal probability distributions by assigning zero probability to entries whose logits fall below the arithmetic mean.
It features two variants—direct sparse normalization and a Dirichlet–categorical construction—designed to mitigate overconfidence and codebook collapse in discrete latent models.
Empirical studies highlight its efficacy in deep generative, attention-based, and semi-supervised models through efficient gradient backpropagation and closed-form updates.

Evidential Softmax (ev-softmax) refers to a class of normalization functions for neural network outputs that combine explicit modeling of epistemic uncertainty with properties of sparsity and multimodality. Unlike conventional softmax or sparsemax, ev-softmax is designed to (1) produce probability vectors that are both sparse and support multiple modes, (2) provide a tractable, closed-form backpropagation mechanism compatible with standard log-likelihood or KL-divergence losses, and (3) mitigate overconfidence and codebook collapse in discrete latent models such as VAEs and vector quantized architectures. There exist two principal lines of ev-softmax: one operating directly as a sparse normalization function for logits (Chen et al., 2021), and another, distributional, that arises from the Dirichlet–categorical construction for uncertainty-calibrated discrete representations (Baykal et al., 2023).

1. Mathematical Formulation and Variants

Two mathematically distinct but related formulations are presented in the literature:

1.1 Direct Sparse Normalization

Given input logits $v = (v_1, ..., v_K) \in \mathbb{R}^K$, define the arithmetic mean $\bar v = \frac{1}{K} \sum_{i=1}^K v_i$. Ev-softmax assigns nonzero probability only to entries at or above the mean:

$EvSoftmax(v)_k = \frac{1\{v_k \geq \bar v\} \exp(v_k)}{\sum_{j=1}^K 1\{v_j \geq \bar v\} \exp(v_j)}, \quad k=1,...,K,$

where $1\{\cdot\}$ is the indicator function. This induces exact zeros for sub-mean logits while preserving exponential weighting among the active set (Chen et al., 2021). To facilitate training with standard probabilistic losses, the "full-support" relaxation introduces a small $\epsilon > 0$:

$EvSoftmax_{\mathrm{train},\epsilon}(v)_k = \frac{(1\{v_k \geq \bar v\} + \epsilon) \exp(v_k)} {\sum_{j=1}^K (1\{v_j \geq \bar v\} + \epsilon) \exp(v_j)}$
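The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' reference implementation: the function name `ev_softmax`, the `eps` argument, and the max-subtraction step (a common stabilization that leaves the ratios unchanged) are assumptions made here.

```python
import math

def ev_softmax(logits, eps=0.0):
    """Ev-softmax over a list of logits; eps > 0 gives the training relaxation."""
    mean = sum(logits) / len(logits)
    # Subtracting the max does not change the normalized result
    # and keeps exp() from overflowing (an added stabilization).
    shift = max(logits)
    weights = [
        ((1.0 if v >= mean else 0.0) + eps) * math.exp(v - shift)
        for v in logits
    ]
    total = sum(weights)  # > 0 because the max logit is never below the mean
    return [w / total for w in weights]

logits = [2.0, 1.9, -1.0, -3.0]           # mean is -0.025
probs = ev_softmax(logits)                # last two entries are exactly zero
relaxed = ev_softmax(logits, eps=1e-6)    # every entry is strictly positive
```

With these logits the two above-mean entries share all of the probability mass in the hard version, while the relaxed version gives the sub-mean entries a tiny but nonzero weight.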

1.2 Evidential Dirichlet–Categorical Construction

Given encoder outputs $l \in \mathbb{R}^K$ (interpreted as evidence), compute Dirichlet concentration parameters:

$\alpha = \exp(l) + 1$

The mean class probabilities under $\mathrm{Dir}(\alpha)$ are

$p_i = \mathbb{E}[\pi_i] = \frac{\alpha_i}{\sum_j \alpha_j}$

This "ev-softmax" replaces the conventional softmax but additionally attaches a Dirichlet KL regularizer, penalizing deviations from a uniform prior and thus discouraging overconfident, spiky assignments (Baykal et al., 2023).
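As a small worked example of the construction above, the following sketch computes the concentrations and the Dirichlet mean for a toy evidence vector. The function name `dirichlet_mean` and the input values are illustrative assumptions.

```python
import math

def dirichlet_mean(evidence_logits):
    """Return (alpha, p) with alpha = exp(l) + 1 and p = alpha / sum(alpha)."""
    alphas = [math.exp(l) + 1.0 for l in evidence_logits]
    alpha0 = sum(alphas)
    return alphas, [a / alpha0 for a in alphas]

# Toy evidence: exp(l) = (1, 1, 7), so alpha = (2, 2, 8).
alphas, probs = dirichlet_mean([0.0, 0.0, math.log(7.0)])
# probs is (1/6, 1/6, 2/3); a plain softmax of the same logits
# would give (1/9, 1/9, 7/9), so the +1 offset flattens the result.
```

The `+ 1` in the concentration acts like one pseudo-count per class, which is why the mean probabilities are less peaked than a softmax over the same logits.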

2. Properties and Theoretical Characteristics

2.1 Sparsity and Multimodality

Ev-softmax explicitly zeros out every entry whose logit falls below the mean, so at least one entry is zeroed whenever the logits are not all equal. Unlike sparsemax and entmax, which may collapse multimodal supports into a single dominant mode, ev-softmax preserves all above-mean modes, ensuring interpretability and propagation of multiple hypotheses (Chen et al., 2021).

2.2 Differentiability and Gradients

On the active support (indices $k$ with $v_k \geq \bar v$), the gradient of ev-softmax mirrors that of softmax restricted to the subset:

$\frac{\partial EvSoftmax(v)_i}{\partial v_j} = EvSoftmax(v)_i \left(\delta_{ij} - EvSoftmax(v)_j\right)$

where $\delta_{ij}$ is the Kronecker delta. Under the $\epsilon$-relaxation, gradients are defined everywhere, so ev-softmax can be used with standard backpropagation-based training (Chen et al., 2021).
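The restricted-softmax gradient described above can be checked numerically. The sketch below assumes the Jacobian takes the softmax form $p_i(\delta_{ij} - p_j)$ with $p = EvSoftmax(v)$, and compares it against central finite differences at a point where no logit sits on the mean threshold. All function names are illustrative.

```python
import math

def ev_softmax(logits):
    mean = sum(logits) / len(logits)
    shift = max(logits)  # stabilization only; does not change the output
    w = [math.exp(v - shift) if v >= mean else 0.0 for v in logits]
    total = sum(w)
    return [x / total for x in w]

def analytic_jacobian(logits):
    """Assumed form: J[i][j] = p_i * (delta_ij - p_j)."""
    p = ev_softmax(logits)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

def numeric_jacobian(logits, h=1e-6):
    """Central differences; valid while the perturbation keeps the support fixed."""
    n = len(logits)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        up, down = list(logits), list(logits)
        up[j] += h
        down[j] -= h
        p_up, p_down = ev_softmax(up), ev_softmax(down)
        for i in range(n):
            jac[i][j] = (p_up[i] - p_down[i]) / (2.0 * h)
    return jac

v = [2.0, 1.9, -1.0, -3.0]
J_analytic = analytic_jacobian(v)
J_numeric = numeric_jacobian(v)
max_gap = max(abs(J_analytic[i][j] - J_numeric[i][j])
              for i in range(4) for j in range(4))
```

Rows and columns belonging to zeroed entries come out as zero in both versions, since a sub-mean logit only moves the threshold and the indicator is piecewise constant away from it.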

2.3 Uncertainty and Regularization

The Dirichlet–categorical variant incorporates a KL penalty:

$\mathrm{KL}\left(\mathrm{Dir}(\alpha) \,\|\, \mathrm{Dir}(\mathbf{1})\right)$

where $\mathrm{Dir}(\mathbf{1})$ is the uniform Dirichlet. This attracts the concentration vector $\alpha$ toward the uniform Dirichlet when the encoder provides weak evidence, penalizing overconfidence and reducing the risk of degeneracy in the representation (codebook collapse) (Baykal et al., 2023).
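The KL divergence from a Dirichlet to the uniform Dirichlet has a standard closed form, $\ln\Gamma(\alpha_0) - \sum_k \ln\Gamma(\alpha_k) - \ln\Gamma(K) + \sum_k (\alpha_k - 1)(\psi(\alpha_k) - \psi(\alpha_0))$ with $\alpha_0 = \sum_k \alpha_k$. The sketch below evaluates it in plain Python. The hand-rolled `digamma` (recurrence plus asymptotic series) is an assumption made to avoid external dependencies; a library routine would normally be used instead.

```python
import math

def digamma(x):
    """Digamma for x > 0: shift x up to >= 6, then use an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x      # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def kl_to_uniform_dirichlet(alphas):
    """KL( Dir(alpha) || Dir(1) ) in nats."""
    k = len(alphas)
    alpha0 = sum(alphas)
    log_norm = (math.lgamma(alpha0)
                - sum(math.lgamma(a) for a in alphas)
                - math.lgamma(k))
    expect = sum((a - 1.0) * (digamma(a) - digamma(alpha0)) for a in alphas)
    return log_norm + expect

kl_uniform = kl_to_uniform_dirichlet([1.0, 1.0, 1.0])   # no evidence
kl_mild = kl_to_uniform_dirichlet([2.0, 1.0, 1.0])      # weak evidence
kl_sharp = kl_to_uniform_dirichlet([50.0, 1.0, 1.0])    # strong evidence
```

The penalty is zero when every concentration equals one and grows as evidence concentrates on a single class, which is the behavior the regularizer relies on.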

3. Training Methodologies and Implementation Considerations

3.1 Algorithmic Workflow

A typical training loop for evidential VAEs maps encoder logits to Dirichlet concentrations, draws a relaxed categorical sample for the decoder, and optimizes the reconstruction loss together with the weighted Dirichlet KL regularizer. Key details include clamping logits before exponentiation, annealing the Gumbel–Softmax temperature and regularization weight, and using the relaxed categorical in training for differentiability. Standard log-likelihood and KL losses are directly applicable when using the continuous relaxation of the direct ev-softmax (Baykal et al., 2023, Chen et al., 2021).
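The key details listed above can be sketched as a forward pass through an evidential latent layer. This is a hedged illustration, not the published algorithm: the function names, the linear annealing schedule, the clamp bound of 10, the temperature range, and the choice to apply Gumbel-Softmax to a Dirichlet draw are all assumptions, and the encoder, decoder, and gradient step are omitted.

```python
import math
import random

def anneal(start, end, step, total_steps):
    """Linear schedule from start to end (an assumed, illustrative choice)."""
    frac = min(1.0, step / total_steps)
    return start + frac * (end - start)

def evidential_latent_sample(logits, temperature, rng, bound=10.0):
    # Clamp encoder logits before exponentiation, then form concentrations.
    alphas = [math.exp(max(-bound, min(bound, l))) + 1.0 for l in logits]
    # Draw pi ~ Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws.
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    pi = [max(g / total, 1e-30) for g in gammas]
    # Gumbel-Softmax: perturb log pi with Gumbel noise, then tempered softmax.
    scores = []
    for p in pi:
        u = max(rng.random(), 1e-12)
        scores.append((math.log(p) - math.log(-math.log(u))) / temperature)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return alphas, [e / norm for e in exps]

rng = random.Random(0)
for step in (0, 500, 1000):
    tau = anneal(1.0, 0.5, step, 1000)    # temperature decays over training
    beta = anneal(0.0, 1.0, step, 1000)   # KL weight ramps up over training
    alphas, z = evidential_latent_sample([1.5, -0.3, 0.2, 4.0], tau, rng)
    # A full step would decode z and minimize:
    #   reconstruction_loss + beta * KL(Dir(alphas) || Dir(1))
```

The relaxed sample `z` stays on the probability simplex, so a decoder can consume it as a soft codebook assignment while gradients flow back to the encoder.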

3.2 Practical Recommendations

Use a small $\epsilon > 0$ for the continuous relaxation of direct ev-softmax during training.
Apply layer normalization or calibration to input logits to control thresholding behavior imposed by mean-subtraction.
Employ the Adam optimizer, moderate batch sizes, and numerically robust codebook parameterizations, as in standard discrete VAE architectures (Baykal et al., 2023).
Clamp logit values pre-exponentiation to prevent overflow.
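The clamping recommendation can be illustrated with the Dirichlet concentration $\alpha = \exp(l) + 1$. In the sketch below, the bound of 20 and the helper name `safe_concentration` are hypothetical choices for illustration only.

```python
import math

CLAMP_BOUND = 20.0  # hypothetical bound; tune for the model at hand

def safe_concentration(logits, bound=CLAMP_BOUND):
    """Clamp each logit to [-bound, bound] before computing exp(l) + 1."""
    return [math.exp(max(-bound, min(bound, l))) + 1.0 for l in logits]

# Without clamping, a large logit overflows a double-precision exp().
try:
    math.exp(1000.0)
    overflowed = False
except OverflowError:
    overflowed = True

alphas = safe_concentration([1000.0, 0.0, -1000.0])  # all values stay finite
```

Clamping trades a small bias on extreme logits for finite concentrations and finite KL terms.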

4. Comparison to Other Normalization Functions

Normalizer	Support	Sparsity	Multimodality	Special Loss Required
Softmax	Full	No	Yes	No
Sparsemax	Subset	Yes	No (collapse)	Yes (hinge/Poisson)
Entmax ($\alpha$)	Interpolated	Yes	No (collapse)	Yes ($\alpha$-entmax loss)
Ev-softmax	Subset	Yes	Yes	No (with $\epsilon$-relaxation)

Ev-softmax uniquely combines support for multimodality and strict sparsity with standard log-likelihood/KL compatibility when using its continuous relaxation. Empirically, it reduces dimensionality of the distribution while maintaining high distributional accuracy and balancing focus and context in attention mechanisms (Chen et al., 2021).

5. Empirical Performance and Use Cases

5.1 Deep Generative Models

Conditional VAE on MNIST: Ev-softmax learned exactly five nonzero modes per class (matching the even/odd structure), outperforming softmax, sparsemax, and entmax in terms of Wasserstein distance to the true prior.
VQ-VAE + PixelCNN on tinyImageNet: Ev-softmax achieved the highest top-5/top-10 accuracy with 85% sparsity (using ~77 out of 512 codes on average), surpassing other sparse normalization strategies (Chen et al., 2021).

5.2 Attention and Sequence Models

Transformer NMT (IWSLT’14 EN→DE): With ev-softmax self-attention, models achieved the highest BLEU (29.4), best ROUGE/METEOR, and a balanced attention focus, attending to roughly 8 source words on average (vs. all for softmax, 2 for sparsemax, 4 for entmax) (Chen et al., 2021).

5.3 Evidential Discrete Representation Learning

EdVAE discrete VAE: The evidential layer replaces softmax, prevents codebook collapse, improves reconstruction, and enhances codebook usage compared to dVAE and VQ-VAE. The KL regularizer encourages the network away from overconfident assignments, leading to richer latent usage and more robust representations (Baykal et al., 2023).

5.4 Semi-supervised Learning

Semi-supervised VAE on MNIST: 97.3% classification accuracy with only 1.64 average active classes per prediction (84% sparsity). Competing sparse normalization schemes (e.g., sparsemax/entmax) had reduced accuracy or mode collapse (Chen et al., 2021).
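The sparsity percentages quoted in this section follow from the reported active counts if sparsity is read as one minus the active fraction. That reading, the helper name `sparsity`, and the use of 10 classes for MNIST are assumptions of this sketch.

```python
def sparsity(active, total):
    """Fraction of entries that receive zero probability."""
    return 1.0 - active / total

vqvae_sparsity = sparsity(77, 512)    # about 0.85 (VQ-VAE codebook usage)
mnist_sparsity = sparsity(1.64, 10)   # about 0.84 (semi-supervised MNIST)
```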

6. Limitations, Practical Considerations, and Extensions

While ev-softmax is translation-invariant, and its support is unchanged by positive rescaling of the logits because thresholding is relative to the mean, the probabilities on that support still depend on logit scale, so calibration of logits remains necessary to control thresholding effects. Exact zeros (and the induced kinks at $v_k = \bar v$) are a natural consequence of the hard support, though empirical results indicate these are not a barrier to convergence. The KL regularizer in evidential constructions (Dirichlet–categorical) must be carefully scheduled (e.g., by ramping up its weight slowly), and numerical stability precautions (clamping, exponentiation limits) are essential. A plausible implication is that further hybridizations, combining evidential uncertainty with task-specific structured sparsity, may yield new regimes of interpretable and robust discrete modeling in dense and sequence architectures (Baykal et al., 2023, Chen et al., 2021).
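The invariance behavior of the direct variant can be checked on a toy vector: shifting the logits leaves the output unchanged, while positive rescaling keeps the same support but changes the probabilities on it. The helper names and logit values below are illustrative assumptions.

```python
import math

def ev_softmax(logits):
    mean = sum(logits) / len(logits)
    shift = max(logits)  # stabilization only; does not change the output
    w = [math.exp(v - shift) if v >= mean else 0.0 for v in logits]
    total = sum(w)
    return [x / total for x in w]

def support(probs):
    """Indices that receive nonzero probability."""
    return [i for i, p in enumerate(probs) if p > 0.0]

base = [2.0, 1.0, -0.5, -1.5]             # mean 0.25, support {0, 1}
shifted = [v + 5.0 for v in base]         # translation by a constant
scaled = [3.0 * v for v in base]          # positive rescaling

p_base = ev_softmax(base)
p_shifted = ev_softmax(shifted)
p_scaled = ev_softmax(scaled)
# Same support in all three cases; p_scaled is sharper than p_base.
```

This is why normalizing or calibrating logits affects how peaked the output is even when the set of surviving entries stays the same.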


References

Chen et al. (2021). Evidential Softmax for Sparse Multimodal Distributions in Deep Generative Models.

Baykal et al. (2023). EdVAE: Mitigating Codebook Collapse with Evidential Discrete Variational Autoencoders.
