Modified Softmax Functions
- Modified softmax functions are advanced activation alternatives that address limitations of the canonical softmax, including gradient vanishing, label noise, and lack of sparsity.
- They employ innovations such as SA-Softmax for gradient stabilization, MGF-softmax for efficient computation, and geometric clustering to boost adversarial robustness.
- Empirical results show up to 6–7× improved resistance to one-pixel attacks and improved convergence across diverse architectures, from transformers to multimodal networks.
Modified softmax functions constitute a rapidly expanding class of alternatives to the canonical softmax activation. These modifications address diverse deficiencies of the standard function, including gradient vanishing, sparsity control, robustness to label noise, computational cost, and statistical learning rates. Contemporary research has produced a wide spectrum of variants, ranging from theoretically motivated geometric reinterpretations to engineering-driven numerical approximations and sparsification strategies. Below, the principal directions and technical features of modified softmax functions are surveyed.
1. Geometric and Clustering Perspectives
A rigorous formal equivalence exists between softmax-based classifiers and k-means clustering in the transformed feature space. Specifically, given any feedforward neural network with a final linear layer and penultimate mapping $f$, the softmax classifier partitions the $f$-space into Voronoi cells with equidistant centroids $\mu_1, \dots, \mu_K$ and performs nearest-centroid decoding, $\hat{y} = \arg\min_i \|f(x) - \mu_i\|$. Under this interpretation, the standard softmax loss encourages “cone-based” decisions, disregarding within-cone distances. The “Gauss” (or centroid-based) alternative replaces the softmax with a distribution that decays explicitly with the squared Euclidean distance to a learned centroid per class, $p_i(x) \propto \exp(-\|f(x) - \mu_i\|^2)$. This replacement enforces a tighter mapping of samples to their respective centroids, raising the robustness threshold for adversarial perturbations through direct control over the Lipschitz continuity of the penultimate mapping and increased inter-centroid distances. Empirically, the “Gauss” activation delivers 6–7× improved resistance to one-pixel attacks over vanilla softmax, with comparable clean accuracy and better confidence calibration (Hess et al., 2020).
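As a concrete illustration, the following minimal sketch (Python/NumPy, with an illustrative feature vector and centroid matrix; it is not the authors' reference implementation) contrasts the standard softmax over logits with a centroid-based distribution that decays with squared Euclidean distance:

```python
import numpy as np

def softmax(z):
    """Standard softmax with max-subtraction for numerical stability."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def gauss_activation(f_x, centroids):
    """Centroid-based alternative: class probability decays with the squared
    Euclidean distance between the penultimate features f(x) and a learned
    per-class centroid mu_k, i.e. p_k proportional to exp(-||f(x) - mu_k||^2)."""
    sq_dist = ((centroids - f_x) ** 2).sum(axis=1)   # ||f(x) - mu_k||^2 for each class k
    return softmax(-sq_dist)

# Toy example: 3 classes with 2-D penultimate features (illustrative values only).
centroids = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
f_x = np.array([0.4, 0.3])
print(gauss_activation(f_x, centroids))  # mass concentrates on the nearest centroid
```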
2. Gradient-Stabilized Softmax Modifications
Standard softmax exhibits severe gradient vanishing when input elements attain extreme values, since the output probabilities saturate toward 1 or 0. Self-Adjust Softmax (SA-Softmax) mitigates this by introducing a multiplicative self-modulating term,

$$\mathrm{SA\text{-}softmax}(z)_i = z_i \cdot \mathrm{softmax}(z)_i,$$

and a normalized variant,

$$\mathrm{SA\text{-}softmax}_{\mathrm{norm}}(z)_i = \frac{z_i - \min_j z_j}{\max_j z_j - \min_j z_j} \cdot \mathrm{softmax}(z)_i.$$

The Jacobian of SA-Softmax contains an additive diagonal term $\mathrm{diag}(\mathrm{softmax}(z))$ on top of the usual softmax Jacobian, boosting the smallest singular value and preventing gradient collapse even in saturated regimes. Integrated into transformer attention mechanisms, SA-Softmax yields consistent perplexity reductions (1–3%) and nontrivial accuracy and BLEU improvements across classification and translation tasks (up to 6.3 pp and 1.0 BLEU, respectively). The normalized variant tends to stabilize training further (Zheng et al., 25 Feb 2025).
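A minimal sketch of the two forms written above, in plain NumPy (function and variable names are illustrative and not taken from the paper's code):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sa_softmax(z):
    """Self-adjust softmax: each probability is modulated by its own logit,
    z_i * softmax(z)_i, which keeps a nonzero gradient path when softmax saturates."""
    return z * softmax(z)

def sa_softmax_normalized(z):
    """Normalized variant: rescale logits to [0, 1] via min-max normalization
    before the modulation so the output stays bounded."""
    z_min = z.min(axis=-1, keepdims=True)
    z_max = z.max(axis=-1, keepdims=True)
    scale = (z - z_min) / (z_max - z_min + 1e-12)
    return scale * softmax(z)

scores = np.array([[8.0, 1.0, -3.0, 0.5]])  # e.g., one row of attention scores
print(sa_softmax(scores))
print(sa_softmax_normalized(scores))
```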
3. Structural and Computational Modifications for Efficiency
Efficient realization of softmax in privacy-preserving computation, such as homomorphic encryption (HE), motivates approximations that avoid max and division operations. MGF-softmax replaces the partition function with a moment-generating-function (MGF) based exponential shift of the logits. This reduces the multiplicative depth of the resulting circuit relative to the baseline softmax, eliminates bootstrapping for typical input sizes, and matches the accuracy of plaintext softmax within 1%, as validated on LLaMA-3.2B (NLU) and ViT/DeiT (vision) (Park et al., 2 Feb 2026). Other variants in the MPC/HE literature, such as ReLU-based output normalization, trade off accuracy, offer only modest speed-ups outside of shallow networks, and are not recommended for multi-layer models (Keller et al., 2020).
The online normalizer approach fuses the maximum-finding and sum-accumulation passes of the standard “safe” softmax into a single pass via running recurrences, reducing memory accesses per input element (from four to three in the safe-softmax baseline) and yielding measurable speed-ups for large-batch computations on GPUs. When paired with TopK selection, memory accesses can be reduced further still (Milakov et al., 2018).
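The one-pass recurrence can be sketched in plain Python as follows (the paper's CUDA kernels are not reproduced; this only illustrates the joint update of the running maximum and running normalizer):

```python
import numpy as np

def online_softmax(x):
    """'Safe' softmax with a fused single pass: the running maximum m and the
    running normalizer d are updated together, so the input is read once before
    the final normalization pass instead of twice."""
    m = -np.inf  # running maximum
    d = 0.0      # running sum of exp(x_j - m)
    for xj in x:
        m_new = max(m, xj)
        d = d * np.exp(m - m_new) + np.exp(xj - m_new)  # rescale old sum, add new term
        m = m_new
    return np.exp(np.asarray(x) - m) / d

x = np.random.randn(1000)
reference = np.exp(x - x.max()); reference /= reference.sum()
assert np.allclose(online_softmax(x), reference)
```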
4. Sparsity-Inducing and Selective Variants
Classical softmax assigns strictly positive probabilities to all classes, which can hinder convergence speed and model discrimination for high-dimensional outputs. Multiple modified functions have been proposed to achieve sparsity and dynamic support selection:
- r-softmax: Enforces a chosen fraction $r \in [0, 1)$ of exactly zero outputs by thresholding the logits at their empirical $r$-quantile, yielding exactly the prescribed number of zeros. The mechanism is controlled via a quantile-based shift applied before the softmax, is differentiable almost everywhere, and becomes adaptive when $r$ is predicted by an auxiliary network. r-softmax outperforms other sparse mappings, produces higher F1 in multi-label settings, and yields improvements when used for attention in transformers (Bałazy et al., 2023); see the first sketch after this list.
- Sparse-softmax: Masks all but the $k$ largest logits, then exponentiates and normalizes over that support. The choice of $k$ controls both the margin and the learning focus: for a given target loss, sparse-softmax reduces the required logit margin in the cross-entropy from on the order of $\log(n-1)$ to $\log(k-1)$. It yields faster convergence and improved macro/micro F1 in high-dimensional text and sequence models for moderate values of $k$ (Sun et al., 2021); see the second sketch after this list.
- Input/Feature Transformed Gating: In mixture-of-experts models with softmax gating, transforming the input $x$ before it enters the gating network removes pathological couplings between expert and gate parameters (expressed as partial differential equations) that otherwise slow parameter-estimation rates below any polynomial order. Suitable transforms restore parametric estimation rates and improve convergence in multinomial logistic MoEs (Nguyen et al., 2023).
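The r-softmax mechanism described above can be sketched as follows (a hedged illustration of quantile-based thresholding and renormalization; it follows the prose description rather than the exact parameterization of Bałazy et al.):

```python
import numpy as np

def r_softmax_sketch(z, r):
    """Quantile-thresholded sparse softmax: roughly a fraction r of the smallest
    logits is zeroed out, and the surviving entries are renormalized."""
    t = np.quantile(z, r)                          # empirical r-quantile of the logits
    keep = z > t                                   # entries above the threshold survive
    e = np.where(keep, np.exp(z - z.max()), 0.0)
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5, 3.0, -2.5])
print(r_softmax_sketch(z, r=0.4))                  # about 40% of the outputs are exactly zero
```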
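The top-k sparse-softmax is simpler to state exactly; a minimal NumPy version (illustrative, not the authors' code) is:

```python
import numpy as np

def sparse_softmax(z, k):
    """Top-k sparse softmax: exponentiate and normalize only over the k largest
    logits; every other class receives exactly zero probability."""
    out = np.zeros_like(z)
    topk = np.argpartition(z, -k)[-k:]   # indices of the k largest logits
    e = np.exp(z[topk] - z[topk].max())
    out[topk] = e / e.sum()
    return out

z = np.random.randn(10000)               # a high-dimensional output layer
p = sparse_softmax(z, k=5)
print(np.count_nonzero(p))               # 5
```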
5. Alternative Basis Functions and Spherical/Taylor Families
The exponential in softmax can be replaced by other normalizers for various optimization or architectural benefits.
- Spherical-family losses: Use quadratic normalizers, e.g., the log-spherical softmax, $o_i = z_i^2 / \sum_j z_j^2$, and the log-Taylor softmax, based on the second-order Taylor expansion of the exponential, $o_i = (1 + z_i + z_i^2/2) / \sum_j (1 + z_j + z_j^2/2)$. These normalizers permit exact loss and gradient computation at a cost independent of the output dimension (versus a cost proportional to it for softmax) and excel for small output dimensions, but are outperformed by softmax for very large ones (Brébisson et al., 2015); see the sketch after this list.
- Taylor and Soft-Margin Taylor Softmax: Explicitly replace the exponential with finite or higher-order Taylor expansions in both the normalizer and the gradient. The SM-Taylor variant adds a margin penalty on the true-class logit to promote class separability. On benchmarks, SM-Taylor with a small expansion order (e.g., order 2 for MNIST/CIFAR-10) always meets or exceeds softmax accuracy (Banerjee et al., 2020).
- Periodic Softmax Alternatives: In attention mechanisms where dot-product scores are not normally distributed and softmax gradients vanish, replacing the exponential with periodic functions, such as sine-based or phase-shifted cosine-based normalizers, yields oscillatory but bounded gradients, preventing vanishing and producing accuracy gains of several percentage points in deep self-attention models. Pre-normalization and careful phase design are required for stability (Wang et al., 2021).
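The two closed-form members of the spherical family quoted above are easy to write down directly (a minimal NumPy sketch; the epsilon term in the spherical variant is an added safeguard, not part of the cited definition):

```python
import numpy as np

def spherical_softmax(z, eps=1e-12):
    """Quadratic normalizer: o_i = z_i^2 / sum_j z_j^2 (eps guards against an all-zero input)."""
    sq = z ** 2 + eps
    return sq / sq.sum()

def taylor_softmax(z):
    """Second-order Taylor expansion of exp: o_i = (1 + z_i + z_i^2/2) / sum_j (1 + z_j + z_j^2/2).
    Each term equals (1 + z_i)^2 / 2 + 1/2 >= 0.5, so the output is a valid distribution."""
    t = 1.0 + z + 0.5 * z ** 2
    return t / t.sum()

z = np.array([1.5, -0.3, 0.0, 2.0])
print(spherical_softmax(z))
print(taylor_softmax(z))
```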
6. Noise-Robust and Distributional Modulation Variants
Noisy labels degrade standard loss functions unless the model outputs are close to one-hot or the loss is symmetric. ε-Softmax augments the largest softmax output by a constant ε and renormalizes by 1 + ε, guaranteeing proximity to the one-hot set and providing explicit excess-risk bounds under asymmetric noise. Empirical results on CIFAR-10/100 and large-scale datasets show state-of-the-art robustness across all noise regimes, with negligible cost (Wang et al., 4 Aug 2025).
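The mechanism as described above can be sketched in a few lines (a hedged illustration following the prose; it is not the paper's reference implementation, and the choice of eps below is arbitrary):

```python
import numpy as np

def epsilon_softmax(z, eps):
    """Add a constant eps to the largest softmax output and renormalize by (1 + eps),
    pushing the output vector toward the one-hot set."""
    p = np.exp(z - z.max())
    p /= p.sum()
    p[np.argmax(p)] += eps
    return p / (1.0 + eps)

z = np.array([2.0, 1.0, 0.5])
print(epsilon_softmax(z, eps=4.0))  # larger eps -> output closer to one-hot
```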
Invertible modifications, as in the “softmax2” mapping, build reparameterizable distributions on the simplex (for variational inference), allowing the use of Gaussian sources with explicit Jacobians and closed-form KL divergences, outperforming Gumbel-Softmax in likelihood and gradient variance (Potapczynski et al., 2019).
7. Sampled and Reinforcement Learning Operators
For a large number of classes $n$, sampled softmax with kernel-based importance sampling replaces the exact partition function by a mini-batch estimate; adaptive kernels that track the model's output distribution reduce the bias inherent to uniform sampling by orders of magnitude, achieving close-to-exact performance with tens or hundreds of samples (Blanc et al., 2017).
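A generic sampled-softmax estimate with a uniform proposal looks as follows (illustrative only; the kernel-based proposal distribution of the cited work is not reproduced here):

```python
import numpy as np

def sampled_softmax_logprob(logits, target, num_samples, rng):
    """Estimate log softmax(target) by replacing the full partition function over n
    classes with the target term plus an importance-weighted sum over a uniform
    subsample of negative classes."""
    n = logits.shape[0]
    negatives = rng.choice(np.delete(np.arange(n), target), size=num_samples, replace=False)
    weight = (n - 1) / num_samples   # each sampled negative stands in for (n-1)/num_samples classes
    z_hat = np.exp(logits[target]) + weight * np.exp(logits[negatives]).sum()
    return logits[target] - np.log(z_hat)

rng = np.random.default_rng(0)
logits = rng.normal(size=50_000)          # large output vocabulary
print(sampled_softmax_logprob(logits, target=123, num_samples=256, rng=rng))
```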
In sequential decision-making, “mellowmax” is a quasi-arithmetic mean operator,

$$\mathrm{mm}_\omega(x) = \frac{1}{\omega} \log\left(\frac{1}{n} \sum_{i=1}^{n} e^{\omega x_i}\right),$$

which, unlike the Boltzmann operator, is a non-expansion under the infinity norm and guarantees unique fixed-point convergence in planning and SARSA. Policies derived via a state-dependent temperature maximize expected value plus entropy under moment-matching constraints, yielding stable and performant RL updates (Asadi et al., 2016).
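The operator itself is a one-liner up to numerical stabilization; a NumPy sketch (with illustrative action values) is:

```python
import numpy as np

def mellowmax(x, omega):
    """Quasi-arithmetic mean mm_omega(x) = (1/omega) * log(mean_i exp(omega * x_i)),
    computed via a log-sum-exp shift for numerical stability. Large omega approaches
    max(x); omega near 0 approaches the arithmetic mean."""
    x = np.asarray(x, dtype=float)
    scaled = omega * x
    m = scaled.max()
    lse = m + np.log(np.exp(scaled - m).sum())
    return (lse - np.log(len(x))) / omega

q_values = np.array([1.0, 2.0, 10.0])     # illustrative action values for one state
print(mellowmax(q_values, omega=0.01))    # close to the mean (~4.33)
print(mellowmax(q_values, omega=100.0))   # close to the max (10.0)
```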
In summary, modified softmax functions span a continuum from geometric reinterpretations and gradient stabilization, through sparsification and noise-robustification, to architecture-specific and efficiency-driven transformations. These advances have been directly integrated into modern deep learning pipelines for classification, language modeling, vision, transformers, multi-label learning, and variational inference, producing gains in empirical accuracy, scalability, robustness, and theoretical convergence.