Ghosts of Softmax: Hidden Dynamics
- Ghosts of Softmax are failure modes arising from the softmax function, causing misleading confidence, inevitable dispersion, and irreducible ghost mass.
- They impact deep learning by inducing optimization singularities, rank-deficit bias, and degraded attention mechanisms in classification and transformer models.
- Research remedies include adaptive softmax variants, dynamic temperature tuning, and per-step optimization controllers to mitigate ghost effects.
The term “Ghosts of Softmax” encompasses a set of failure modes, optimization pathologies, and latent biases arising from the mathematical and geometric structure of the softmax function in deep learning. These phenomena manifest as misleading confidence reporting, inevitable dispersion in attention, optimization singularities, irreducible residual (ghost) probability mass, and rank-deficient representation collapse across classification and attention-based models. Modern research provides a comprehensive, multi-perspective analysis of these ghosts, their theoretical origins, empirical consequences, and remedies.
1. Definition and Mathematical Origins
The softmax function, mapping a logit vector $z \in \mathbb{R}^n$ to a probability vector via $\mathrm{softmax}(z)_i = e^{z_i} / \sum_{j=1}^{n} e^{z_j}$, is foundational in classification heads and transformer architectures. Its design yields several mathematically necessary “ghost” effects, illustrated numerically after the list below:
- Non-injectivity: Different logit vectors can produce identical softmax outputs (softmax is invariant to adding a constant to all logits), and once one logit dominates the output saturates numerically, eliminating information about the scale and direction of over-optimization (Ozbulak et al., 2018).
- Dispersion with Growing Support: With bounded logits, as the number of items $n \to \infty$, every softmax weight inevitably scales as $\Theta(1/n)$. No single item can retain a non-trivial share of mass in the large-$n$ limit, even in well-trained attention heads (Veličković et al., 2024).
- Guaranteed Non-zero Ghost Mass: Softmax assigns strictly positive probability to every entry, since $e^{z_i} > 0$ for any finite logit; the residual “ghost mass” spread over non-selected entries cannot vanish except in the infinite-logit limit (Zhou et al., 2024).
- Complex-analytic Structure: The partition function $Z(\eta) = \sum_j e^{z_j + \eta d_j}$ along an update direction with logit derivatives $d_j$, analytic in the step size $\eta$, has complex zeros off the real axis (the “ghosts of softmax”). These zeros induce logarithmic singularities in the cross-entropy loss $\mathcal{L} = \log Z - z_y$, setting a hard geometric constraint on optimization step size via the Taylor convergence radius (Sao, 13 Mar 2026).
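A minimal numerical sketch of the first three effects (shift invariance and saturation, dispersion with growing support, and irreducible ghost mass) using a plain NumPy softmax; the logit bound of 1 and the specific values are illustrative assumptions, not settings from any of the cited papers:

```python
import numpy as np

def softmax(z, temperature=1.0):
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    z = np.asarray(z, dtype=np.float64) / temperature
    e = np.exp(z - z.max())          # shift invariance: adding a constant changes nothing
    return e / e.sum()

# Non-injectivity / saturation: a 10x larger margin is invisible in the output.
print(softmax([20.0, 0.0, 0.0]))     # ~[1.0, 2e-9, 2e-9]
print(softmax([200.0, 0.0, 0.0]))    # top entry still prints as 1.0

# Dispersion: with logits bounded in [-1, 1], the largest weight shrinks roughly as 1/n.
rng = np.random.default_rng(0)
for n in (8, 64, 512, 4096):
    w = softmax(rng.uniform(-1.0, 1.0, size=n))
    print(n, float(w.max()))

# Ghost mass: every entry keeps strictly positive probability for finite logits.
w = softmax([5.0, 1.0, -3.0])
print(bool((w > 0).all()), float(1.0 - w.max()))   # True, residual mass > 0
```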
2. Ghosts in Adversarial Evaluation and Optimization
Softmax-derived confidence is extensively used as a metric of adversarial robustness but is fundamentally misleading:
- Saturation of Confidence: Once the top logit exceeds the others by even a moderate margin, the softmax confidence saturates numerically at 1.0. Further growth of the margin is invisible in the confidence output but can dramatically increase adversarial transferability. Over-optimized attacks exploit this: transfer success keeps climbing while the reported confidence stays pinned at 1.0, as illustrated in the sketch after this list (Ozbulak et al., 2018).
- Joint Logit Growth: Growing two logits together to extreme values while preserving their difference yields adversarial examples that report low confidence yet remain highly transferable, again obscuring attack strength in the confidence output (Ozbulak et al., 2018).
- Optimization Pathologies and Ghost Plateaus: In transformer self-attention, softmax can cause vanishing gradients for neglected tokens, leading to extended plateaus in learning multi-step tasks, only broken by stochastic “Eureka-moments” when attention lifts off these plateaus by chance. Remedies like temperature ramp-up and NormSoftmax normalization restore gradient flow (Hoffmann et al., 2023).
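A minimal sketch of the saturation effect, with made-up 10-class logit vectors standing in for a mild and an over-optimized adversarial example; the margins of 20 and 80 are illustrative assumptions, not values from Ozbulak et al. (2018):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical 10-class logits: a modest attack vs. a heavily over-optimized one.
mild = np.r_[20.0, np.zeros(9)]
over = np.r_[80.0, np.zeros(9)]

for name, z in (("mild", mild), ("over-optimized", over)):
    confidence = softmax(z)[0]
    margin = z[0] - np.sort(z)[-2]          # top logit minus runner-up
    print(f"{name:15s} confidence={confidence:.6f}  logit margin={margin:.1f}")
# Both confidences print as 1.000000; only the raw margin separates the two attacks.
```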
3. Dispersion and Hidden Limits in Attention Circuits
The analytical inevitability of dispersion in softmax-based attention undermines assumptions about the “sharpness” and generalization of learned circuits:
- Theoretical Dispersion: Attention heads trained to perform “argmax-like” retrieval over bounded or discrete vocabularies must dilute their attention for large supports. The maximum possible coefficient for any item decays as $O(1/n)$ for large context size $n$, irrespective of how sharp the head was during training (Veličković et al., 2024).
- Empirical Signature: Learned attention that appears sharply focused at the largest context size seen during training gradually spreads toward uniformity as $n$ increases, degrading performance on out-of-distribution context sizes. Entropy per head increases monotonically with $n$. Adaptive temperature controllers can partially restore sharpness by re-centering the entropy in each layer, as in the sketch following this list (Veličković et al., 2024).
- Generalization Collapse: Mechanistically interpretable “circuits” in transformer attention can only operate sharply up to the scale seen in training data. When evaluated on longer contexts, these circuits ghost—their discriminative capacity dilutes with growing input size.
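The sketch below illustrates both the entropy growth with context size and a crude temperature re-sharpening loop; the fixed score margin, the target entropy, and the shrink factor are assumptions for illustration, not the adaptive-temperature controller of Veličković et al. (2024):

```python
import numpy as np

def attention_weights(scores, temperature=1.0):
    s = np.asarray(scores, dtype=np.float64) / temperature
    e = np.exp(s - s.max())
    return e / e.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(1)
target_entropy = 1.0                         # assumed per-head target, illustrative only

for n in (16, 128, 1024, 8192):
    # Bounded scores; one "retrieved" key sits a fixed margin above the rest.
    scores = rng.uniform(-1.0, 1.0, size=n)
    scores[0] += 4.0
    w = attention_weights(scores)
    # Crude re-sharpening: lower the temperature until entropy drops below the target.
    t, w_sharp = 1.0, w
    while entropy(w_sharp) > target_entropy and t > 1e-3:
        t *= 0.8
        w_sharp = attention_weights(scores, temperature=t)
    print(f"n={n:5d}  entropy={entropy(w):.2f}  max weight={w.max():.3f}  "
          f"re-sharpened max={w_sharp.max():.3f} (T={t:.3f})")
```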
4. Geometric and Analytic Constraints on Training Dynamics
The ghosts of softmax also emerge as hard geometric obstacles in loss-surface geometry and optimization:
- Complex Zeros Determining Safe Step Sizes: The true convergence radius of a Taylor expansion of the cross-entropy loss is set not by local curvature (the Hessian) but by the nearest complex zero of the partition function $Z(\eta)$ in the update direction. For multiclass problems, the safe step size is lower-bounded by $\pi/\Delta$, where $\Delta$ is the spread of the logit directional derivatives $d_j$ in the chosen direction. Steps beyond this radius predictably induce loss inflation and accuracy collapse (Sao, 13 Mar 2026).
- Unified Bound Across Architectures: This geometric constraint is empirically tight: across MLPs, CNNs, ResNets, and transformers, and for multiple optimization directions, no model collapses for normalized steps below the bound, while catastrophic collapse is consistent once the bound is exceeded. Temperature scaling shifts the bound proportionally.
- Practical Step Control: An explicit controller enforcing the per-direction step bound survives even extreme learning-rate spikes, outperforming standard gradient-norm clipping (Sao, 13 Mar 2026); a minimal controller sketch follows this list.
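A minimal sketch of such a controller, under the reading above that the nearest zero of $Z(\eta) = \sum_j e^{z_j + \eta d_j}$ lies no closer to the real axis than $\pi/\Delta$; the function name `safe_step_cap`, the `safety` factor, and the example derivatives are illustrative assumptions, not the procedure of Sao (13 Mar 2026):

```python
import numpy as np

def safe_step_cap(logit_derivs, safety=0.5):
    """Step-size cap along one update direction.

    If the logits move as z_j(eta) = z_j + eta * d_j, the partition function
    Z(eta) = sum_j exp(z_j + eta * d_j) has no complex zero closer to the real
    axis than pi / (max(d) - min(d)), so the cross-entropy Taylor series
    converges at least that far; the cap stays inside that radius.
    """
    d = np.asarray(logit_derivs, dtype=np.float64)
    spread = float(d.max() - d.min())
    if spread == 0.0:
        return np.inf                    # no zeros along this direction
    return safety * np.pi / spread

# Hypothetical per-class directional derivatives of the logits for one step.
d = np.array([3.0, -1.5, 0.2, 0.8])
eta_requested = 2.0                      # e.g., a learning-rate spike
eta_applied = min(eta_requested, safe_step_cap(d))
print(f"requested step {eta_requested}, applied step {eta_applied:.4f}")
```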
5. Residual Ghost Mass and Sparsity-Multimodality Trade-off
Softmax’s structure ensures irreducible “ghost mass” that undermines sharp mode selection:
- Fundamental Trade-Off: Lowering softmax temperature increases sparsity (reducing ghost mass) but eliminates secondary modes; raising temperature enhances multi-modality but increases residual probability on irrelevant entries (Zhou et al., 2024).
- Sparse and Multi-Modal Generalizations: SparseMax and $\alpha$-Entmax produce exact zeros but collapse multimodality at extreme parameter settings; Top-$k$ Softmax or Ev-SoftMax introduce non-differentiability or require custom losses. The sparsity/ghost-mass contrast is sketched after the table below.
- Adaptive Piecewise Modulation (MultiMax): The MultiMax function applies a learnable piecewise-linear transformation that stretches or compresses entries adaptively, achieving both lower ghost mass and preserved multi-modality. The approach strictly Pareto-dominates standard softmax on this trade-off, empirically improving interpretability and sharpening attention in classification, language modeling, and translation (Zhou et al., 2024).
| Method | Ghost Mass | Multi-Modality | Main Limitation |
|---|---|---|---|
| Softmax | Moderate | Moderate-High | Irreducible ghost mass |
| SparseMax | Low | Low | Collapses modes |
| MultiMax | Low | High | More complex |
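A small numerical illustration of the temperature trade-off and of exact zeros under SparseMax; it does not implement MultiMax itself, and the logit values are made up for illustration:

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z, dtype=np.float64) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    """Euclidean projection onto the probability simplex (Martins & Astudillo, 2016):
    entries outside the support come out exactly zero."""
    z = np.asarray(z, dtype=np.float64)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1.0 + k * z_sorted > cumsum
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1.0) / k_z
    return np.maximum(z - tau, 0.0)

# Two relevant modes (3.0 and 2.5) plus three irrelevant entries.
z = np.array([3.0, 2.5, 0.1, 0.0, -0.2])

for T in (0.1, 1.0, 3.0):
    p = softmax(z, temperature=T)
    print(f"T={T:3.1f}  mode1={p[0]:.3f}  mode2={p[1]:.3f}  ghost mass={p[2:].sum():.3f}")
# Low T: near-zero ghost mass but the second mode is crushed; high T: both modes
# survive but ghost mass grows.  SparseMax keeps both modes with exactly zero ghost mass:
print("sparsemax:", sparsemax(z))
```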
6. Representation Collapse and Rank-Deficit Bias
Temperature and logit-norm dynamics in softmax drive networks toward unexpected low-rank solutions:
- Rank-Deficit Bias: Softmax-based architectures converge to feature representations whose empirical rank is much lower than the number of classes. This bias is exacerbated by high temperature (or small logit norm), leading to “ghostly” subspaces that provide high in-distribution accuracy but poor generalization and OOD coverage (Masarczyk et al., 2 Jun 2025).
- Spectral Diagnostics: As temperature increases, the singular values of pre-softmax representation matrices collapse, with only a few directions dominating. Effective depth and feature diversity fall; OOD generalization degrades, while OOD detection may improve (see the sketch after this list).
- Modeling Implications: Temperature tuning becomes a critical architectural and training knob: higher temperatures drive representation compression, while lower temperatures preserve richer features and transferability. Initialization and normalization interact as effective temperature modulators.
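One common spectral diagnostic consistent with the observations above is the effective rank, the exponential of the entropy of the normalized singular-value spectrum; the synthetic feature matrices below are assumed stand-ins for a collapsed versus a diverse penultimate-layer representation:

```python
import numpy as np

def effective_rank(H, eps=1e-12):
    """Effective rank of a feature matrix H (samples x features):
    exponential of the Shannon entropy of the normalized singular values."""
    s = np.linalg.svd(H, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
# Nearly rank-2 features plus small noise, standing in for a collapsed representation.
H_collapsed = rng.normal(size=(256, 2)) @ rng.normal(size=(2, 64)) \
              + 0.01 * rng.normal(size=(256, 64))
# Full-rank Gaussian features, standing in for a diverse representation.
H_diverse = rng.normal(size=(256, 64))

print("collapsed effective rank:", round(effective_rank(H_collapsed), 1))  # close to 2
print("diverse effective rank:  ", round(effective_rank(H_diverse), 1))    # much larger
```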
7. Remediation Strategies and Best Practices
Multiple research efforts converge on a set of interventions to mitigate softmax-induced ghosts:
- Report Raw Logits or Margins: Logit values or class score margins are more sensitive to adversarial strength than softmax confidence (Ozbulak et al., 2018).
- Monitor Attention Entropy and Dispersion: Growing sequence/context size should not cause uncontrolled rise in entropy; adaptive temperature can restore sharpness (Veličković et al., 2024).
- Apply Step Size or Update Controllers: Enforce per-step bounds dictated by the complex singularities (e.g., capping the step within the $\pi/\Delta$ radius along the update direction), not only gradient norms (Sao, 13 Mar 2026).
- Use Adaptive or Modulated Softmax Variants: MultiMax or entropy-stabilized softmax variants reduce ghost mass while retaining functional multimodality (Zhou et al., 2024).
- Tune and Schedule Temperature: Adjust temperature dynamically for desired compression/generalization trade-off; architectural choices (normalization, initialization) should be calibrated accordingly (Masarczyk et al., 2 Jun 2025).
- Be Skeptical of Circuit Interpretability at Large Scale: “Sharp” patterns seen at small input size may ghost out in larger contexts (Veličković et al., 2024).
Taken together, the “Ghosts of Softmax” constitute a multi-faceted set of mathematical and empirical effects that invisibly shape—and sometimes limit—the effectiveness, robustness, and interpretability of deep learning systems using softmax-based decision layers or attention modules. Systematic recognition of these ghosts has catalyzed new theoretical analyses, optimization-safe controllers, and attention mechanisms, though fundamental trade-offs among sharpness, expressivity, optimization stability, and representation diversity remain an area of active research (Ozbulak et al., 2018, Veličković et al., 2024, Hoffmann et al., 2023, Sao, 13 Mar 2026, Zhou et al., 2024, Masarczyk et al., 2 Jun 2025).