Zero-Temperature Softmax

Updated 10 February 2026

Zero-temperature softmax is a limiting case of the softmax function where T → 0, resulting in a hard, deterministic one-hot selection.
It underpins hard attention, discrete sampling, and differentiable categorical relaxations, such as the Gumbel-Softmax, in neural models.
Practical challenges include vanishing gradients, rank collapse, and dispersion in long sequences, necessitating careful temperature tuning.

Zero-temperature softmax refers to the limiting behavior of the temperature-controlled softmax function as the temperature parameter $T \to 0^+$. In this regime, the softmax outputs approach a hard one-hot assignment, selecting the argmax of the input logits deterministically. This concept is central to discrete decision-making, hard attention, categorical relaxations (such as the Gumbel-Softmax), and the analysis of neural network learning dynamics. The zero-temperature limit exposes both mathematical structure and practical issues in differentiation, optimization, and generalization.

1. Mathematical Formalism of Temperature-Scaled Softmax

The temperature-scaled softmax for input logits $z = (z_1, \ldots, z_c) \in \mathbb{R}^c$ and $T > 0$ is defined as

$\operatorname{softmax}_T(z)_i = \frac{\exp(z_i/T)}{\sum_{j=1}^c \exp(z_j/T)}$

for $i = 1, \ldots, c$. This parameterization allows explicit control over the "sharpness" of the output distribution: $T \gg 1$ yields near-uniform assignments, while $T \ll 1$ produces sharper distributions. Equivalently,

$\operatorname{softmax}_T(z) = \operatorname{softmax}_1(z/T),$

so lowering $T$ amplifies the impact of logit differences (Masarczyk et al., 2 Jun 2025, Veličković et al., 2024, Lee-Jenkins, 28 Aug 2025).
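The definition above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the helper name `softmax_t` and the example logits are assumptions made here, and the max-subtraction step is the usual trick for avoiding overflow in `exp`, which leaves the result unchanged.

```python
import math

def softmax_t(z, temperature=1.0):
    """Temperature-scaled softmax of a list of logits (hypothetical helper)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in z]
    shift = max(scaled)  # subtracting the max avoids overflow; the output is unchanged
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                      # illustrative values only
hot = softmax_t(logits, temperature=100.0)    # T >> 1: close to uniform
cold = softmax_t(logits, temperature=0.05)    # T << 1: close to one-hot on index 0

# Identity from the text: softmax_T(z) equals softmax_1(z / T).
rescaled = softmax_t([v / 0.05 for v in logits], temperature=1.0)
```

Comparing `hot` and `cold` shows the two regimes described above, and `rescaled` matches `cold`, as the rescaling identity says it should.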

At the variational level, for a score vector $\phi$ (playing the role of the logits $z$) and $T > 0$, the softmax arises as the unique maximizer of the "free energy"

$\mathcal{F}(p) = \langle p, \phi \rangle + T H(p)$

over the probability simplex $\Delta$, with $H(p) = -\sum_i p_i \log p_i$ the Shannon entropy. The Karush-Kuhn-Tucker (KKT) conditions yield the same temperature-scaled softmax expression for the optimum $p^*$ (Lee-Jenkins, 28 Aug 2025).
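The variational characterization can be checked numerically. The sketch below evaluates the free energy at $p^* = \operatorname{softmax}_T(\phi)$ and at random points of the simplex. It relies on two standard identities: the optimal value equals $T \log \sum_i \exp(\phi_i / T)$, and the gap to any other $p$ is $T$ times the KL divergence from $p$ to $p^*$, so it is never negative. The score vector, temperature, sample count, and helper names are illustrative assumptions.

```python
import math
import random

def softmax_t(phi, temperature):
    scaled = [v / temperature for v in phi]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def free_energy(p, phi, temperature):
    """F(p) = <p, phi> + T * H(p), with the convention 0 * log 0 = 0."""
    expected = sum(pi * fi for pi, fi in zip(p, phi))
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0.0)
    return expected + temperature * entropy

def random_simplex_point(n, rng):
    """Normalized exponential draws give a point of the probability simplex."""
    w = [rng.expovariate(1.0) for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

phi = [1.0, 0.2, -0.5]   # illustrative scores
T = 0.7
p_star = softmax_t(phi, T)
best = free_energy(p_star, phi, T)
log_partition = T * math.log(sum(math.exp(f / T) for f in phi))

rng = random.Random(0)
gaps = [best - free_energy(random_simplex_point(3, rng), phi, T) for _ in range(1000)]
smallest_gap = min(gaps)  # should not be negative if p_star is the maximizer
```

A random search of this kind only illustrates the claim. The uniqueness of the maximizer follows from strict concavity of the entropy term.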

2. Zero-Temperature Limit: Hard Argmax and Discrete Distributions

As $T \to 0^+$, the softmax concentrates its entire probability mass on the maximal element of $z$. If $i^* = \arg\max_i z_i$ is unique, then

$\lim_{T \to 0^+} \operatorname{softmax}_T(z)_i = \delta_{i,i^*}.$

This is a deterministic one-hot ("hard") decision selecting the argmax; non-maximal entries vanish exponentially fast in $1/T$: $\operatorname{softmax}_T(z)_i \propto e^{(z_i - z_{i^*})/T} \to 0$ for $i \ne i^*$, recovering the classical argmax rule. If several entries tie for the maximum, the limit instead spreads the mass uniformly over the maximizers. This limiting behavior underpins hard attention, discrete sampling, and deterministic routing in neural models (Masarczyk et al., 2 Jun 2025, Veličković et al., 2024, Lee-Jenkins, 28 Aug 2025, Jang et al., 2016).
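A short numerical sketch makes the limit concrete. It evaluates the softmax at decreasing temperatures, compares the result with a one-hot argmax vector, and includes a tied input to show why uniqueness of the maximum matters. The helper names `softmax_t` and `hard_argmax` and the example logits are assumptions made for illustration.

```python
import math

def softmax_t(z, temperature):
    scaled = [v / temperature for v in z]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_argmax(z):
    """One-hot vector at the first maximal index."""
    best = max(range(len(z)), key=lambda i: z[i])
    return [1.0 if i == best else 0.0 for i in range(len(z))]

z = [1.0, 3.0, 2.5]   # unique maximum at index 1
rows = []
for T in (1.0, 0.1, 0.01):
    p = softmax_t(z, T)
    # Decay factor e^{(z_i - z_{i*})/T} for the runner-up entry (index 2).
    decay = math.exp((z[2] - z[1]) / T)
    rows.append((T, p, decay))
    print(T, [round(v, 6) for v in p], decay)

# With a tie, the mass is shared among the maximizers instead of being one-hot.
tied = softmax_t([2.0, 2.0, 0.0], 1e-3)
```

Each row pairs the runner-up probability with the decay factor $e^{(z_i - z_{i^*})/T}$, which bounds it from above.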

3. Categorical and Gumbel-Softmax Relaxations

The Gumbel-Softmax ("Concrete") distribution provides a differentiable relaxation of categorical sampling. For categorical probabilities $\pi$ and independent standard Gumbel noise $g_i$, a relaxed sample is defined as

$y_i = \frac{\exp((\log \pi_i + g_i)/\tau)}{\sum_j \exp((\log \pi_j + g_j)/\tau)}$

with temperature $\tau > 0$. As $\tau \to 0^+$, $y$ converges to a one-hot vector at index $\arg\max_i (\log \pi_i + g_i)$, recovering an exact categorical sample via the Gumbel-Max trick (Jang et al., 2016). For intermediate $\tau$, $y$ interpolates smoothly between uniform and categorical extremes.
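The relationship between the relaxed sample and the Gumbel-Max trick can be sketched with the standard library alone. The example draws Gumbel noise by inverse transform, $g = -\log(-\log u)$ with $u$ uniform on $(0, 1)$, then checks that a low-temperature Gumbel-Softmax sample peaks at the Gumbel-Max index and that the empirical frequencies of that index approximate $\pi$. Function names, the seed, the probabilities, and the sample size are illustrative assumptions.

```python
import math
import random

def sample_gumbel(rng):
    """Standard Gumbel draw via inverse transform: g = -log(-log(u))."""
    u = rng.random()
    while u == 0.0:  # guard against log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def gumbel_softmax(pi, tau, gumbels):
    """Relaxed sample y for probabilities pi, temperature tau, and given noise."""
    scores = [(math.log(p) + g) / tau for p, g in zip(pi, gumbels)]
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gumbel_max(pi, gumbels):
    """Index of the exact categorical sample: argmax_i (log pi_i + g_i)."""
    perturbed = [math.log(p) + g for p, g in zip(pi, gumbels)]
    return max(range(len(pi)), key=lambda i: perturbed[i])

rng = random.Random(0)
pi = [0.6, 0.3, 0.1]
n = 20000
counts = [0] * len(pi)
agree = 0
for _ in range(n):
    g = [sample_gumbel(rng) for _ in pi]
    y = gumbel_softmax(pi, tau=0.01, gumbels=g)
    k = gumbel_max(pi, g)
    counts[k] += 1
    agree += int(max(range(len(pi)), key=lambda i: y[i]) == k)

freqs = [c / n for c in counts]  # empirical frequencies of the Gumbel-Max index
```

Sharing the same noise between `gumbel_softmax` and `gumbel_max` is what makes the comparison meaningful: the temperature only controls how close the relaxed vector is to the one-hot sample, not which index wins.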

The straight-through Gumbel-Softmax (ST-GS) estimator enforces hard argmax discretization in the forward pass ($\tau^f \to 0$), while using a continuous softmax with moderate temperature in the backward pass ($\tau^b > 0$) for gradient computation. This decoupling improves both exactness and gradient fidelity in discrete optimization tasks, outperforming single-temperature estimators in autoencoding and generative settings (Shah et al., 2024).
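One way to read the forward/backward decoupling is sketched below without an autodiff framework. The forward output is the one-hot argmax, and the surrogate used for gradients is the Jacobian of a softmax at temperature $\tau^b$, namely $\partial y_i / \partial x_j = y_i(\delta_{ij} - y_j)/\tau^b$. This is an illustrative reading under stated assumptions, not the procedure of Shah et al.; the function name `st_forward_backward` and the example values are hypothetical, and the input is assumed to already contain the Gumbel-perturbed logits.

```python
import math

def softmax_t(z, temperature):
    scaled = [v / temperature for v in z]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def st_forward_backward(perturbed_logits, tau_backward):
    """Return (hard one-hot forward output, surrogate Jacobian of the soft sample)."""
    n = len(perturbed_logits)
    k = max(range(n), key=lambda i: perturbed_logits[i])
    hard = [1.0 if i == k else 0.0 for i in range(n)]        # forward: tau_f -> 0
    soft = softmax_t(perturbed_logits, tau_backward)         # backward: tau_b > 0
    # Softmax Jacobian: d y_i / d x_j = y_i * (delta_ij - y_j) / tau_b
    jac = [[soft[i] * ((1.0 if i == j else 0.0) - soft[j]) / tau_backward
            for j in range(n)] for i in range(n)]
    return hard, jac

x = [0.3, 1.2, -0.4]   # assumed to be log pi_i + g_i for one noise draw
hard, jac = st_forward_backward(x, tau_backward=0.5)

# Chain rule with an upstream gradient dL/dy (illustrative values):
upstream = [1.0, -2.0, 0.5]
grad_x = [sum(upstream[i] * jac[i][j] for i in range(len(x))) for j in range(len(x))]
```

In an autodiff framework the same effect is usually obtained by adding the detached difference between the hard and soft vectors to the soft vector, so the hard values flow forward while gradients follow the soft path.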

4. Optimization, Gradient Behavior, and Training Dynamics at Zero Temperature

In softmax-based models trained via the cross-entropy loss, introducing inverse temperature $\beta = 1/T$ yields

$\sigma_i(Z) = \frac{\exp(\beta z_i)}{\sum_j \exp(\beta z_j)}.$

As $\beta \to \infty$ ($T \to 0$), the model outputs approach a one-hot vector aligned with the maximal logit, but crucially:

The gradient $\nabla_\theta L \sim \beta (y - \sigma(Z))$ becomes exponentially small once the maximal logit matches the label, since $y - \sigma(Z) \sim e^{-\beta \Delta z}$ (where $\Delta z$ is the logit gap), leading to vanishing updates.
The network remains in the linear (NTK) regime for extended periods ($\tau_{nl} \propto \beta$), impeding feature learning.
Convergence becomes excessively slow, with model weights barely moving from initialization, and generalization degrades (Agarwala et al., 2020).
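The vanishing-gradient effect in the first point can be reproduced at the level of the logits. For the loss $L = -\log \sigma_y(Z)$ with inverse temperature $\beta$, the derivative with respect to the logits is $\beta(\sigma - y)$, so the descent direction is $\beta(y - \sigma)$ as above. The sketch below evaluates its norm for a correctly classified example with a fixed logit gap; the helper names, the logits, and the chosen values of $\beta$ are illustrative assumptions, and parameter gradients $\nabla_\theta L$ would additionally involve the network Jacobian.

```python
import math

def softmax_beta(z, beta):
    scaled = [beta * v for v in z]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, label, beta):
    """L = -log sigma_label(beta * z), computed with a stable log-sum-exp."""
    scaled = [beta * v for v in z]
    shift = max(scaled)
    lse = shift + math.log(sum(math.exp(s - shift) for s in scaled))
    return lse - scaled[label]

def grad_wrt_logits(z, label, beta):
    """dL/dz_i = beta * (sigma_i - y_i) for a one-hot label y."""
    sigma = softmax_beta(z, beta)
    return [beta * (s - (1.0 if i == label else 0.0)) for i, s in enumerate(sigma)]

z = [1.0, 0.0]   # logit gap of 1 in favor of the true class 0
grad_norms = {}
for beta in (1.0, 10.0, 100.0):
    g = grad_wrt_logits(z, 0, beta)
    grad_norms[beta] = math.sqrt(sum(v * v for v in g))
```

The prefactor $\beta$ grows only linearly while the residual shrinks like $e^{-\beta \Delta z}$, so the product collapses. For a misclassified example the residual stays of order one and the gradient instead scales with $\beta$.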

Careful annealing or tuning of $\beta$ is recommended; the optimal value $\beta^*$ is architecture-dependent and typically falls in the range $10^{-2} \leq \beta^* \leq 10^1$.

5. Representation Theory, Rank Collapse, and Generalization Effects

At very low temperature, the softmax sharpens, and the network's learned representations (rows of the logit matrix across input samples) exhibit rank-deficit bias: numerical ranks much lower than the class count. Empirical findings indicate:

At low $T$, hidden-layer matrix rank drops sharply, fewer layers contribute meaningfully to linear separability, and gradients also collapse in rank.
Increasing the logit norm (decreasing $T$) raises post-softmax matrix rank but, during training, high-$T$ solutions often exhibit persistently low pre-softmax rank.
Rank collapse impairs out-of-distribution generalization while sometimes improving compressed model performance (Masarczyk et al., 2 Jun 2025).

Explicit temperature tuning, architectural parameters (e.g., width, normalization), and annealing schedules can balance sharpness, representation diversity, and numerical stability. Low $T$ can require smaller learning rates and careful regularization.

6. Zero-Temperature Softmax in Attention and Sequence Models: Dispersion and Limitations

In attention mechanisms (e.g., Transformers), zero-temperature softmax yields exact selection of the token with maximal similarity (hard attention). For any fixed $T > 0$, however, attention disperses as the sequence length $N$ grows. If all logits $e = (e_1, \ldots, e_N)$ lie in a bounded interval $[m, M]$, every attention weight $\alpha_i = \operatorname{softmax}_T(e)_i$ satisfies

$\frac{1}{N} \exp\left(-\frac{M - m}{T}\right) \le \alpha_i \le \frac{1}{N} \exp\left(\frac{M - m}{T}\right).$

As $N \to \infty$, $\max_i \alpha_i \to 0$ for fixed $T$, so sharpness is lost on long sequences. Adaptive temperatures based on entropy can partially restore sharpness at inference but cannot overcome the limitation for arbitrarily large $N$ (Veličković et al., 2024). Non-softmax attention mechanisms or explicit discrete routing are necessary for indefinitely sharp, scale-robust reasoning.
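The dispersion bound is easy to check empirically. The sketch below draws logits uniformly from a bounded interval $[m, M]$, applies the softmax at a fixed temperature, and records the smallest and largest attention weight together with the two bounds for growing $N$. The interval, temperature, seed, and sequence lengths are illustrative assumptions.

```python
import math
import random

def softmax_t(z, temperature):
    scaled = [v / temperature for v in z]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
m, M, T = -1.0, 1.0, 0.5
results = {}
for n in (10, 100, 1000, 10000):
    e = [rng.uniform(m, M) for _ in range(n)]   # bounded logits
    alpha = softmax_t(e, T)
    lower = math.exp(-(M - m) / T) / n
    upper = math.exp((M - m) / T) / n
    results[n] = (min(alpha), max(alpha), lower, upper)
    print(n, results[n])
```

Because the upper bound scales like $1/N$, the largest weight must eventually fall below any fixed threshold, whatever the arrangement of the logits inside $[m, M]$.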

7. Dynamical Perspective: Replicator Dynamics and the Time-Rescaling Role of Temperature

The evolution of output probabilities under entropic mirror ascent with temperature $T$ yields, in continuous time, the replicator equation

$\dot{p}_i = \frac{1}{T} p_i (\phi_i - \langle p, \phi \rangle),$

where $\phi$ is the score vector. Temperature appears as a pure time-rescaling parameter: $p_T(t) = p_1(t/T)$. As $T \to 0^+$, the trajectory in the probability simplex collapses exponentially fast onto the vertex corresponding to the argmax of $\phi$, with convergence rates determined by logit gaps. This interpretation provides a principled theoretical link between zero-temperature softmax and argmax selection as a dynamical system (Lee-Jenkins, 28 Aug 2025).
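The time-rescaling property can be illustrated with a simple forward-Euler integration of the replicator equation for a constant score vector. For a uniform starting point, the exact solution is $p(t) = \operatorname{softmax}_1(\phi\, t / T)$, which the sketch uses as a reference. The step count, scores, and function names are illustrative assumptions, and forward Euler is chosen only for brevity.

```python
import math

def replicator_euler(phi, temperature, t_end, steps):
    """Forward-Euler integration of dp_i/dt = p_i * (phi_i - <p, phi>) / T from uniform p."""
    n = len(phi)
    p = [1.0 / n] * n
    dt = t_end / steps
    for _ in range(steps):
        mean = sum(pi * fi for pi, fi in zip(p, phi))
        p = [pi + dt * pi * (fi - mean) / temperature for pi, fi in zip(p, phi)]
    return p

def replicator_closed_form(phi, temperature, t):
    """Exact solution from a uniform start: softmax of phi * t / T."""
    scores = [f * t / temperature for f in phi]
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

phi = [1.0, 0.5, 0.0]   # illustrative scores
numeric = replicator_euler(phi, temperature=0.5, t_end=1.0, steps=20000)
exact = replicator_closed_form(phi, temperature=0.5, t=1.0)

# Time rescaling: running at T = 0.5 for time 1 matches running at T = 1 for time 2.
rescaled = replicator_closed_form(phi, temperature=1.0, t=2.0)
```

The closed form also shows the link to the static picture: running the dynamics for unit time at temperature $T$ from a uniform start lands exactly on $\operatorname{softmax}_T(\phi)$.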

Key References:

Gumbel-Softmax and categorical reparameterization: (Jang et al., 2016, Shah et al., 2024)
Softmax temperature, rank collapse, generalization: (Masarczyk et al., 2 Jun 2025, Agarwala et al., 2020)
Dispersion and adaptive temperature in attention: (Veličković et al., 2024)
Variational, mirror descent, and replicator dynamics perspectives: (Lee-Jenkins, 28 Aug 2025)