Z-Loss: Scale & Shift Invariant Loss
- Z-loss is a classification loss function that normalizes pre-activation vectors to achieve both shift and scale invariance, improving robustness in multi-class settings.
- It belongs to the spherical family, offering computational efficiency with per-example cost independent of the number of output classes, making it ideal for extreme classification.
- The formulation uses a softplus function with two hyperparameters, enabling tuning to approximate ranking-based metrics like top-k error rates.
The Z-loss is a classification loss function designed to address computational and statistical limitations of the log-softmax for large-scale multi-class neural networks. As a member of the spherical family, it can be trained with per-example complexity independent of the number of output classes. In addition, the Z-loss exhibits both shift and scale invariance, aligning it more closely with rank-based evaluation metrics such as top-k error rates. Its formulation applies a softplus nonlinearity, parameterized by two hyperparameters α and β, to a Z-normalized pre-activation score.
1. Formal Definition
Given a pre-activation vector o ∈ R^D and a target index c ∈ {1, …, D}, the Z-loss is defined using the statistics

    μ(o) = (1/D) Σ_d o_d,    σ(o) = √((1/D) Σ_d (o_d − μ(o))²),    z_c = (o_c − μ(o)) / σ(o).

Introducing hyperparameters α > 0 and β, the Z-loss for the correct class c is

    L_Z(o, c) = log(1 + exp(α(β − z_c))).

The loss takes the form of a softplus function applied to the Z-normalized target score, with α controlling the sharpness of the transition and β its location.
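The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the softplus form softplus(α(β − z_c)) described here; `z_loss` is a hypothetical helper name, not the authors' reference code.

```python
import numpy as np

def z_loss(o, c, alpha=1.0, beta=0.0):
    """Z-loss for one pre-activation vector `o` and target index `c`.

    Assumes the form L = log(1 + exp(alpha * (beta - z_c))), with z_c
    the Z-normalized score of the target component.
    """
    mu = o.mean()
    sigma = o.std()               # population standard deviation, as above
    z_c = (o[c] - mu) / sigma
    # log(1 + exp(x)) computed stably as logaddexp(0, x)
    return np.logaddexp(0.0, alpha * (beta - z_c))

o = np.array([2.0, -1.0, 0.5, 3.0])
print(z_loss(o, c=3))  # target holds the highest score, so the loss is small
```

Note that the loss is small when the target score sits well above the mean of the other scores (in units of their standard deviation), and grows as the target falls below it.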
2. Mathematical Invariances
The Z-loss is invariant under both shifting and positive rescaling of the pre-activation vector:
- Shift invariance: adding a constant to every component o_d shifts μ(o) by the same amount, leaving o_c − μ(o), and thus z_c, unchanged.
- Scale invariance: multiplying all o_d by a constant λ > 0 scales both o_c − μ(o) and σ(o) by λ, so z_c, and hence the loss, is unchanged.
By contrast, the log-softmax is only shift-invariant, not scale-invariant. The additional invariance of the Z-loss mirrors the behavior of rank-based metrics, which depend solely on the sorted order of the components o_d.
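Both invariances are easy to verify numerically, since any loss that depends on o only through z_c inherits them. A small check (hypothetical helper name):

```python
import numpy as np

def z_score(o, c):
    # Z-normalized score of the target component
    return (o[c] - o.mean()) / o.std()

rng = np.random.default_rng(0)
o = rng.normal(size=10)
c = 4

shifted = z_score(o + 7.3, c)   # add a constant to every component
scaled = z_score(2.5 * o, c)    # multiply by a positive constant

assert np.isclose(z_score(o, c), shifted)
assert np.isclose(z_score(o, c), scaled)
print("z_c is shift- and scale-invariant")
```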
3. Spherical Family Membership and Computational Efficiency
A loss belongs to the spherical family if it depends on the output vector o only through o_c, the sum Σ_d o_d, and the squared norm Σ_d o_d². The Z-loss satisfies this property, since μ(o) and σ(o) are functions of these two sums. As demonstrated by Vincent et al. (2015), spherical losses admit a low-rank factorization of the output weight matrix together with maintained summary statistics, resulting in a per-example computation cost of O(d²), where d is the hidden-layer size, irrespective of the output size D. This is a significant advantage over the log-softmax, for which naive gradient computation requires O(D·d) per example.
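The dependence on these summary statistics alone can be made explicit. The sketch below (a hypothetical helper, not Vincent et al.'s factored update, which additionally maintains the sums incrementally without forming o) computes z_c from only o_c, s = Σ o_d, and q = Σ o_d²:

```python
import numpy as np

def z_score_from_stats(o_c, s, q, D):
    """z_c computed only from o_c, s = sum(o), and q = sum(o**2).

    Uses mu = s / D and sigma^2 = q / D - mu^2, so the loss never
    needs the full D-dimensional output vector.
    """
    mu = s / D
    sigma = np.sqrt(q / D - mu**2)
    return (o_c - mu) / sigma

o = np.array([0.3, -1.2, 2.0, 0.7, -0.5])
c = 2
direct = (o[c] - o.mean()) / o.std()
via_stats = z_score_from_stats(o[c], o.sum(), (o**2).sum(), len(o))
assert np.isclose(direct, via_stats)
```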
4. Comparison to Log-Softmax and Alternative Losses
Computational Complexity and Stability
| Loss function | Per-example complexity | Shift-invariant | Scale-invariant |
|---|---|---|---|
| Log-softmax | O(D·d) | Yes | No |
| Hierarchical softmax | O(d·log D) | Yes | No |
| Z-loss | O(d²) | Yes | Yes |
Numerical stability is enhanced in the Z-loss by Z-normalization, which bounds |z_c| by √(D − 1). The softplus nonlinearity makes the gradient vanish for large z_c, which prevents the explosion of pre-activation magnitudes. In contrast, the gradient of the log-softmax with respect to the target pre-activation, p_c − 1 (with p_c the softmax probability of the target), never vanishes, and the pre-activations can become unbounded.
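These gradient properties can be checked numerically. A small sketch, assuming the softplus form of the loss (hypothetical helper names):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

alpha, beta = 2.0, 0.0

# Gradient of softplus(alpha * (beta - z)) with respect to z
def zloss_grad(z):
    return -alpha * sigmoid(alpha * (beta - z))

for z in [0.0, 2.0, 5.0, 10.0]:
    print(z, zloss_grad(z))
# magnitude decays toward 0 as z grows: no pressure to inflate scores

# Log-softmax gradient w.r.t. the target pre-activation is p_c - 1,
# which stays strictly negative even when the target already dominates.
o = np.array([10.0, 0.0, 0.0])
p = np.exp(o - o.max())
p /= p.sum()
print(p[0] - 1.0)  # still nonzero, keeps pushing o_c upward
```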
Theoretical Distinctions
The two hyperparameters α and β of the Z-loss provide additional flexibility for surrogate-loss shaping, enabling closer approximation to various ranking losses relevant for top-k metrics. The log-softmax offers no such flexibility, as its "temperature" can be absorbed by rescaling the pre-activations o.
5. Empirical Evaluation
Penn Treebank (D=10,000)
Z-loss was evaluated against MSE, Taylor-softmax, cross-entropy-sigmoid, and log-softmax using top-k error rates and Mean Reciprocal Rank. Z-loss produced the best error rates for larger values of k, with competitive performance for smaller k. Cross-entropy-sigmoid outperformed log-softmax for top-1, while MSE and Taylor-softmax underperformed for large k. Hyperparameter choice proved critical: higher settings improved the high-k metrics. Z-normalization alone brought only modest improvements.
One Billion Word Dataset (D=793,471)
For a network with 793,471 outputs, the Z-loss enabled an epoch processing time of 2.81 hours (whole model), compared to 4.56 days for naïve softmax and 12.23 hours for hierarchical softmax. Final test top-1/top-20 error rates and total training time:
| Loss | Architecture | Top-1 | Top-20 | Training time |
|---|---|---|---|---|
| Constant | — | 95.44% | 65.58% | — |
| Softmax | net1 | — | — | ≈ 40 days |
| H-softmax | net1 | 71.0% | 35.73% | 4.08 days |
| Z-loss | net1 | 72.13% | 36.43% | 0.97 days |
| Z-loss | net2 | 70.77% | 38.29% | 3.14 days |
The Z-loss maintained superior training efficiency, with performance competitive to hierarchical softmax and substantially faster convergence.
6. Hyperparameterization, Adaptation, and Practical Usage
The hyperparameters α and β in

    L_Z(o, c) = log(1 + exp(α(β − z_c)))

are central to tailoring the Z-loss to particular ranking objectives (e.g., different values of k in top-k error). Empirical sweeps over α and β are recommended, typically executed during initial training epochs. The computational advantage of the Z-loss is maximized for large D, making it well suited to extreme classification and large-scale language modeling.
The combination of shift and scale invariance, coupled with bounded gradients, offers improved numerical properties and mitigates the risk of feature explosion.
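The roles of the two hyperparameters can be seen by evaluating the surrogate over a range of Z-normalized scores. A minimal sketch, assuming the softplus form softplus(α(β − z_c)) (hypothetical helper name):

```python
import numpy as np

def z_surrogate(z, alpha, beta):
    # softplus(alpha * (beta - z)): the per-example Z-loss as a
    # function of the Z-normalized target score z
    return np.logaddexp(0.0, alpha * (beta - z))

z = np.linspace(-2, 4, 7)
for alpha, beta in [(1.0, 0.0), (4.0, 0.0), (1.0, 2.0)]:
    print(alpha, beta, np.round(z_surrogate(z, alpha, beta), 3))
# Larger alpha sharpens the step; larger beta moves it to higher z,
# demanding a larger normalized margin before the loss vanishes.
```

Sharper settings of α make the surrogate behave more like a hard 0/1 ranking penalty on z_c, while β sets how far above the mean the target score must sit.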
7. Extensions and Future Directions
It has been observed that applying Z-normalization alone (removing the softplus) to other losses introduces tunable hyperparameters, but the combined Z-loss configuration outperforms these alternatives in empirical evaluations. A suggested direction is the dynamic adaptation of α and β during training to maintain optimal alignment of the surrogate loss with the evolving task metric.
The Z-loss thus exemplifies a simple yet computationally efficient loss function that leverages spherical family structure and affine invariance, supporting state-of-the-art training regimes for classification problems with very large output spaces (Brébisson et al., 2016).