
Spherical Loss Functions

Updated 17 February 2026
  • Spherical loss functions are defined by their dependence on three key statistics—sum, sum of squares, and true class output—ensuring rotational invariance.
  • They enable efficient computation through O(d²) algorithms, facilitating scalable training in extreme classification and large vocabulary tasks.
  • Variants like Taylor-softmax, spherical softmax, and Z-loss offer tailored invariance and performance benefits, balancing accuracy with computational efficiency.

The spherical family of loss functions comprises a distinct class of objective functions—primarily for multi-class classification and high-dimensional estimation—characterized by rotational invariance and efficient computability. In classification contexts, a loss is spherical if it depends only on three statistics of the model’s output vector: the sum of entries, the sum of squared entries, and the entry corresponding to the true class. These properties enable exact, output-size-independent training algorithms, and have motivated a spectrum of losses—such as Taylor-softmax, spherical softmax, and the Z-loss—each offering targeted invariance or computational advantages (Brébisson et al., 2016, Vincent et al., 2016, Brébisson et al., 2015). In estimation and shrinkage, spherical (orthogonally invariant) losses are similarly defined by their dependence on squared Euclidean distances, leading to minimax results and tractable dominant estimators (Hobbad et al., 2021).

1. Formal Definition and Characterization

A multi-class loss $L(o, c)$ is in the spherical family if it can be written as a function of three scalar summaries of the prediction vector $o \in \mathbb{R}^D$ (pre-activations):

$$L(o, c) = F\left(\sum_{k=1}^D o_k,\ \sum_{k=1}^D o_k^2,\ o_c\right)$$

for some $F:\mathbb{R}^3\to\mathbb{R}$, where $c$ is the true class index. This dependency is exhaustive: no other combination of the output vector's coordinates is admitted (Brébisson et al., 2016, Brébisson et al., 2015). Examples subsumed in this form include:

  • Mean-squared error (MSE) to one-hot labels: $L_{\mathrm{MSE}}(o,c)=\frac{1}{2}\left(\sum_k o_k^2 - 2o_c + 1\right)$
  • Taylor-softmax NLL: $L_T(o,c) = -\log t_c(o)$ with $t_i(o) = \left(1+o_i+\frac{1}{2}o_i^2\right) / \sum_k \left(1+o_k+\frac{1}{2}o_k^2\right)$
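As a numerical illustration of the MSE case above, a minimal NumPy sketch (function names are ours) confirming that the loss is recoverable from the spherical statistics alone:

```python
import numpy as np

def mse_one_hot(o, c):
    """MSE between outputs o and the one-hot vector for class c."""
    y = np.zeros_like(o)
    y[c] = 1.0
    return 0.5 * np.sum((o - y) ** 2)

def mse_spherical(o, c):
    """The same loss expressed through the spherical statistics."""
    q = np.sum(o ** 2)  # sum of squares; the sum s is admitted but unused here
    return 0.5 * (q - 2.0 * o[c] + 1.0)

o = np.array([0.2, -1.3, 0.7, 2.1])
assert np.isclose(mse_one_hot(o, 2), mse_spherical(o, 2))
```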

In estimation, spherical loss functions rely on orthogonal invariance: risk and loss depend only on squared Euclidean distances, leading to invariance under simultaneous rotation of all parameter and estimator vectors (Hobbad et al., 2021). Canonical forms include both “balanced” and “convex-combined” loss families:

$$L_{\omega,\rho}(\delta, \theta) = \omega\,\rho\left(\|\delta - \delta_0(X)\|^2\right) + (1-\omega)\,\rho\left(\|\delta - \theta\|^2\right)$$

$$L_{\omega,\ell}(\delta, \theta) = \ell\left(\omega\|\delta - \delta_0(X)\|^2 + (1-\omega)\|\delta - \theta\|^2\right)$$

with $\rho$, $\ell$ concave and increasing, and $\delta_0$ a target estimator (Hobbad et al., 2021).
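A small sketch of the balanced family, assuming the illustrative concave penalty $\rho(t) = \sqrt{t}$ and an arbitrary weight $\omega$, verifying the orthogonal invariance described above:

```python
import numpy as np

def balanced_loss(delta, delta0, theta, omega=0.3, rho=np.sqrt):
    """L_{omega,rho}: a weighted sum of concave penalties on the two
    squared distances (proximity to delta0, fidelity to theta)."""
    return (omega * rho(np.sum((delta - delta0) ** 2))
            + (1.0 - omega) * rho(np.sum((delta - theta) ** 2)))

# Orthogonal invariance: rotating every vector by the same matrix R
# leaves the loss unchanged, since it depends only on squared distances.
rng = np.random.default_rng(0)
d = 4
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
delta, delta0, theta = rng.standard_normal((3, d))
assert np.isclose(balanced_loss(delta, delta0, theta),
                  balanced_loss(R @ delta, R @ delta0, R @ theta))
```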

2. Computational Advantages and Algorithms

The principal operational advantage of the spherical family in classification is that all gradients and weight updates can be computed without explicit construction or storage of high-dimensional output layers. For a neural net with last hidden state $h\in\mathbb{R}^d$ and output $o = Wh$, the standard $O(Dd)$ cost (with $D$ classes) is replaced by $O(d^2)$ via factored output representations (Vincent et al., 2016). The procedure exploits the fact that the loss and its gradient depend only on:

  • $s = \sum_k o_k$ ($O(d)$ to compute via a maintained summary vector)
  • $q = \sum_k o_k^2$ ($O(d^2)$ using the Gram matrix)
  • $o_c$, requiring access only to the true-class row of $W$, if needed.

These are achieved by factorizing $W = VU$, maintaining the Gram matrix $Q = W^\top W$ and the mean vector $\bar{w} = W^\top 1_D$, and applying rank-one/rank-two updates after each step. This enables exact and scalable learning in extreme-classification regimes (Vincent et al., 2016, Brébisson et al., 2016).
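The bookkeeping can be sketched as follows. This is a simplified illustration of the cached-statistics idea (omitting the $W = VU$ factorization and the rank-one update machinery of the full algorithm), showing that $s$, $q$, and $o_c$ are obtainable without ever forming $o = Wh$:

```python
import numpy as np

rng = np.random.default_rng(1)
D, d = 10_000, 32             # many output classes, small hidden size
W = rng.standard_normal((D, d)) / np.sqrt(d)

# Caches kept in sync across training (the full algorithm refreshes
# them with rank-one/rank-two corrections after each weight update):
Q = W.T @ W                   # d x d Gram matrix
w_bar = W.T @ np.ones(D)      # W^T 1_D, the column-sum vector

h = rng.standard_normal(d)

# Spherical statistics without materializing o = W h (which costs O(Dd)):
s = w_bar @ h                 # sum_k o_k,   O(d)
q = h @ Q @ h                 # sum_k o_k^2, O(d^2)
c = 7
o_c = W[c] @ h                # single row lookup, O(d)

o = W @ h                     # naive path, for checking only
assert np.isclose(s, o.sum())
assert np.isclose(q, np.sum(o ** 2))
assert np.isclose(o_c, o[c])
```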

3. Spherical Loss Variants: Taxonomy and Properties

Multiple losses belong to the spherical family, each aligning with distinct invariance or empirical objectives:

| Loss | Definition/Formula | Key Properties |
|---|---|---|
| MSE | $L_{\rm MSE} = \frac{1}{2}(\sum_k o_k^2 - 2o_c + 1)$ | Spherical, convex |
| Taylor-softmax | $L_T(o,c) = -\log t_c(o)$ as above | Spherical, smooth, underperforms for large $D$ |
| Spherical softmax | $p_i = o_i^2 / \sum_j o_j^2$, $L = -\log p_c$ | Spherical, scale-invariant, needs stabilization |
| Z-loss | $L_Z(z_c) = \frac{1}{a}\log(1+\exp(a(b-z_c)))$ with $z_k = (o_k-\mu)/\sigma$ | Spherical, shift+scale invariant, tunable via $a, b$ |

All spherical losses support $O(d^2)$ updates and benefit from implicit orthogonality and competitive score dynamics among classes (Brébisson et al., 2016, Brébisson et al., 2015). The Z-loss, in particular, introduces shift- and scale-invariance by operating on Z-normalized outputs $z_c$, and combines a softplus nonlinearity with adjustable parameters for matching specific ranking losses (e.g., top-$k$ error rates) (Brébisson et al., 2016).
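A minimal sketch of the Z-loss as defined in the table above; the default values of $a$ and $b$ are placeholder hyperparameters, and the stabilized softplus is our implementation choice:

```python
import numpy as np

def z_loss(o, c, a=10.0, b=1.0):
    """Z-loss: a softplus applied to the Z-normalized true-class score.
    Spherical, since mu and sigma are functions of sum(o) and sum(o^2)."""
    mu = o.mean()
    sigma = o.std()
    z_c = (o[c] - mu) / sigma
    # numerically stable softplus: log(1 + exp(x)) via logaddexp
    return np.logaddexp(0.0, a * (b - z_c)) / a

o = np.array([0.1, 2.5, -0.4, 0.8])
# Invariance to shift and (positive) rescaling of the raw outputs:
assert np.isclose(z_loss(o, 1), z_loss(3.0 * o + 5.0, 1))
```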

4. Empirical Performance and Evaluation

Empirical studies confirm that spherical-family losses are competitive, particularly in large-output settings or when alternative task metrics (such as top-$k$ accuracy) are prioritized. On Penn Tree Bank ($D = 10^4$), the Z-loss, after tuning, matched or outperformed softmax NLL in top-5 and top-20 error rates (Brébisson et al., 2016):

| Loss | Top-1 | Top-5 | Top-10 | Top-20 |
|---|---|---|---|---|
| Softmax | 30.5% | 14.9% | 10.8% | 7.8% |
| Z-loss | 30.7% | 13.8% | 10.0% | 7.1% |

For very large $D$ (e.g., One Billion Word, $D \approx 8\times10^5$), the Z-loss enabled exact gradient-based training up to 40$\times$ faster than naive softmax and $\sim$4$\times$ faster than hierarchical softmax, at a small cost in top-$k$ accuracy (Brébisson et al., 2016). On low-dimensional outputs (e.g., MNIST, CIFAR-10), spherical losses (especially log-Taylor-softmax) are often competitive with or slightly superior to softmax (Brébisson et al., 2015), but for very large vocabularies the standard softmax typically achieves the best perplexity and ranking metrics.

5. Spherical Family in Shrinkage Estimation

In estimation and statistical shrinkage, spherical loss functions are defined by orthogonal invariance—loss and risk depend only on squared distances—enabling extensions of Baranchik-type minimax shrinkage estimators to broad settings. Specifically, for any spherically symmetric data distribution in $\mathbb{R}^d$ (density $f(\|x-\theta\|^2)$), and for losses depending on $\|\delta - \delta_0\|^2$ and $\|\delta - \theta\|^2$, explicit forms of minimax shrinkage estimators can be constructed that uniformly lower risk compared to the naive estimator $X$ (Hobbad et al., 2021). These results generalize classical James–Stein and Brandwein–Strawderman risk bounds to more general concave penalty functions.
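For intuition, the classical James–Stein estimator (the simplest Baranchik-type rule, here under ordinary squared error loss) can be checked by Monte Carlo; the Gaussian setting and sample sizes below are illustrative:

```python
import numpy as np

def james_stein(x):
    """Classical James-Stein shrinkage toward the origin (requires d >= 3)."""
    d = x.shape[-1]
    return (1.0 - (d - 2) / np.sum(x ** 2, axis=-1, keepdims=True)) * x

# Monte Carlo risk under squared error, X ~ N(theta, I_d):
rng = np.random.default_rng(0)
d, n = 10, 20_000
theta = np.full(d, 0.5)
X = theta + rng.standard_normal((n, d))
risk_naive = np.mean(np.sum((X - theta) ** 2, axis=1))           # ~ d
risk_js = np.mean(np.sum((james_stein(X) - theta) ** 2, axis=1))
assert risk_js < risk_naive   # shrinkage dominates the naive estimator X
```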

6. Spherical Family for Embedding Geometries

The spherical family also extends to losses enforcing spherical geometry in embeddings, such as those optimized over the unit hypersphere in metric learning and retrieval applications. Spherical softmax classifiers, as well as margin-based angular variants (e.g., ArcFace, CosFace, SphereFace), compute class probabilities as

$$p(y \mid z) = \frac{\exp(\beta \cos \theta_y)}{\sum_{j=1}^Y \exp(\beta \cos \theta_j)}$$

where $\|z\|_2 = 1$, $\|w_j\|_2 = 1$, and $\theta_j = \arccos(w_j^\top z)$. Probabilistic variants such as the von Mises–Fisher (vMF) loss model both embeddings and weights as stochastic samples from vMF distributions, with built-in uncertainty through the concentration parameter $\kappa$ (Scott et al., 2021). Empirical comparisons indicate that spherical losses can consistently improve both accuracy and calibration across fixed-set and retrieval tasks, especially when normalization and angular discrimination are desired (Scott et al., 2021).
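A minimal sketch of the cosine classifier defined above; the function name, the inverse-temperature argument `beta`, and the max-subtraction stabilization are our choices:

```python
import numpy as np

def spherical_softmax_probs(z, Wc, beta=16.0):
    """Cosine classifier: both the embedding z and the class weight
    rows of Wc are projected to the unit hypersphere before scoring."""
    z = z / np.linalg.norm(z)
    Wc = Wc / np.linalg.norm(Wc, axis=1, keepdims=True)
    logits = beta * (Wc @ z)      # beta * cos(theta_j) for each class j
    logits -= logits.max()        # stabilize the exponentials
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
p = spherical_softmax_probs(rng.standard_normal(8), rng.standard_normal((5, 8)))
assert np.isclose(p.sum(), 1.0) and np.all(p > 0)
```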

7. Practical Considerations and Implementation

Practical exploitation of the spherical family requires that the loss depend only on the designated sufficient statistics. Implementation of $O(d^2)$ routines is then straightforward via matrix factorizations and cached summary statistics. For the Z-loss, insertion into modern deep learning frameworks (e.g., PyTorch) requires only modifications to output normalization and activation, with the option to activate the factored weight update for full $D$-independent efficiency (Brébisson et al., 2016). Tuning of loss hyperparameters (e.g., $a, b$ for the Z-loss) is best performed against the actual evaluation metric. These techniques are recommended for extreme classification, large-vocabulary language modeling, and settings requiring specific invariance properties or top-$k$ performance. For low-output settings, log-Taylor-softmax may be favored for its stability and accuracy (Brébisson et al., 2015); for very large $D$, the Z-loss yields the best computational scalability with minimal test-loss degradation (Brébisson et al., 2016, Vincent et al., 2016).
