Bi-Tempered Softmax in Neural Classification

Updated 10 February 2026
  • Bi-Tempered Softmax is a two-parameter extension of softmax that applies different temperatures to the exponential and logarithmic functions.
  • It leverages Bregman divergence theory to produce bounded, proper loss functions that enhance robustness against label noise.
  • The method facilitates flexible learning dynamics with non-convex loss landscapes, offering a robust alternative to standard softmax in noisy settings.

Bi-Tempered Softmax refers to a two-parameter extension of the standard softmax and cross-entropy framework for multiclass neural network classification. The formulation introduces separate temperature parameters into the exponential and logarithmic components, generalizing both the prediction (activation) and the loss. The approach is grounded in Bregman divergence theory and yields loss functions that are proper, can be made bounded, and confer significant robustness to noise, especially label noise. The bi-tempered formulation also admits loss landscapes that are non-convex even for single-layer models, allowing a spectrum of trade-offs between learning dynamics and robustness (Amid et al., 2019).

1. Tempered Exponential and Logarithm Definitions

The bi-tempered framework uses two one-parameter deformations of the classic exponential and logarithm:

  • Tempered logarithm with temperature $t_1$:

\log_{t_1}(x) \coloneqq \frac{x^{1-t_1}-1}{1-t_1}, \qquad x > 0.

For $t_1 \to 1$, $\log_{t_1}(x) \to \ln(x)$. For $0 \le t_1 < 1$, $\log_{t_1}(x)$ is bounded below by $-1/(1-t_1)$.

  • Tempered exponential with temperature $t_2$:

\exp_{t_2}(u) \coloneqq \left[1 + (1 - t_2)\,u\right]_+^{1/(1-t_2)}, \qquad [a]_+ \coloneqq \max\{a, 0\}.

As $t_2 \to 1$, $\exp_{t_2}(u) \to e^u$. For $t_2 > 1$, $\exp_{t_2}$ decays only polynomially as $u \to -\infty$, a heavier tail than the standard exponential.
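These two deformations are straightforward to implement. A minimal sketch, assuming NumPy; the names `log_t` and `exp_t` are illustrative, not taken from any reference implementation:

```python
import numpy as np

def log_t(x, t):
    """Tempered logarithm; reduces to ln(x) as t -> 1."""
    if abs(t - 1.0) < 1e-12:
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def exp_t(u, t):
    """Tempered exponential; reduces to exp(u) as t -> 1."""
    if abs(t - 1.0) < 1e-12:
        return np.exp(u)
    # The [.]_+ clip matters for t < 1, where exp_t reaches exactly zero
    # for sufficiently negative arguments; for t > 1 the bracket stays
    # positive on u <= 0 and the function decays polynomially.
    return np.maximum(1.0 + (1.0 - t) * u, 0.0) ** (1.0 / (1.0 - t))
```

The two functions are mutual inverses on their common domain, and the lower bound $-1/(1-t_1)$ on the tempered log is what later makes the loss bounded.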

2. Bi-Tempered Softmax Activation

Given logits $\bm{a} = (a_1, \ldots, a_k) \in \mathbb{R}^k$, the tempered softmax with temperature $t_2$ is obtained by computing a normalizer $\lambda_{t_2}(\bm{a})$ as the root of

\sum_{i=1}^k \exp_{t_2}\!\left(a_i - \lambda_{t_2}(\bm{a})\right) = 1,

yielding the predicted probabilities

\hat{y}_i = \exp_{t_2}\!\left(a_i - \lambda_{t_2}(\bm{a})\right), \qquad i = 1, \ldots, k.

Unlike the standard softmax, this normalization cannot be performed by simply dividing $\exp_{t_2}(a_i)$ by the sum $\sum_j \exp_{t_2}(a_j)$: since $\exp_{t_2}(u + v) \ne \exp_{t_2}(u)\,\exp_{t_2}(v)$ for $t_2 \ne 1$, the shift by $\lambda_{t_2}(\bm{a})$ must be applied inside the argument before exponentiation.

The normalizer λt2(a)\lambda_{t_2}(\bm{a}) generally lacks a closed form and is solved by one-dimensional root-finding methods, such as binary search.
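A minimal sketch of this computation, assuming NumPy; `tempered_softmax` and `exp_t` are illustrative names, and the upper bracket for the bisection is grown geometrically until the tempered sum drops below one:

```python
import numpy as np

def exp_t(u, t):
    # Tempered exponential, [.]_+ clipped.
    return np.maximum(1.0 + (1.0 - t) * u, 0.0) ** (1.0 / (1.0 - t))

def tempered_softmax(a, t2, iters=60):
    a = np.asarray(a, dtype=float)
    # At lambda = max(a) one term equals 1, so the sum is >= 1:
    # max(a) is always a valid lower bracket for the root.
    lo, hi = a.max(), a.max() + 1.0
    while exp_t(a - hi, t2).sum() > 1.0:   # grow the upper bracket
        hi += hi - lo
    for _ in range(iters):                  # bisection on the 1-D root
        mid = 0.5 * (lo + hi)
        if exp_t(a - mid, t2).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    return exp_t(a - 0.5 * (lo + hi), t2)
```

As $t_2 \to 1$ this recovers the ordinary softmax; keeping $\lambda_{t_2} \ge \max_i a_i$ also keeps the bracket inside $\exp_{t_2}$ positive for $t_2 > 1$, avoiding overflow.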

3. Bi-Tempered Logistic Loss

For a ground-truth distribution $\bm{y} \in \Delta^{k-1}$ and the model prediction $\hat{\bm{y}}$, the bi-tempered logistic loss for temperatures $(t_1, t_2)$ is defined as

L_{t_1,t_2}(\bm{a} \mid \bm{y}) = \sum_{i=1}^k \left[ y_i \log_{t_1} y_i - y_i \log_{t_1} \hat{y}_i - \frac{1}{2-t_1}\left( y_i^{2-t_1} - \hat{y}_i^{2-t_1} \right) \right].

For a one-hot target with $y_c = 1$, this simplifies to

L_{t_1,t_2}(\bm{a} \mid \bm{y}) = -\log_{t_1}(\hat{y}_c) - \frac{1}{2-t_1}\left( 1 - \sum_{i=1}^k \hat{y}_i^{2-t_1} \right).
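The one-hot simplification can be checked numerically against the general definition. The sketch below (assuming NumPy; `log_t`, `loss_general`, and `loss_onehot` are illustrative names) treats the prediction $\hat{\bm{y}}$ as given:

```python
import numpy as np

def log_t(x, t):
    # Tempered logarithm (finite at x = 0 for t < 1).
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def loss_general(y, y_hat, t1):
    # Bi-tempered logistic loss for an arbitrary target distribution y.
    return np.sum(y * log_t(y, t1) - y * log_t(y_hat, t1)
                  - (y ** (2.0 - t1) - y_hat ** (2.0 - t1)) / (2.0 - t1))

def loss_onehot(y_hat, c, t1):
    # One-hot simplification with y_c = 1.
    return (-log_t(y_hat[c], t1)
            - (1.0 - np.sum(y_hat ** (2.0 - t1))) / (2.0 - t1))
```

Both forms agree for any prediction, and the loss vanishes when $\hat{\bm{y}} = \bm{y}$, as expected of a Bregman divergence.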

4. Gradient and Optimization

The gradient of the bi-tempered logistic loss with respect to the logits involves both temperature parameters:

  • The softmax derivative:

\frac{\partial \hat{y}_j}{\partial a_i} = \hat{y}_j^{t_2} \left( \delta_{ij} - \frac{\hat{y}_i^{t_2}}{\sum_{\ell} \hat{y}_\ell^{t_2}} \right).

  • The loss derivative with respect to probabilities:

\frac{\partial L_{t_1,t_2}}{\partial \hat{y}_i} = -y_i\, \hat{y}_i^{-t_1} + \hat{y}_i^{1-t_1} = \hat{y}_i^{-t_1}\left( \hat{y}_i - y_i \right).

Applying the chain rule,

\frac{\partial L_{t_1,t_2}}{\partial a_i} = \sum_{j=1}^k (\hat{y}_j - y_j)\, \hat{y}_j^{t_2 - t_1} \left( \delta_{ij} - \frac{\hat{y}_i^{t_2}}{\sum_{\ell} \hat{y}_\ell^{t_2}} \right).

Both $t_1$ and $t_2$ modulate the error weighting and the structure of the Jacobian.
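The closed-form gradient above can be validated against central finite differences. The sketch below (assuming NumPy; helper names are illustrative, and the normalizer is found by bisection as described in Section 2) is written for the typical regime $0 \le t_1 < 1 < t_2$:

```python
import numpy as np

def exp_t(u, t):
    return np.maximum(1.0 + (1.0 - t) * u, 0.0) ** (1.0 / (1.0 - t))

def log_t(x, t):
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def tempered_softmax(a, t2, iters=80):
    # Bisection on the normalizer lambda_t2; max(a) is a lower bracket.
    lo, hi = a.max(), a.max() + 1.0
    while exp_t(a - hi, t2).sum() > 1.0:
        hi += hi - lo
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if exp_t(a - mid, t2).sum() > 1.0 else (lo, mid)
    return exp_t(a - 0.5 * (lo + hi), t2)

def loss(a, y, t1, t2):
    yh = tempered_softmax(a, t2)
    return np.sum(y * log_t(y, t1) - y * log_t(yh, t1)
                  - (y ** (2.0 - t1) - yh ** (2.0 - t1)) / (2.0 - t1))

def grad(a, y, t1, t2):
    # dL/da_i = sum_j (yh_j - y_j) yh_j^{t2-t1} (delta_ij - yh_i^{t2} / sum yh^{t2})
    yh = tempered_softmax(a, t2)
    r = (yh - y) * yh ** (t2 - t1)
    w = yh ** t2
    return r - w * (r.sum() / w.sum())
```

The gradient components sum to zero, reflecting the shift invariance of the prediction: adding a constant to all logits shifts $\lambda_{t_2}$ by the same constant and leaves $\hat{\bm{y}}$ unchanged.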

5. Theoretical Properties and Robustness

Key mathematical properties include:

  • Convexity: For $t_1 = t_2$, the loss is a Bregman divergence and convex in $\bm{a}$ over its domain. For $t_1 < t_2$, the loss is non-convex in $\bm{a}$.
  • Boundedness: If $0 \le t_1 < 1$, the tempered logarithm is bounded below, yielding a bounded loss and limiting sensitivity to outliers.
  • Robustness to Noise: For $t_2 > 1$, the heavy tail of the tempered exponential increases the spread of $\hat{\bm{y}}$, attenuating overfitting to small-margin or noisy examples.
  • Divergence Connections: The loss is derived from a Bregman divergence generated by

F_{t_1}(\bm{y}) = \sum_i \left( y_i \log_{t_1} y_i + \frac{1}{2-t_1}\left( 1 - y_i^{2-t_1} \right) \right),

which coincides, up to affine terms, with the β-divergences and Tsallis divergences. The bi-tempered loss is proper and Bayes-risk consistent in the multiclass setting.
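The boundedness property for $0 \le t_1 < 1$ is easy to see numerically. A small sketch (assuming NumPy; `log_t` is an illustrative name): even as the probability assigned to the true class approaches zero, the tempered-log term stays below $1/(1-t_1)$, whereas the standard cross-entropy term $-\ln(\hat{y}_c)$ diverges.

```python
import numpy as np

def log_t(x, t):
    # Tempered logarithm, bounded below by -1/(1 - t) for 0 <= t < 1.
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

t1 = 0.5
for p in [1e-2, 1e-6, 1e-12]:
    tempered = -log_t(p, t1)   # stays below 1 / (1 - t1) = 2.0
    standard = -np.log(p)      # grows without bound as p -> 0
    assert tempered < 1.0 / (1.0 - t1)
```

This cap on the per-example penalty is the mechanism by which mislabeled examples cannot dominate the total loss.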

6. Implementation Aspects

The forward and backward computations fit readily into standard deep learning pipelines. The following code summarizes the essential steps (Amid et al., 2019, Algorithm 6):

import numpy as np

def log_t1(x, t1):
    return (x ** (1 - t1) - 1) / (1 - t1)

def exp_t2(u, t2):
    return np.maximum(1 + (1 - t2) * u, 0) ** (1 / (1 - t2))

def BiTemperedSoftmaxLoss(a, y, t1, t2, iters=60):
    # a: logits
    # y: target distribution (one-hot or otherwise)
    # typical regime: 0 <= t1 < 1 < t2

    # Forward pass: locate the normalizer lambda_t2 by bisection;
    # lambda >= max(a) guarantees the tempered sum starts >= 1.
    lo, hi = a.max(), a.max() + 1.0
    while exp_t2(a - hi, t2).sum() > 1:   # grow the upper bracket
        hi += hi - lo
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if exp_t2(a - mid, t2).sum() > 1 else (lo, mid)
    y_hat = exp_t2(a - (lo + hi) / 2, t2)

    term1 = y * log_t1(y, t1)
    term2 = y * log_t1(y_hat, t1)
    term3 = (y ** (2 - t1) - y_hat ** (2 - t1)) / (2 - t1)
    return np.sum(term1 - term2 - term3)

In practice, frameworks either rely on automatic differentiation for the backward pass or code the derivatives above explicitly. For numerically stable and efficient inference, the normalization parameter $\lambda_{t_2}$ is routinely found by one-dimensional root-finding.

7. Context, Relations, and Applications

Bi-tempered softmax generalizes and unifies a family of robust losses previously studied, improving upon prior two-temperature schemes utilizing the Tsallis divergence, as shown empirically and theoretically (Amid et al., 2019). The robustness conferred by boundedness (for appropriate $t_1$) and heavy-tailed prediction distributions (for $t_2 > 1$) is especially advantageous in settings with high label noise. The methodology is applicable as a drop-in replacement for standard softmax/cross-entropy layers in deep neural networks, requiring only tuning of the two temperatures. The foundational analysis by E. Amid, M. Warmuth, R. Anil, and T. Koren established superior noise-robust performance on large datasets and clarified the mathematical underpinnings in terms of Bregman divergences. The bi-tempered approach further connects to the broader literature on proper scoring rules and generalized information divergences.
