Bi-Tempered Softmax in Neural Classification

Updated 10 February 2026
  • Bi-Tempered Softmax is a two-parameter extension of softmax that applies different temperatures to the exponential and logarithmic functions.
  • It leverages Bregman divergence theory to produce bounded, proper loss functions that enhance robustness against label noise.
  • The method facilitates flexible learning dynamics with non-convex loss landscapes, offering a robust alternative to standard softmax in noisy settings.

Bi-Tempered Softmax refers to a two-parameter extension of the standard softmax and cross-entropy framework for multiclass neural network classification. The formulation introduces separate temperature parameters into the exponential and logarithmic components, generalizing both the prediction (activation) and the loss. The approach is grounded in Bregman divergence theory and yields loss functions that are proper, can be made bounded, and confer significant robustness to noise, especially label noise. The bi-tempered formulation also admits loss landscapes that are non-convex even for single-layer models, allowing a spectrum of trade-offs between learning dynamics and robustness (Amid et al., 2019).

1. Tempered Exponential and Logarithm Definitions

The bi-tempered framework uses two one-parameter deformations of the classic exponential and logarithm:

  • Tempered logarithm with temperature $t_1$:

\log_{t_1}(x) \coloneqq \frac{x^{1-t_1}-1}{1-t_1}, \qquad x > 0.

For $t_1 \to 1$, $\log_{t_1}(x) \to \ln(x)$. For $0 \le t_1 < 1$, $\log_{t_1}(x)$ is bounded below by $-1/(1-t_1)$.

  • Tempered exponential with temperature $t_2$:

\exp_{t_2}(u) \coloneqq \left[1 + (1 - t_2)\,u\right]_+^{1/(1-t_2)}, \qquad [a]_+ \coloneqq \max\{a, 0\}.

As $t_2 \to 1$, $\exp_{t_2}(u) \to e^u$. For $t_2 > 1$, $\exp_{t_2}$ decays only polynomially as $u \to -\infty$, a heavier tail than the standard exponential.
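These two deformations are straightforward to implement. A minimal sketch, assuming NumPy; the names `log_t` and `exp_t` are illustrative, not taken from any reference implementation:

```python
import numpy as np

def log_t(x, t):
    """Tempered logarithm; reduces to ln(x) as t -> 1."""
    if abs(t - 1.0) < 1e-12:
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def exp_t(u, t):
    """Tempered exponential; reduces to exp(u) as t -> 1."""
    if abs(t - 1.0) < 1e-12:
        return np.exp(u)
    # The [.]_+ clip matters for t < 1, where exp_t reaches exactly zero
    # for sufficiently negative arguments; for t > 1 the bracket stays
    # positive on u <= 0 and the function decays polynomially.
    return np.maximum(1.0 + (1.0 - t) * u, 0.0) ** (1.0 / (1.0 - t))
```

The two functions are mutual inverses on their common domain, and the lower bound $-1/(1-t_1)$ on the tempered log is what later makes the loss bounded.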

2. Bi-Tempered Softmax Activation

Given logits $\bm{a} = (a_1, \ldots, a_k) \in \mathbb{R}^k$, the tempered softmax with temperature $t_2$ is obtained by computing a normalizer $\lambda_{t_2}(\bm{a})$ as the root of

\sum_{i=1}^k \exp_{t_2}\!\left(a_i - \lambda_{t_2}(\bm{a})\right) = 1,

yielding the predicted probabilities

\hat{y}_i = \exp_{t_2}\!\left(a_i - \lambda_{t_2}(\bm{a})\right), \qquad i = 1, \ldots, k.

Unlike the standard softmax, this normalization cannot be performed by simply dividing $\exp_{t_2}(a_i)$ by the sum $\sum_j \exp_{t_2}(a_j)$: since $\exp_{t_2}(u + v) \ne \exp_{t_2}(u)\,\exp_{t_2}(v)$ for $t_2 \ne 1$, the shift by $\lambda_{t_2}(\bm{a})$ must be applied inside the argument before exponentiation.

The normalizer λt2(a)\lambda_{t_2}(\bm{a}) generally lacks a closed form and is solved by one-dimensional root-finding methods, such as binary search.
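A minimal sketch of this computation, assuming NumPy; `tempered_softmax` and `exp_t` are illustrative names, and the upper bracket for the bisection is grown geometrically until the tempered sum drops below one:

```python
import numpy as np

def exp_t(u, t):
    # Tempered exponential, [.]_+ clipped.
    return np.maximum(1.0 + (1.0 - t) * u, 0.0) ** (1.0 / (1.0 - t))

def tempered_softmax(a, t2, iters=60):
    a = np.asarray(a, dtype=float)
    # At lambda = max(a) one term equals 1, so the sum is >= 1:
    # max(a) is always a valid lower bracket for the root.
    lo, hi = a.max(), a.max() + 1.0
    while exp_t(a - hi, t2).sum() > 1.0:   # grow the upper bracket
        hi += hi - lo
    for _ in range(iters):                  # bisection on the 1-D root
        mid = 0.5 * (lo + hi)
        if exp_t(a - mid, t2).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    return exp_t(a - 0.5 * (lo + hi), t2)
```

As $t_2 \to 1$ this recovers the ordinary softmax; keeping $\lambda_{t_2} \ge \max_i a_i$ also keeps the bracket inside $\exp_{t_2}$ positive for $t_2 > 1$, avoiding overflow.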

3. Bi-Tempered Logistic Loss

For a ground-truth distribution $\bm{y} \in \Delta^{k-1}$ and the model prediction $\hat{\bm{y}}$, the bi-tempered logistic loss for temperatures $(t_1, t_2)$ is defined as

L_{t_1,t_2}(\bm{a} \mid \bm{y}) = \sum_{i=1}^k \left[ y_i \log_{t_1} y_i - y_i \log_{t_1} \hat{y}_i - \frac{1}{2-t_1}\left( y_i^{2-t_1} - \hat{y}_i^{2-t_1} \right) \right].

For a one-hot target with $y_c = 1$, this simplifies to

L_{t_1,t_2}(\bm{a} \mid \bm{y}) = -\log_{t_1}(\hat{y}_c) - \frac{1}{2-t_1}\left( 1 - \sum_{i=1}^k \hat{y}_i^{2-t_1} \right).
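The one-hot simplification can be checked numerically against the general definition. The sketch below (assuming NumPy; `log_t`, `loss_general`, and `loss_onehot` are illustrative names) treats the prediction $\hat{\bm{y}}$ as given:

```python
import numpy as np

def log_t(x, t):
    # Tempered logarithm (finite at x = 0 for t < 1).
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def loss_general(y, y_hat, t1):
    # Bi-tempered logistic loss for an arbitrary target distribution y.
    return np.sum(y * log_t(y, t1) - y * log_t(y_hat, t1)
                  - (y ** (2.0 - t1) - y_hat ** (2.0 - t1)) / (2.0 - t1))

def loss_onehot(y_hat, c, t1):
    # One-hot simplification with y_c = 1.
    return (-log_t(y_hat[c], t1)
            - (1.0 - np.sum(y_hat ** (2.0 - t1))) / (2.0 - t1))
```

Both forms agree for any prediction, and the loss vanishes when $\hat{\bm{y}} = \bm{y}$, as expected of a Bregman divergence.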

4. Gradient and Optimization

The gradient of the bi-tempered logistic loss with respect to the logits involves both temperature parameters:

  • The softmax derivative:

\frac{\partial \hat{y}_j}{\partial a_i} = \hat{y}_j^{t_2} \left( \delta_{ij} - \frac{\hat{y}_i^{t_2}}{\sum_{\ell} \hat{y}_\ell^{t_2}} \right).

  • The loss derivative with respect to probabilities:

\frac{\partial L_{t_1,t_2}}{\partial \hat{y}_i} = -y_i\, \hat{y}_i^{-t_1} + \hat{y}_i^{1-t_1} = \hat{y}_i^{-t_1}\left( \hat{y}_i - y_i \right).

Applying the chain rule,

\frac{\partial L_{t_1,t_2}}{\partial a_i} = \sum_{j=1}^k (\hat{y}_j - y_j)\, \hat{y}_j^{t_2 - t_1} \left( \delta_{ij} - \frac{\hat{y}_i^{t_2}}{\sum_{\ell} \hat{y}_\ell^{t_2}} \right).

Both $t_1$ and $t_2$ modulate the error weighting and the structure of the Jacobian.
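The closed-form gradient above can be validated against central finite differences. The sketch below (assuming NumPy; helper names are illustrative, and the normalizer is found by bisection as described in Section 2) is written for the typical regime $0 \le t_1 < 1 < t_2$:

```python
import numpy as np

def exp_t(u, t):
    return np.maximum(1.0 + (1.0 - t) * u, 0.0) ** (1.0 / (1.0 - t))

def log_t(x, t):
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def tempered_softmax(a, t2, iters=80):
    # Bisection on the normalizer lambda_t2; max(a) is a lower bracket.
    lo, hi = a.max(), a.max() + 1.0
    while exp_t(a - hi, t2).sum() > 1.0:
        hi += hi - lo
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if exp_t(a - mid, t2).sum() > 1.0 else (lo, mid)
    return exp_t(a - 0.5 * (lo + hi), t2)

def loss(a, y, t1, t2):
    yh = tempered_softmax(a, t2)
    return np.sum(y * log_t(y, t1) - y * log_t(yh, t1)
                  - (y ** (2.0 - t1) - yh ** (2.0 - t1)) / (2.0 - t1))

def grad(a, y, t1, t2):
    # dL/da_i = sum_j (yh_j - y_j) yh_j^{t2-t1} (delta_ij - yh_i^{t2} / sum yh^{t2})
    yh = tempered_softmax(a, t2)
    r = (yh - y) * yh ** (t2 - t1)
    w = yh ** t2
    return r - w * (r.sum() / w.sum())
```

The gradient components sum to zero, reflecting the shift invariance of the prediction: adding a constant to all logits shifts $\lambda_{t_2}$ by the same constant and leaves $\hat{\bm{y}}$ unchanged.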

5. Theoretical Properties and Robustness

Key mathematical properties include:

  • Convexity: For $t_1 = t_2$, the loss is a Bregman divergence and convex in $\bm{a}$ over its domain. For $t_1 < t_2$, the loss is non-convex in $\bm{a}$.
  • Boundedness: If $0 \le t_1 < 1$, the tempered logarithm is bounded below, yielding a bounded loss and limiting sensitivity to outliers.
  • Robustness to Noise: For $t_2 > 1$, the heavy tail of the tempered exponential increases the spread of $\hat{\bm{y}}$, attenuating overfitting to small-margin or noisy examples.
  • Divergence Connections: The loss is derived from a Bregman divergence generated by

F_{t_1}(\bm{y}) = \sum_i \left( y_i \log_{t_1} y_i + \frac{1}{2-t_1}\left( 1 - y_i^{2-t_1} \right) \right),

which coincides, up to affine terms, with the β-divergences and Tsallis divergences. The bi-tempered loss is proper and Bayes-risk consistent in the multiclass setting.
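The boundedness property for $0 \le t_1 < 1$ is easy to see numerically. A small sketch (assuming NumPy; `log_t` is an illustrative name): even as the probability assigned to the true class approaches zero, the tempered-log term stays below $1/(1-t_1)$, whereas the standard cross-entropy term $-\ln(\hat{y}_c)$ diverges.

```python
import numpy as np

def log_t(x, t):
    # Tempered logarithm, bounded below by -1/(1 - t) for 0 <= t < 1.
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

t1 = 0.5
for p in [1e-2, 1e-6, 1e-12]:
    tempered = -log_t(p, t1)   # stays below 1 / (1 - t1) = 2.0
    standard = -np.log(p)      # grows without bound as p -> 0
    assert tempered < 1.0 / (1.0 - t1)
```

This cap on the per-example penalty is the mechanism by which mislabeled examples cannot dominate the total loss.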

6. Implementation Aspects

The forward and backward computations fit readily into standard deep learning pipelines. The following code summarizes the essential steps (Amid et al., 2019, Algorithm 6):

import numpy as np

def log_t1(x, t1):
    return (x ** (1 - t1) - 1) / (1 - t1)

def exp_t2(u, t2):
    return np.maximum(1 + (1 - t2) * u, 0) ** (1 / (1 - t2))

def BiTemperedSoftmaxLoss(a, y, t1, t2, iters=60):
    # a: logits
    # y: target distribution (one-hot or otherwise)
    # typical regime: 0 <= t1 < 1 < t2

    # Forward pass: locate the normalizer lambda_t2 by bisection;
    # lambda >= max(a) guarantees the tempered sum starts >= 1.
    lo, hi = a.max(), a.max() + 1.0
    while exp_t2(a - hi, t2).sum() > 1:   # grow the upper bracket
        hi += hi - lo
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if exp_t2(a - mid, t2).sum() > 1 else (lo, mid)
    y_hat = exp_t2(a - (lo + hi) / 2, t2)

    term1 = y * log_t1(y, t1)
    term2 = y * log_t1(y_hat, t1)
    term3 = (y ** (2 - t1) - y_hat ** (2 - t1)) / (2 - t1)
    return np.sum(term1 - term2 - term3)

In practice, frameworks either rely on automatic differentiation for the backward pass or code the derivatives above explicitly. For numerically stable and efficient inference, the normalization parameter $\lambda_{t_2}$ is routinely found by one-dimensional root-finding.

7. Context, Relations, and Applications

Bi-tempered softmax generalizes and unifies a family of robust losses previously studied, improving upon prior two-temperature schemes utilizing the Tsallis divergence, as shown empirically and theoretically (Amid et al., 2019). The robustness conferred by boundedness (for appropriate $t_1$) and heavy-tailed prediction distributions (for $t_2 > 1$) is especially advantageous in settings with high label noise. The methodology is applicable as a drop-in replacement for standard softmax/cross-entropy layers in deep neural networks, requiring only tuning of the two temperatures. The foundational analysis by E. Amid, M. Warmuth, R. Anil, and T. Koren established superior noise-robust performance on large datasets and clarified the mathematical underpinnings in terms of Bregman divergences. The bi-tempered approach further connects to the broader literature on proper scoring rules and generalized information divergences.
