Asymmetric Loss (ASL) Overview

Updated 26 March 2026

Asymmetric Loss (ASL) is a loss function that penalizes over- and under-prediction differently, providing fine-grained control over error asymmetry.
It employs weighted minimization and tunable hyperparameters to capture domain-specific risk and improve robustness under data noise.
Empirical results in multilabel and segmentation tasks demonstrate improved mAP and noise tolerance compared to symmetric loss alternatives.

Asymmetric Loss (ASL) denotes a class of loss functions in statistical learning and optimization that introduce a systematic imbalance, via hyperparameters or functional form, in the way over-prediction and under-prediction, or positive- and negative-class errors, are penalized. This stands in contrast to symmetric losses, which assign equal cost to errors of equal magnitude regardless of direction. The asymmetric loss paradigm provides a principled means to capture domain-specific risk sensitivities, improve learning under imbalance (e.g., between classes or error types), and enhance robustness under data or label noise.

1. Formal Definitions and Foundational Properties

Formally, let $\mathcal{Y}$ denote the label space (e.g., for classification or regression), and $L:\mathcal{C}\times\mathcal{Y}\to\mathbb{R}$ a loss, with $\mathcal{C}$ the prediction space (probability simplex, $\mathbb{R}$, etc.). Asymmetric loss functions can be characterized via a weighted minimization principle: given nonnegative weights $w_1,\ldots,w_K$ with a unique maximizer $t$ (the "dominant class"),

$\arg\min_{u\in\mathcal{C}} \sum_{i=1}^K w_i L(u,i) = \arg\min_{u\in\mathcal{C}} L(u, t).$

This property ensures that minimizing the expected asymmetric loss pushes probability mass (or predictions) preferentially toward the class or outcome with maximal weight, providing fine-grained control over error asymmetry (Zhou et al., 2021). Losses with this property are termed completely asymmetric when it holds for all weight choices, or strictly asymmetric when additionally the decrease in weighted loss is strict if the dominant class probability is increased.
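As a rough numerical illustration of the weighted minimization principle, the sketch below grid-searches the simplex for the minimizer of the weighted loss. The two losses (`mae_like`, with $L(u,i)=1-u_i$, and clipped cross-entropy), the weights, and the grid resolution are illustrative assumptions, not taken from the cited papers.

```python
import itertools
import math

def simplex_grid(k, steps):
    """Yield probability vectors of length k on a regular grid of the simplex."""
    for combo in itertools.product(range(steps + 1), repeat=k - 1):
        if sum(combo) <= steps:
            last = steps - sum(combo)
            yield tuple(c / steps for c in combo) + (last / steps,)

def weighted_argmin(loss, weights, steps=20):
    """Grid-search the minimizer of sum_i w_i * loss(u, i) over the simplex."""
    return min(
        simplex_grid(len(weights), steps),
        key=lambda u: sum(w * loss(u, i) for i, w in enumerate(weights)),
    )

def mae_like(u, i):
    # Illustrative loss L(u, i) = 1 - u_i, linear in the class probability.
    return 1.0 - u[i]

def cross_entropy(u, i, eps=1e-12):
    # Log loss, clipped so it stays finite on the simplex boundary.
    return -math.log(max(u[i], eps))

weights = (0.5, 0.3, 0.2)  # unique maximizer: class 0 (the dominant class)
u_mae = weighted_argmin(mae_like, weights)
u_ce = weighted_argmin(cross_entropy, weights)
print(u_mae)  # all mass on the dominant class
print(u_ce)   # mass spread in proportion to the weights
```

Under these assumptions the linear loss puts all probability mass on the dominant class, matching $\arg\min_u L(u,t)$, whereas cross-entropy returns a minimizer proportional to the weights and so does not satisfy the property.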

Analytical properties of notable asymmetric loss families include continuity, convexity, and links to Bregman or $\phi$-divergences (e.g., the power-divergence loss, which interpolates between Kullback–Leibler, Pearson $\chi^2$, and others via an explicit asymmetry parameter) (Pearse et al., 2024).

The asymmetry ratio,

$r(\ell) = \inf_{\substack{u_1, u_2\geq 0;\ u_1+u_2\leq 1;\ 0<\Delta\leq u_2}}\frac{\ell(u_1)-\ell(u_1+\Delta)}{\ell(u_2-\Delta)-\ell(u_2)}$

quantifies how strongly the loss "pushes" mass from non-dominant to dominant classes. For weights $w_m>w_n$, the requirement $(w_m/w_n)\cdot r(\ell)\geq 1$ is critical for ensuring asymmetry (Zhou et al., 2021).
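The asymmetry ratio can be approximated numerically. The following is a minimal sketch that evaluates the ratio on a regular grid; the grid size and the two example losses are assumptions chosen for illustration, and the grid skips $\Delta=0$, where the ratio is the undefined form $0/0$.

```python
def asymmetry_ratio(ell, n=50):
    """Approximate r(ell) by a grid search over u1, u2 and delta > 0."""
    best = float("inf")
    for a in range(n + 1):
        for b in range(1, n + 1 - a):      # u1 + u2 <= 1 and u2 > 0
            u1, u2 = a / n, b / n
            for d in range(1, b + 1):      # 0 < delta <= u2
                delta = d / n
                num = ell(u1) - ell(u1 + delta)
                den = ell(u2 - delta) - ell(u2)
                if den > 0:
                    best = min(best, num / den)
    return best

def linear(u):
    # Illustrative loss ell(u) = 1 - u.
    return 1.0 - u

def squared(u):
    # Illustrative loss ell(u) = (1 - u)^2.
    return (1.0 - u) ** 2

r_linear = asymmetry_ratio(linear)
r_squared = asymmetry_ratio(squared)
print(r_linear, r_squared)
```

For the linear loss every ratio equals 1, so $(w_m/w_n)\cdot r(\ell)\geq 1$ holds for any $w_m>w_n$. For the squared example the grid estimate is close to zero and shrinks further as the grid is refined, so the condition fails unless the weight ratio is very large.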

2. Canonical Forms of Asymmetric Loss

Multiple forms of asymmetric loss have been derived or adapted for classification, regression, spatial prediction, and segmentation. Key representatives include:

Loss Family	Formula/Definition (summary)	Key Parameters
Piecewise-linear (quantile)	$L(\varepsilon)=k_1\varepsilon$ if $\varepsilon\geq 0$, $-k_2\varepsilon$ if $\varepsilon<0$	$k_1, k_2>0$
Power-divergence [PDL]	$L_{PDL,\lambda}(\delta,Y)=\frac{1}{\lambda(\lambda+1)} \{Y[(Y/\delta)^\lambda-1]+\lambda(\delta-Y)\}$	$\lambda\in\mathbb{R}$
Asymmetric Loss (ASL)	$L_{ASL}(p,y)=\begin{cases}-(1-p)^{\gamma_+}\log p, & y=1\\ -p_t^{\gamma_-}\log (1-p_t), & y=0\end{cases}$ with $p_t=\max\{p-t,0\}$	$\gamma_+,\gamma_-, t$
Tversky/$F_\beta$	$F_\beta(P,G) = \frac{(1+\beta^2)\sum p_ig_i}{(1+\beta^2)\sum p_ig_i + \beta^2 \sum (1-p_i)g_i + \sum p_i(1-g_i)}$	$\beta$ (recall–precision tradeoff)
Polynomial-based	Summations of multiple terms in $(1-\hat y)$ / $\hat y_t$; see APL/RAL	$M,N,\{\alpha_m\},\{\beta_n\}$
AMSE (classification)	$L_{AMSE}(f(x),y)=\tfrac{1}{K}\| a\mathbf{e}_y - f(x)\|_2^2$	$a\geq 1$

The power parameter or exponents (e.g., $\lambda$, $\gamma_+$, $\gamma_-$) control the direction and degree of penalty asymmetry (Pearse et al., 2024, Ben-Baruch et al., 2020).
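To make the role of such a parameter concrete, here is a small sketch of the $F_\beta$ score from the table, with the loss taken as $1-F_\beta$. The toy masks, the function names, and the convention of returning a perfect score when the denominator is zero are assumptions for illustration only.

```python
def f_beta_score(pred, truth, beta=1.0):
    """Soft F_beta between predicted probabilities and binary ground truth."""
    tp = sum(p * g for p, g in zip(pred, truth))
    fn = sum((1 - p) * g for p, g in zip(pred, truth))
    fp = sum(p * (1 - g) for p, g in zip(pred, truth))
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    # Assumed convention: an empty prediction of an empty target is perfect.
    return (1 + b2) * tp / denom if denom > 0 else 1.0

def f_beta_loss(pred, truth, beta=1.0):
    return 1.0 - f_beta_score(pred, truth, beta)

truth = [1, 1, 0, 0]
missed = [1, 0, 0, 0]   # one false negative
extra = [1, 1, 1, 0]    # one false positive

for beta in (1.0, 2.0):
    print(beta, f_beta_loss(missed, truth, beta), f_beta_loss(extra, truth, beta))
```

Raising $\beta$ from 1 to 2 increases the loss for the prediction with a false negative and decreases it for the one with a false positive, which is the recall-favoring asymmetry the table describes.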

3. Theoretical Guarantees and Robustness

Asymmetric losses possess several desirable properties for statistical learning:

Classification Calibration: For appropriately designed asymmetric losses (e.g., strictly and completely asymmetric), driving the excess loss risk to zero ensures vanishing excess $0$–$1$ risk.
Excess Risk Bounds: For loss $\ell$ with $\ell(0)>\ell(1)$, the excess misclassification risk is bounded by the excess asymmetric loss risk: $R_{0-1}(f)-R_{0-1}^* \leq [2/(\ell(0)-\ell(1))][R_{\ell}(f)-R_{\ell}^*]$ (Zhou et al., 2021).
Noise Tolerance: Under clean-label-dominant noise ($1-\eta_y>\max_{k\neq y}\eta_{y,k}$), any completely asymmetric loss is robust; the global minimum under the noisy risk coincides with the minimum under the clean risk (Wang et al., 2025, Zhou et al., 2021).
Variance Reduction: For piecewise-linear asymmetric losses, adding an optimal correction to the prediction not only minimizes mean asymmetric loss but also guarantees a strictly reduced variance of the loss unless the loss is symmetric (Yamaguchi et al., 2019).
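The correction step for piecewise-linear losses can be sketched as follows. The sign convention $\varepsilon = y - \hat y$ (positive values are under-predictions), the synthetic Gaussian residuals, and the helper names are assumptions; under that convention the empirical mean loss is minimized by shifting predictions by the $k_1/(k_1+k_2)$ quantile of the residuals. The sketch checks only the mean loss, not the variance inequality.

```python
import random

def piecewise_linear_loss(err, k1, k2):
    """k1 * err for err >= 0 (under-prediction), -k2 * err otherwise."""
    return k1 * err if err >= 0 else -k2 * err

def mean_loss(residuals, shift, k1, k2):
    """Mean loss after adding a constant shift to every prediction."""
    return sum(piecewise_linear_loss(r - shift, k1, k2) for r in residuals) / len(residuals)

def empirical_quantile(values, q):
    """Order statistic at position floor(q * n), clipped to the last index."""
    ordered = sorted(values)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]

random.seed(0)
residuals = [random.gauss(0.0, 1.0) for _ in range(2000)]  # synthetic y - y_hat
k1, k2 = 3.0, 1.0                                          # under-prediction costs 3x more
shift = empirical_quantile(residuals, k1 / (k1 + k2))

print(shift)
print(mean_loss(residuals, 0.0, k1, k2), mean_loss(residuals, shift, k1, k2))
```

Because under-prediction is three times as costly here, the correction is positive: predictions are pushed upward until roughly 75% of the residuals fall below the shift.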

4. Asymmetric Loss in Multilabel and Imbalanced Classification

In settings where positive instances are sparse relative to negatives (e.g., multilabel classification, long-tailed data), symmetric losses let the many negatives dominate the gradient, so positives are fitted poorly. ASL uses separate focusing exponents for positives and negatives ($\gamma_+$ and $\gamma_-$) and introduces a hard probability threshold $t$ below which easy negatives are discarded entirely, preserving the scarce positive signal and enhancing performance.
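The effect of the threshold can be seen in a scalar sketch of the per-label ASL term, compared with plain binary cross-entropy. The default values mirror the MS-COCO setting quoted in this section; the batch of one positive and 1000 easy negatives is a made-up example.

```python
import math

def asl_term(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Per-label ASL for predicted probability p and binary label y."""
    if y == 1:
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    p_shift = max(p - margin, 0.0)  # negatives with p <= margin are zeroed out
    return -(p_shift ** gamma_neg) * math.log(max(1.0 - p_shift, eps))

def bce_term(p, y, eps=1e-8):
    """Per-label binary cross-entropy."""
    return -math.log(max(p if y == 1 else 1.0 - p, eps))

# Hypothetical batch: one positive at p = 0.6 and 1000 easy negatives at p = 0.03.
bce_pos = bce_term(0.6, 1)
bce_neg_total = sum(bce_term(0.03, 0) for _ in range(1000))
asl_pos = asl_term(0.6, 1)
asl_neg_total = sum(asl_term(0.03, 0) for _ in range(1000))

print(bce_pos, bce_neg_total)  # negatives dominate the BCE total
print(asl_pos, asl_neg_total)  # easy negatives contribute nothing under ASL
print(asl_term(0.9, 0))        # a hard negative is still penalized
```

In this toy batch the easy negatives account for most of the cross-entropy total, while under ASL they contribute exactly zero and only the positive and the hard negatives produce a training signal.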

Empirical performance comparisons:

On MS-COCO, default ASL ($\gamma_+=0,\gamma_-=4,t=0.05$) yields an absolute mAP improvement over focal loss: $86.6\%$ vs. $85.1\%$ (Ben-Baruch et al., 2020).
On Open Images, macro mAP increases from $92.2\%$ (focal) to $92.8\%$ (ASL).
For medical long-tailed multi-label (CXR-LT), robust polynomial ASL with Hill regularization further improves mAP, mAUC, and F1 beyond BCE, focal, and plain ASL (Park et al., 2023).

The Hill loss regularization in robust ASL (RAL) caps gradients on hard negatives, preventing hyperparameter sensitivity/instability with polynomial-based asymmetric losses (Park et al., 2023).

5. Asymmetric Loss in Regression and Spatial Prediction

For positive-valued targets, standard symmetric error metrics do not match the natural multiplicative structure of errors or the cost structure in applications. The power-divergence family generalizes classical divergences with a tunable $\lambda$ to encode cost asymmetry between under- and over-prediction:

$\lambda=1$: symmetric penalization (Pearson $\chi^2$).
$\lambda<1$: over-prediction penalized more.
$\lambda>1$: under-prediction penalized more (Pearse et al., 2024).

Optimal estimators under $L_{PDL,\lambda}$ often take the form of power means of the posterior target. Prediction intervals defined by the asymmetric loss can be computed analytically (for some $\lambda$) or numerically, providing tailored uncertainty quantification that matches the loss structure.
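The power-mean form follows from setting the derivative of the expected loss to zero: for $\lambda\notin\{0,-1\}$, $\partial L_{PDL,\lambda}/\partial\delta = [1-(Y/\delta)^{\lambda+1}]/(\lambda+1)$, so the stationary point satisfies $\delta^{\lambda+1}=E[Y^{\lambda+1}]$. The sketch below checks this on a handful of hypothetical posterior draws; the sample values and function names are assumptions.

```python
def pdl(delta, y, lam):
    """Power-divergence loss for delta, y > 0 and lam not in {0, -1}."""
    return (y * ((y / delta) ** lam - 1.0) + lam * (delta - y)) / (lam * (lam + 1.0))

def expected_pdl(delta, samples, lam):
    """Average loss of the point prediction delta over the samples."""
    return sum(pdl(delta, y, lam) for y in samples) / len(samples)

def power_mean(samples, lam):
    """(mean of y^(lam+1))^(1/(lam+1)), the stationary point derived above."""
    mean_pow = sum(y ** (lam + 1.0) for y in samples) / len(samples)
    return mean_pow ** (1.0 / (lam + 1.0))

samples = [0.5, 1.0, 2.0, 4.0]  # hypothetical posterior draws of a positive target

for lam in (-0.5, 1.0, 2.0):
    best = power_mean(samples, lam)
    print(lam, best, expected_pdl(best, samples, lam))
```

The estimate grows with $\lambda$, consistent with larger $\lambda$ penalizing under-prediction more heavily.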

A quantitative measure $A(f)=L((1-f)Y,Y)/L((1-f)^{-1}Y,Y)$ expressed directly in terms of the loss enables selection and interpretation of the asymmetry magnitude for practical decision costs (Pearse et al., 2024).
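As a worked check of this measure for the power-divergence loss, the sketch below evaluates $A(f)$ at $f=0.5$ for three values of $\lambda$. The function names and the choice $Y=1$ are illustrative assumptions.

```python
def pdl(delta, y, lam):
    """Power-divergence loss for delta, y > 0 and lam not in {0, -1}."""
    return (y * ((y / delta) ** lam - 1.0) + lam * (delta - y)) / (lam * (lam + 1.0))

def asymmetry(f, lam, y=1.0):
    """A(f): loss of under-predicting by the factor (1 - f) relative to
    the loss of over-predicting by the factor 1 / (1 - f)."""
    under = pdl((1.0 - f) * y, y, lam)
    over = pdl(y / (1.0 - f), y, lam)
    return under / over

for lam in (0.5, 1.0, 2.0):
    print(lam, asymmetry(0.5, lam))
```

Values below 1 mean over-prediction costs more, values above 1 mean under-prediction costs more, and $\lambda=1$ gives exactly 1, in line with the three cases listed in this section.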

6. Algorithmic and Implementation Considerations

The majority of asymmetric losses admit efficient implementation within standard autodiff frameworks. For example, ASL for multilabel tasks can be written in PyTorch as follows:

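import torch

def asymmetric_loss(logits, targets, gamma_pos=0, gamma_neg=4, margin=0.05, eps=1e-8):
    # targets: float tensor of 0/1 labels with the same shape as logits
    p = torch.sigmoid(logits)
    # Positive term: focal-style down-weighting with exponent gamma_pos
    pos_loss = -((1 - p).pow(gamma_pos) * torch.log(p.clamp(min=eps))) * targets
    # Negative term: shift by the margin so negatives with p <= margin contribute zero
    p_shift = torch.clamp(p - margin, min=0.0)
    neg_loss = -(p_shift.pow(gamma_neg) * torch.log((1 - p_shift).clamp(min=eps))) * (1 - targets)
    return (pos_loss + neg_loss).mean()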

(Ben-Baruch et al., 2020)

Polynomial-based losses and their regularized variants (e.g., RAL) generalize this form, adding negligible computation overhead for typical polynomial degrees ($M,N\leq 3$) (Park et al., 2023).

The power-divergence estimator under a hierarchical spatial model uses posterior calculation of powers or logs; prediction intervals are constructed via quantile calculation or numerical root-finding (Pearse et al., 2024).

7. Applications, Extensions, and Empirical Results

Imbalanced medical segmentation: Asymmetric Tversky/$F_\beta$ losses improve recall with little or no loss of precision; in MS lesion segmentation, the asymmetric $F_\beta$ loss increased recall by approximately $4\%$ and improved lesion-wise true positive rate over Dice and focal losses (Hashemi et al., 2018).
Learning with noisy labels: Asymmetric losses have been developed (AGCE, AUL, AMSE) and shown to provide enhanced robustness to both symmetric and class-conditional noise, outperforming symmetric counterparts on synthetic high-noise and real-world (WebVision, Clothing1M) benchmarks (Wang et al., 2025, Zhou et al., 2021).
Decision-theoretic corrections: Piecewise-linear asymmetric losses with explicit bias adjustment simultaneously minimize both mean and variance of downstream risk, providing a robust correction layer atop arbitrary predictors (Yamaguchi et al., 2019).
Spatial statistics: For positive-valued spatial prediction, asymmetric losses allow cost-aware tuning (e.g., via $\lambda$ in PDL) to optimize credible interval width and bias–variance balance in spatial interpolation problems (Pearse et al., 2024).

Empirical evidence consistently demonstrates that increasing the asymmetry ratio aligns the loss with the preferred error structure (e.g., favoring recall or class robustness) and enhances performance under practical dataset imperfections (Zhou et al., 2021, Wang et al., 2025).

References:

(Pearse et al., 2024) "Optimal prediction of positive-valued spatial processes: asymmetric power-divergence loss"
(Ben-Baruch et al., 2020) "Asymmetric Loss For Multi-Label Classification"
(Yamaguchi et al., 2019) "Minimizing the expected value of the asymmetric loss and an inequality of the variance of the loss"
(Park et al., 2023) "Robust Asymmetric Loss for Multi-Label Long-Tailed Learning"
(Hashemi et al., 2018) "Asymmetric Loss Functions and Deep Densely Connected Networks for Highly Imbalanced Medical Image Segmentation..."
(Wang et al., 2025) "Joint Asymmetric Loss for Learning with Noisy Labels"
(Zhou et al., 2021) "Asymmetric Loss Functions for Learning with Noisy Labels"