Noise-Tolerant Clipped MAE Loss

Updated 23 June 2026

The paper introduces noise-tolerant clipped MAE loss by bounding individual losses to reduce the impact of mislabeled or outlier instances.
It leverages symmetry and asymmetry conditions to theoretically guarantee robustness against both uniform and structured label noise.
Empirical results on datasets like MNIST and CIFAR-10 show enhanced classification performance under high noise conditions.

Noise-tolerant clipped mean absolute error (MAE) loss refers to a family of loss functions for multi-class learning that combine the statistical properties of MAE with explicit upper bounding ("clipping") or parametric smoothing, yielding both theoretical and empirical robustness to label noise. These losses are structurally designed to mitigate the influence of mislabeled or outlier instances in deep neural network optimization.

1. Mathematical Foundations and Symmetry

Central to the theoretical robustness of clipped MAE and its generalizations is the symmetry condition for loss functions. A loss $\ell(z, y)$ is called symmetric if, for all predictions $z \in \mathbb{R}^C$,

$\sum_{y=1}^{C} \ell(z, y) = \text{constant}$

This property ensures that, under symmetric label noise, minimization with respect to corrupted labels yields the same set of minimizers as the clean-label risk. For multi-class tasks, the standard MAE loss,

$\ell_{MAE}(z,y) = 1 - \mathrm{softmax}(z)_y,$

is symmetric by construction, and this also holds for any convex combination with other symmetric losses such as the multi-class unhinged loss $L_{\mathrm{unh}}(z,y) = -z_y + \frac{1}{C}\sum_{k=1}^C z_k$ (Paquin et al., 19 May 2026, Ghosh et al., 2017).
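The symmetry condition can be checked numerically. The following is a minimal sketch in plain Python, assuming 0-based labels and arbitrary illustrative logit vectors; the helper names (`softmax`, `mae_loss`, `unhinged_loss`, `sum_over_labels`) are hypothetical and not taken from the cited papers. It sums each loss over all candidate labels and shows that the total does not depend on $z$: it equals $C - 1$ for the MAE form above and $0$ for the unhinged loss.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def mae_loss(z, y):
    """MAE loss in the form 1 - softmax(z)_y (y is 0-based in this sketch)."""
    return 1.0 - softmax(z)[y]


def unhinged_loss(z, y):
    """Multi-class unhinged loss: -z_y + (1/C) * sum_k z_k."""
    return -z[y] + sum(z) / len(z)


def sum_over_labels(loss, z):
    """Sum of loss(z, y) over every candidate label y."""
    return sum(loss(z, y) for y in range(len(z)))


# Arbitrary logit vectors with C = 4 classes (illustrative values only).
logit_vectors = [
    [0.0, 0.0, 0.0, 0.0],
    [2.0, -1.0, 0.5, 3.0],
    [-5.0, 10.0, 0.1, 0.2],
]
mae_sums = [sum_over_labels(mae_loss, z) for z in logit_vectors]
unhinged_sums = [sum_over_labels(unhinged_loss, z) for z in logit_vectors]
```

For every logit vector with $C = 4$, `mae_sums` stays at $3$ and `unhinged_sums` at $0$, up to floating-point error.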

Noise-tolerant clipped MAE is further grounded in the theory of asymmetric loss functions (Zhou et al., 2021), which broadens analysis to noisy regimes beyond uniform flip models by exploiting the property that the loss penalizes deviations from the Bayes-optimal label more than other classes, formalized via the asymmetry ratio.

2. Definition and Formulation of Clipped MAE and Extensions

The prototypical clipped MAE (cMAE) loss is a direct modification of the standard MAE, upper-bounded by a threshold parameter $\tau \in [0,2]$: $L_{\mathrm{cMAE}}(u, y) = \min\{\tau, 2 - 2u_y\}$, where $u = \mathrm{softmax}(z)$ and $u_y$ denotes the predicted probability of the labeled class. The term $2 - 2u_y$ is the $\ell_1$ distance between $u$ and the one-hot label, that is, twice the rescaled form $1 - u_y$ used in Section 1. The truncation bounds the per-sample loss, and thus the influence of samples with low predicted probability, effectively discarding extremely hard (potentially noisy) points.
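The definition translates directly into code. The sketch below is a plain-Python illustration, assuming 0-based labels; the function name `clipped_mae` and the example logits are hypothetical choices made for this example.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def clipped_mae(z, y, tau=1.0):
    """Clipped MAE: min(tau, 2 - 2 * u_y) with u = softmax(z)."""
    if not 0.0 <= tau <= 2.0:
        raise ValueError("tau must lie in [0, 2]")
    u = softmax(z)
    return min(tau, 2.0 - 2.0 * u[y])


# The model strongly favours class 0 (illustrative logits).
logits = [4.0, 0.0, 0.0]
loss_agreeing_label = clipped_mae(logits, y=0)               # small, not clipped
loss_conflicting_label = clipped_mae(logits, y=1)            # capped at tau = 1.0
loss_conflicting_unclipped = clipped_mae(logits, y=1, tau=2.0)  # close to 2
```

A label that conflicts with a confident prediction, as a mislabeled sample often does late in training, contributes at most $\tau$ to the objective, whereas the unclipped loss ($\tau = 2$) would charge it close to the maximum of $2$.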

An important generalization is the $\alpha$-MAE loss, which parameterizes a smooth trade-off between the linear unhinged loss and the (rescaled) MAE through a single parameter $\alpha$, with $C$ the number of classes. As $\alpha$ increases, the loss becomes more saturating and less sensitive to outliers; at the saturating end of its range it reduces to the bounded, symmetric MAE form (Paquin et al., 19 May 2026).

Another robust variant, IMAE, replaces MAE's low-variance implicit weighting of per-example gradients with a higher-variance weighting, which improves the network's ability to fit clean examples while retaining noise robustness (Wang et al., 2019).

3. Theoretical Robustness to Label Noise

Clipped MAE inherits robustness to several structured label noise processes via symmetry and/or asymmetry arguments:

Symmetric (Uniform Flip) Noise: If the noise rate satisfies $\eta < \frac{C-1}{C}$, the minimizers of the clipped MAE risk coincide with those of the clean risk; empirical observations confirm minimal accuracy degradation up to high noise rates (Ghosh et al., 2017, Paquin et al., 19 May 2026).
Class-conditional and Non-uniform Noise: For label-flip matrices that are diagonally dominant (the probability of retaining the true label exceeds the maximum probability of flipping to any other class), clipped MAE remains robust as long as it remains strictly decreasing and bounded in each class-probability coordinate (Zhou et al., 2021).
Asymmetric Loss Guarantee: Provided the asymmetry ratio of the loss satisfies the bound set by how strongly clean labels dominate the noisy label distribution, clipped MAE with a suitably chosen $\tau$ satisfies calibration and noise-tolerance guarantees (Zhou et al., 2021).
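The symmetric-noise argument rests on a short identity: for a symmetric loss with $\sum_{k} \ell(z, k) = K$, the expected loss under uniform label flips with rate $\eta$ is $\left(1 - \frac{\eta C}{C-1}\right)\ell(z, y) + \frac{\eta K}{C-1}$, an affine function of the clean loss whose slope is positive exactly when $\eta < \frac{C-1}{C}$. The sketch below checks this numerically in plain Python. It uses the unclipped form $1 - u_y$, for which the symmetry condition holds exactly with $K = C - 1$; the function names, logits, and noise rate are illustrative assumptions.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def mae_loss(z, y):
    """Symmetric MAE form: 1 - softmax(z)_y (0-based label)."""
    return 1.0 - softmax(z)[y]


def expected_noisy_loss(z, y, eta):
    """Expected loss when label y is kept with probability 1 - eta and
    flipped uniformly to each of the other C - 1 classes otherwise."""
    num_classes = len(z)
    kept = (1.0 - eta) * mae_loss(z, y)
    flipped = sum(mae_loss(z, k) for k in range(num_classes) if k != y)
    return kept + eta / (num_classes - 1) * flipped


def affine_form(z, y, eta):
    """Closed form: slope * clean_loss + offset, using K = C - 1 for MAE."""
    num_classes = len(z)
    k_const = num_classes - 1.0
    slope = 1.0 - eta * num_classes / (num_classes - 1)
    return slope * mae_loss(z, y) + eta * k_const / (num_classes - 1)


logits = [1.5, -0.3, 0.8]   # C = 3, illustrative values
eta = 0.4                   # below the threshold (C - 1) / C = 2/3
direct = [expected_noisy_loss(logits, y, eta) for y in range(3)]
closed = [affine_form(logits, y, eta) for y in range(3)]
slope = 1.0 - eta * 3 / 2   # positive, so the ordering of losses is preserved
```

Because the slope is positive, any prediction that lowers the clean loss also lowers the expected noisy loss, which is why the two risks share their minimizers in the symmetric case.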

4. Gradient Behavior and Optimization Properties

A central practical consideration is gradient behavior. For clipped MAE and $\alpha$-MAE:

The maximum gradient norm of $\alpha$-MAE is explicitly bounded, avoiding exploding gradients even for extreme input scores (Paquin et al., 19 May 2026).

For standard MAE, the per-example gradient magnitude with respect to the logits is proportional to $u_y(1 - u_y)$; it peaks at $u_y = 1/2$ but has low variance across examples, causing underfitting of the clean subset when noise is high (Wang et al., 2019).
IMAE exponentiates this gradient weighting, with a scale hyperparameter $T$, increasing fitting ability without sacrificing MAE's emphasis on uncertain (potentially clean) points.

Gradient clipping is often unnecessary because the bound is built into the loss itself, but it can be applied for additional stability (Paquin et al., 19 May 2026).
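The gradient behavior of the loss $2 - 2u_y$ can be made concrete. Differentiating through the softmax gives $\partial L / \partial z_k = -2u_y(\mathbb{1}[k = y] - u_k)$, whose $\ell_1$ norm is $4u_y(1 - u_y)$: bounded by $1$ and largest at $u_y = 1/2$. The plain-Python sketch below evaluates this for a two-class example; the function names are hypothetical, and treating the tie $2 - 2u_y = \tau$ as clipped is a convention chosen for this sketch.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def mae_grad(z, y):
    """Gradient of 2 - 2*u_y w.r.t. the logits: -2*u_y*(1[k == y] - u_k)."""
    u = softmax(z)
    return [-2.0 * u[y] * ((1.0 if k == y else 0.0) - u[k])
            for k in range(len(z))]


def clipped_mae_grad(z, y, tau=1.0):
    """Gradient of min(tau, 2 - 2*u_y); zero wherever the clip is active.
    The tie case (loss == tau) is treated as clipped in this sketch."""
    u = softmax(z)
    if 2.0 - 2.0 * u[y] >= tau:
        return [0.0] * len(z)
    return mae_grad(z, y)


def l1_norm(v):
    return sum(abs(x) for x in v)


# Two classes with logits [s, 0] and label 0, so u_y = sigmoid(s).
norms = {s: l1_norm(mae_grad([s, 0.0], 0)) for s in (-6.0, -2.0, 0.0, 2.0, 6.0)}
clipped_grad_hard_sample = clipped_mae_grad([-6.0, 0.0], 0, tau=1.0)
```

The norm peaks at $s = 0$ (where $u_y = 1/2$) and decays on both sides, while a sample deep in the clipped region, such as $s = -6$, receives a zero gradient and so stops influencing the update.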

5. Empirical Results and Benchmark Performance

Empirical studies on synthetic and real-world noisy-label benchmarks substantiate the effectiveness of noise-tolerant clipped MAE:

Dataset	Noise Rate η	CE	GCE	SCE	Clipped-MAE (τ=1.0)	α-MAE*	IMAE (T=8)
MNIST	0.8	22.7%	33.9%	48.8%	96.7%	—	—
CIFAR-10	0.8	19%	27%	—	~53%	62.1%	~82%
WebVision (ImageNet-1k)	—	67.0%	—	—	69.4%	—	—
CIFAR-10 (clean 40%)	0.4	63%	—	—	—	—	82%

*α-MAE achieves 62.1% on CIFAR-10 at 80% symmetric noise and 56.4% on CIFAR-100 at 60% noise, outperforming SCE, CE, and other robust baselines (Paquin et al., 19 May 2026, Zhou et al., 2021, Wang et al., 2019).

Clipped MAE and its variants maintain high test accuracy and a clear separation of feature clusters even at extreme noise levels, whereas standard losses degrade severely (Zhou et al., 2021, Wang et al., 2019). IMAE further improves fitting of the clean subset (from ~74% with MAE to ~93% with IMAE(8) on noisy CIFAR-10) (Wang et al., 2019).

6. Implementation Details and Hyperparameterization

Practical implementation is straightforward:

For clipped MAE: $L_{\mathrm{cMAE}}(u, y) = \min\{\tau, 2 - 2u_y\}$ for $\tau \in [0,2]$; the benchmark results above use $\tau = 1.0$.
For $\alpha$-MAE, interpolate between the unhinged and MAE components according to observed under- or overfitting; a grid search on a validation split determines the optimal $\alpha$ (Paquin et al., 19 May 2026).
For IMAE, choose $T$ separately for noisy and for clean data (the noisy-label results above use $T = 8$); the IMAE weight is applied to the gradient as a detached scaling factor that controls the weighting variance (Wang et al., 2019).

Optimization employs standard SGD with momentum; learning-rate decay and regularization follow standard cross-entropy training. Batch size is dataset-dependent (Zhou et al., 2021, Wang et al., 2019, Paquin et al., 19 May 2026).
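To show how the pieces fit together, here is a toy sketch of full-batch SGD with classical momentum on a bias-free linear softmax classifier trained with clipped MAE. Everything in it is an assumption made for illustration: the five-point dataset (the last point is deliberately mislabeled), the hyperparameter values, and the helper names `train` and `predict`. It is not the training setup of any cited paper.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def clipped_mae_and_grad(z, y, tau):
    """Return min(tau, 2 - 2*u_y) and its gradient w.r.t. the logits."""
    u = softmax(z)
    raw = 2.0 - 2.0 * u[y]
    if raw >= tau:  # clipped region: constant loss, zero gradient
        return tau, [0.0] * len(z)
    grad = [-2.0 * u[y] * ((1.0 if k == y else 0.0) - u[k])
            for k in range(len(z))]
    return raw, grad


def logits_of(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def predict(W, x):
    z = logits_of(W, x)
    return z.index(max(z))


def train(data, num_classes, tau=1.5, lr=0.5, momentum=0.9, epochs=50):
    # tau is set above 1.0 because zero-initialised weights with two classes
    # put every sample at loss exactly 1.0; with the tie convention used in
    # this sketch, tau = 1.0 would zero out every gradient at the first step.
    dim = len(data[0][0])
    W = [[0.0] * dim for _ in range(num_classes)]
    V = [[0.0] * dim for _ in range(num_classes)]  # momentum buffer
    history = []
    for _ in range(epochs):
        grad_W = [[0.0] * dim for _ in range(num_classes)]
        total = 0.0
        for x, y in data:
            loss, g = clipped_mae_and_grad(logits_of(W, x), y, tau)
            total += loss
            for k in range(num_classes):
                for d in range(dim):
                    grad_W[k][d] += g[k] * x[d] / len(data)
        for k in range(num_classes):
            for d in range(dim):
                V[k][d] = momentum * V[k][d] - lr * grad_W[k][d]
                W[k][d] += V[k][d]
        history.append(total / len(data))  # mean loss before this update
    return W, history


# Hypothetical toy data; the last point carries a flipped label.
data = [([1.0, 0.0], 0), ([0.9, 0.1], 0), ([0.0, 1.0], 1),
        ([0.1, 0.9], 1), ([1.0, 0.1], 1)]
W, history = train(data, num_classes=2)
```

In this toy run the mean loss falls from its initial value of $1.0$ and the two clean prototypes `[1.0, 0.0]` and `[0.0, 1.0]` are classified as classes 0 and 1 respectively, even though a mislabeled point sits next to the first one.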

7. Significance and Context within Robust Learning

Noise-tolerant clipped MAE and its parameterized extensions represent a consistent, theoretically principled approach to robust risk minimization under label noise. The explicit bounding of the per-sample loss and careful control of gradient weighting variance avoid both the underfitting inherent to vanilla MAE and the overfitting or instability of unbounded convex losses. These strategies have been validated across synthetic and real-world benchmarks, outperforming or matching other recent robust-loss alternatives (e.g., GCE, SCE, Focal Loss, NCE+RCE) (Zhou et al., 2021, Wang et al., 2019, Ghosh et al., 2017).

The adoption of symmetry and asymmetry theory for the analysis of noise robustness provides a unified explanation for the empirical effectiveness of these losses, as well as guarantees for classification calibration and excess risk bounds under broad noise models. The single-hyperparameter design (e.g., $\tau$, $\alpha$, or $T$), with robust validation heuristics, simplifies practical application and tuning.

Clipped MAE, $\alpha$-MAE, and IMAE are part of a new generation of principled noise-robust surrogates for cross-entropy in deep learning, with ongoing research exploring further parameterizations and adaptive mechanisms for noise-tolerant risk minimization (Paquin et al., 19 May 2026, Zhou et al., 2021, Wang et al., 2019, Ghosh et al., 2017).