
Mean Absolute Error (MAE)

Updated 14 May 2026
  • Mean Absolute Error (MAE) is defined as the average absolute difference between predicted and true values, offering a linear penalty that is directly interpretable.
  • MAE is valued for its robustness to outliers and its applicability in regression, classification, and hardware verification, although its uniform weighting may lead to underfitting in clean data.
  • Enhancements like Improved MAE (IMAE) and the use of additional metrics (e.g., RMSD, error histograms) are recommended to address MAE's limitations with non-Gaussian error distributions.

Mean Absolute Error (MAE) is a fundamental metric in statistics and machine learning, widely employed for quantifying the average magnitude of prediction errors in regression and classification models. It is defined as the mean of the absolute differences between predicted and true values, with broad utility as both an objective function for model training and a benchmark for model evaluation. MAE's linear penalty on errors, invariance to error direction, and direct interpretability in the units of the predicted variable have made it a default choice in numerous research domains. However, its statistical properties, robustness to outliers, implications for optimization, and known shortcomings under non-Gaussian error distributions require rigorous assessment, especially for applications in deep learning and scientific modeling.

1. Mathematical Formulation and Properties

MAE is formally defined for a set of predictions $\{\hat{y}_i\}_{i=1}^N$ and ground-truth values $\{y_i\}_{i=1}^N$ by

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |y_i - \hat{y}_i|$$

For vector-valued prediction tasks, the definition generalizes to

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \|y_i - \hat{y}_i\|_1$$

where $\|\cdot\|_1$ denotes the $L_1$ norm (Qi et al., 2020, Qi et al., 2020, Pernot et al., 2020, Ramprasath et al., 2024).
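The two definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `mae_scalar` and `mae_vector` and the sample values are assumptions made for the example.

```python
from typing import Sequence


def mae_scalar(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean of |y_i - y_hat_i| over N scalar samples."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mae_vector(y_true: Sequence[Sequence[float]],
               y_pred: Sequence[Sequence[float]]) -> float:
    """Mean over N samples of the L1 norm ||y_i - y_hat_i||_1."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for t_vec, p_vec in zip(y_true, y_pred):
        # L1 norm of the per-sample difference vector
        total += sum(abs(t - p) for t, p in zip(t_vec, p_vec))
    return total / len(y_true)


# Illustrative data (hypothetical values)
scalar_result = mae_scalar([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
vector_result = mae_vector([[1.0, 2.0], [3.0, 4.0]], [[1.5, 2.0], [2.0, 6.0]])
print(scalar_result)  # 0.5
print(vector_result)  # 1.75
```

Note that the vector version sums absolute differences within each sample before averaging over samples, as in the formula above. Averaging over components as well would divide the result by the output dimension.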

The MAE is 1-Lipschitz in both arguments under the $L_1$ metric: for vectors $x_1, x_2, x$,

$$\bigl|\|x_1-x\|_1 - \|x_2-x\|_1\bigr| \leq \|x_1-x_2\|_1$$

In regression, this guarantees that perturbations in predictions cannot produce disproportionate changes in the error metric. In classification with probabilistic outputs (post-softmax), the MAE of probability vectors versus one-hot targets simplifies to $2(1 - p(y_i \mid x_i))$ per sample (Wang et al., 2019).
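Both properties can be checked numerically. The sketch below uses hypothetical logits and vectors chosen only for illustration. It verifies that the $L_1$ distance between a softmax output and a one-hot target equals $2(1 - p(y \mid x))$, and that the Lipschitz inequality holds for one example triple.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def l1_distance(a, b):
    """L1 norm of the difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


# Classification identity: ||p - e_y||_1 = 2 * (1 - p_y)
logits = [2.0, 0.5, -1.0]  # hypothetical network outputs
label = 0
probs = softmax(logits)
mae_per_sample = l1_distance(probs, one_hot(label, len(probs)))
closed_form = 2.0 * (1.0 - probs[label])

# 1-Lipschitz check: | ||x1 - x||_1 - ||x2 - x||_1 | <= ||x1 - x2||_1
x1 = [1.0, -2.0, 0.5]
x2 = [0.0, 1.0, 0.5]
x = [0.5, 0.5, 0.5]
gap = abs(l1_distance(x1, x) - l1_distance(x2, x))
bound = l1_distance(x1, x2)
print(gap, bound)  # 2.0 4.0
```

The identity follows because the non-target probabilities sum to $1 - p(y \mid x)$, which is also the absolute deviation on the target component.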

2. Statistical Interpretation and Robustness

MAE corresponds, up to scale and an additive constant, to the negative log-likelihood of a Laplace (double-exponential) error model with scale $b > 0$:

$$p(y_i \mid \hat{y}_i) = \frac{1}{2b}\exp\!\left(-\frac{|y_i - \hat{y}_i|}{b}\right)$$

Minimizing MAE thus assumes residuals with heavier tails than the Gaussian model implied by mean squared error (MSE) minimization (Qi et al., 2020). As a result, MAE is less sensitive to large outliers than MSE. Under a zero-mean normal model with standard deviation $\sigma$, $\mathrm{MAE} = \sigma\sqrt{2/\pi} \approx 0.80\,\sigma$, but when error distributions are heavy-tailed or non-normal, this relationship and the standard probabilistic coverage do not hold. MAE does not provide explicit control over error dispersion or risk of large deviations under such conditions (Pernot et al., 2020).
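A short numerical sketch of both points follows. It assumes the standard Laplace density with scale `b` (the symbol `b`, the helper names, and all data values are illustrative). It checks that the Laplace negative log-likelihood equals $N(\log 2b + \mathrm{MAE}/b)$, so minimizing one minimizes the other for fixed $b$. It also compares the MAE of a constant predictor placed at the median and at the mean of a sample containing one outlier.

```python
import math
from statistics import mean, median


def laplace_nll(residuals, b):
    """Negative log-likelihood of residuals under a zero-centered Laplace(b)."""
    return sum(math.log(2.0 * b) + abs(r) / b for r in residuals)


def mae_of_constant(values, c):
    """MAE incurred by predicting the constant c for every value."""
    return sum(abs(v - c) for v in values) / len(values)


# Part 1: NLL is an affine function of MAE for fixed scale b
residuals = [0.5, -1.0, 0.25, 2.0]  # hypothetical residuals
b = 0.8                             # assumed Laplace scale
n = len(residuals)
mae = sum(abs(r) for r in residuals) / n
nll_direct = laplace_nll(residuals, b)
nll_closed_form = n * (math.log(2.0 * b) + mae / b)

# Part 2: the median, not the mean, minimizes MAE for a constant predictor
data = [1.0, 2.0, 3.0, 4.0, 100.0]  # one large outlier
mae_at_median = mae_of_constant(data, median(data))
mae_at_mean = mae_of_constant(data, mean(data))
print(mae_at_median < mae_at_mean)  # True
```

The outlier pulls the mean to 22 while the median stays at 3, which is the sense in which an MAE-optimal fit is less affected by large deviations than an MSE-optimal one.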

3. MAE as a Loss Function in Deep Learning

MAE is employed directly as a training criterion in both regression and classification networks. In DNN-based regression (vector-to-vector), optimizing MAE offers bounded generalization error due to its Lipschitz property, admitting explicit bounds via Rademacher complexity arguments (Qi et al., 2020, Qi et al., 2020). MAE also supports robust generalization bounds under input noise, with a constant that depends on network gradient norms.

In deep classification, MAE is empirically noise-robust owing to how it weights examples. The per-example gradient magnitude with respect to the softmax logits is proportional to $p(y_i \mid x_i)\,(1 - p(y_i \mid x_i))$, peaking at $p(y_i \mid x_i) = 0.5$ and vanishing as $p(y_i \mid x_i) \to 0$ or $p(y_i \mid x_i) \to 1$. This means MAE downweights samples whose given label receives very low probability (which are likely mislabeled under label noise), as well as samples already fit with high confidence, and focuses learning on medium-confidence cases (Wang et al., 2019). The symmetric structure of MAE, $\sum_{j=1}^{C} 2\,(1 - p(j \mid x_i)) = 2(C - 1)$ for $C$ classes, formally underpins its robustness to uniform label noise.

However, MAE's weighting variance is low. The variance of the weight $4\,p(y_i \mid x_i)\,(1 - p(y_i \mid x_i))$ over $p(y_i \mid x_i) \in [0, 1]$ is small (approximately 0.09), so gradients across samples remain nearly uniform. In practice, this can cause severe underfitting of clean data, because informative samples cannot prevail over noisy or uninformative counterparts. Improved MAE (IMAE) addresses this by exponentially rescaling the gradient magnitude, thus amplifying weighting variance and restoring fitting capacity without sacrificing noise robustness (Wang et al., 2019).
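The quoted variance can be reproduced with a small calculation. The sketch rests on two assumptions that are not spelled out in the text: the weight is taken to be $4p(1-p)$, which is the $L_1$ norm of the gradient of $2(1-p)$ with respect to the softmax logits, and $p$ is treated as uniformly distributed on $[0, 1]$. Under those assumptions the exact variance is $4/45 \approx 0.089$.

```python
def mae_weight(p: float) -> float:
    """Assumed per-example gradient weight of MAE: 4 * p * (1 - p)."""
    return 4.0 * p * (1.0 - p)


# Midpoint-rule approximation of the mean and variance over p ~ Uniform[0, 1]
n = 100_000
ps = [(i + 0.5) / n for i in range(n)]
ws = [mae_weight(p) for p in ps]

mean_w = sum(ws) / n
var_w = sum((w - mean_w) ** 2 for w in ws) / n

print(round(mean_w, 4))  # 0.6667
print(round(var_w, 4))   # 0.0889
```

The weight peaks at 1.0 for $p = 0.5$ and falls to 0 at both ends, but most of the interval receives a weight between roughly 0.5 and 1.0, which is why the spread is narrow.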

4. Performance Bounds and Error Decomposition

In DNN-based regression, the generalization error under MAE decomposes into three standard terms (Qi et al., 2020):

  • Approximation error: the minimum achievable by the hypothesis class
  • Estimation error: deviation between empirical and population risk, bounded in terms of Rademacher complexity and decaying as $O(1/\sqrt{N})$ with dataset size $N$
  • Optimization error: excess risk from imperfect minimization, further bounded under smoothness and Polyak–Łojasiewicz conditions

Explicit upper bounds for MAE in vector-to-vector regression with DNNs have been derived in terms of the output dimension, the last-layer width, the weight and input norms, and optimization constants.

Empirical studies confirm that these bounds are tight in practice, with model capacity predominantly affecting approximation error and sample size affecting estimation error. Over-parameterization helps decrease optimization error (Qi et al., 2020, Qi et al., 2020).

5. Limitations of MAE under Non-Normal Error Distributions

MAE presupposes a Laplacian or symmetric noise model for meaningful interpretation. In the context of quantum machine learning for atomistic simulations, benchmarking of models by MAE can be misleading when error distributions are heavy-tailed or skewed (Pernot et al., 2020). For example, the SLATM-L2 kernel ridge model obtained MAE similar to more robust methods on the QM7b dataset, but with much larger root mean square deviation (RMSD) and 95th percentile error ($Q_{95}$) due to high kurtosis and non-normality.

Under non-Gaussian errors, the fraction of samples whose absolute error exceeds the MAE,

$$\frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\bigl[\,|y_i - \hat{y}_i| > \mathrm{MAE}\,\bigr]$$

varies significantly and cannot be inferred from MAE alone. Model ranking and risk assessment based purely on MAE may therefore misrepresent the tail risk and reliability.
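As a concrete illustration of how this fraction depends on the distribution, the sketch below evaluates it in closed form for two textbook cases. These are illustrative choices, not distributions analyzed in the cited work: a zero-mean normal, where $\mathrm{MAE} = \sigma\sqrt{2/\pi}$, and a zero-centered Laplace with scale $b$, where $\mathrm{MAE} = b$.

```python
import math

# Zero-mean normal with standard deviation sigma:
#   MAE = sigma * sqrt(2 / pi)
#   P(|e| > t) = erfc(t / (sigma * sqrt(2)))
sigma = 1.0
mae_normal = sigma * math.sqrt(2.0 / math.pi)
frac_normal = math.erfc(mae_normal / (sigma * math.sqrt(2.0)))

# Zero-centered Laplace with scale b:
#   MAE = b
#   P(|e| > t) = exp(-t / b)
b = 1.0
mae_laplace = b
frac_laplace = math.exp(-mae_laplace / b)

print(f"normal:  fraction above MAE = {frac_normal:.3f}")
print(f"laplace: fraction above MAE = {frac_laplace:.3f}")
```

Both fractions are independent of the scale parameter, so the gap between them (about 0.43 versus about 0.37) reflects the shape of the distribution alone.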

Recommended practices include:

  • Reporting additional metrics (RMSD, the 95th percentile of absolute errors $Q_{95}$, empirical cumulative error distribution, skewness, kurtosis)
  • Conducting normality tests (e.g., Shapiro-Wilk statistic)
  • Addressing outliers through targeted training set augmentation

This multi-metric, distribution-aware reporting produces more robust model comparisons and enhances interpretability in scenarios with non-normal error structure (Pernot et al., 2020).
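A minimal sketch of such a report is given below, using only the Python standard library. The function name `error_summary`, the nearest-rank percentile convention, and the two synthetic error sets are assumptions made for illustration. A normality test such as Shapiro-Wilk is omitted because it requires a statistics library.

```python
import math


def error_summary(errors):
    """Return MAE plus dispersion, tail, and shape statistics for signed errors."""
    n = len(errors)
    abs_sorted = sorted(abs(e) for e in errors)
    mae = sum(abs_sorted) / n
    rmsd = math.sqrt(sum(e * e for e in errors) / n)
    # Nearest-rank 95th percentile of the absolute errors
    q95 = abs_sorted[max(0, math.ceil(0.95 * n) - 1)]
    mu = sum(errors) / n
    m2 = sum((e - mu) ** 2 for e in errors) / n
    m3 = sum((e - mu) ** 3 for e in errors) / n
    m4 = sum((e - mu) ** 4 for e in errors) / n
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    excess_kurtosis = m4 / m2 ** 2 - 3.0 if m2 > 0 else 0.0
    frac_above_mae = sum(1 for a in abs_sorted if a > mae) / n
    return {"mae": mae, "rmsd": rmsd, "q95": q95, "skewness": skewness,
            "excess_kurtosis": excess_kurtosis, "frac_above_mae": frac_above_mae}


# Two synthetic error sets with identical MAE but different tails
errors_a = [1.0, -1.0] * 10                 # all errors of magnitude 1
errors_b = [0.5, -0.5] * 9 + [5.5, -5.5]    # mostly small, two large errors

summary_a = error_summary(errors_a)
summary_b = error_summary(errors_b)
for name, s in (("A", summary_a), ("B", summary_b)):
    print(name, {k: round(v, 3) for k, v in s.items()})
```

Both sets have an MAE of 1.0, yet set B has a larger RMSD and a $Q_{95}$ of 5.5 against 1.0 for set A, which is the kind of difference that MAE alone hides.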

6. Exact Computation in Hardware Verification

In circuit verification and approximate hardware design, MAE is defined over all $2^n$ input combinations of an $n$-input circuit as the average absolute output difference between exact and approximate circuits:

$$\mathrm{MAE} = \frac{1}{2^n}\sum_{e} |e|\, h(e)$$

where $h(e)$ is the number of inputs yielding output error $e$ (Ramprasath et al., 2024).

SAT-based message-passing algorithms have been developed for exact computation of MAE. The pipeline involves:

  • CNF encoding of both exact and approximate circuits, together with subtractor logic for error extraction
  • Hypergraph partitioning to decompose the CNF formula into manageable subproblems
  • Construction of a clique-tree/junction-tree across CNF partitions
  • Sum-product message passing to obtain error histograms $h(e)$ for each error $e$
  • Aggregation into probabilistic metrics (MAE, MSE, WCE) via direct summation over error magnitudes

This framework enables the tractable computation of the full error distribution for large circuits, and supports rapid queries for a variety of metrics, including MAE, with empirical runtimes suitable for nontrivial hardware validation tasks.
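The final aggregation step can be illustrated on a circuit small enough to enumerate. The sketch below is not the SAT-based message-passing method: it substitutes brute-force enumeration for the CNF, partitioning, and message-passing stages, and the approximate adder (which replaces the low-order addition with a bitwise OR) is a hypothetical example circuit. It shows how MAE, MSE, and worst-case error (WCE) follow from the error histogram by direct summation.

```python
from collections import Counter
from itertools import product

BITS = 4      # width of each operand (assumed for the example)
LOW_BITS = 2  # number of low-order bits handled approximately


def exact_add(a: int, b: int) -> int:
    return a + b


def approx_add(a: int, b: int) -> int:
    """Hypothetical approximate adder: OR the low bits, add the high bits."""
    mask = (1 << LOW_BITS) - 1
    high = ((a >> LOW_BITS) + (b >> LOW_BITS)) << LOW_BITS
    low = (a | b) & mask
    return high | low


# Error histogram: number of input pairs yielding each signed error value
histogram = Counter(
    exact_add(a, b) - approx_add(a, b)
    for a, b in product(range(1 << BITS), repeat=2)
)

total_inputs = 1 << (2 * BITS)  # 2^n input combinations, n = 2 * BITS

# Metrics obtained by summation over error magnitudes
mae = sum(abs(e) * count for e, count in histogram.items()) / total_inputs
mse = sum(e * e * count for e, count in histogram.items()) / total_inputs
wce = max(abs(e) for e in histogram)

print(sorted(histogram.items()))
print(mae, mse, wce)
```

Once the histogram is available, each metric is a single pass over its entries, so additional metrics can be queried without revisiting the circuit. Enumeration scales as $2^n$, which is the cost the SAT-based pipeline is designed to avoid.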

| Domain | MAE Definition | Key Considerations |
|---|---|---|
| Scalar regression | $\frac{1}{N}\sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert$ | Interpreted as average error |
| Vector regression | $\frac{1}{N}\sum_{i=1}^{N} \lVert y_i - \hat{y}_i \rVert_1$ | $L_1$-norm; globally 1-Lipschitz |
| Classification | $2(1 - p(y_i \mid x_i))$ (for softmax + 1-hot) | Emphasizes uncertain predictions |
| Circuit verification | $\frac{1}{2^n}\sum_{e} \lvert e \rvert\, h(e)$, with $h(e)$ the number of inputs yielding error $e$ | All possible inputs; obtained via SAT+MP |

7. Extensions and Best Practices

Variants of MAE have been developed to enhance its learning properties. In classification, Improved MAE (IMAE) modifies the gradient-magnitude weighting by applying an exponential transform, which increases variance and improves the ability to fit clean data while preserving noise robustness (Wang et al., 2019).

For robust benchmarking, researchers recommend never relying solely on MAE when error distributions are potentially non-Gaussian. Complementary reporting of RMSD, empirical percentiles, and error histograms, along with systematic bias correction and outlier analysis, is essential for meaningful performance evaluation (Pernot et al., 2020).

In summary, while MAE remains a cornerstone metric for measuring average error magnitude, it demands careful interpretation and, when necessary, augmentation with additional statistics to faithfully characterize model performance in complex, high-stakes, or non-standard regimes.
