Early-Learning Regularization (ELR)
- Early-Learning Regularization is a technique that mitigates noisy label memorization by anchoring network outputs to early, reliable predictions.
- It integrates a running average-based regularizer with standard cross-entropy loss, effectively suppressing overfitting to corrupted annotations.
- Empirical results show ELR improves performance across diverse tasks, including large-scale image classification, remote sensing, and federated learning scenarios.
Early-Learning Regularization (ELR) is an algorithmic framework for mitigating the memorization of noisy labels in overparameterized neural networks. Unlike conventional robust-loss or sample-selection approaches, ELR directly exploits the characteristic two-phase training dynamics—early fitting to clean signals followed by memorization of noise. ELR incorporates a regularizer that anchors the network’s outputs to its own early predictions via running averages, thereby suppressing overfitting to corrupted annotations without requiring explicit sample or label correction. The methodology has been validated in large-scale image classification, multi-label remote sensing, and federated learning scenarios with a variety of noise models.
1. Theoretical Foundations of Early Learning and Memorization
In high-dimensional classification tasks, neural networks trained with standard cross-entropy loss on noisy datasets display two distinct phases: an "early-learning" phase in which the model predominantly learns cleanly labeled examples, and a subsequent "memorization" phase in which it begins to fit noisy labels. This phenomenon has been observed empirically (Zhang et al., Arpit et al.) and proven theoretically, not only for deep nonlinear models but also for linear softmax classifiers trained on mixtures of Gaussians with symmetric label noise (Liu et al., 2020). Formally, during early training epochs the gradient aligns with the true separator (e.g., class means in linear mixtures), improving separability even on wrongly labeled points. As training progresses, residuals for correctly labeled examples vanish while noisy examples come to dominate the gradient, ultimately causing memorization.
Key equations illustrating this behavior include:
- Cross-entropy loss: $\mathcal{L}_{\mathrm{CE}}(\theta) = -\frac{1}{n}\sum_{i=1}^{n} \langle y_i, \log p_i \rangle$, where $p_i = \mathcal{S}(f(x_i;\theta))$ denotes the softmax output for sample $i$.
- Early learning is evidenced by gradients remaining aligned with the clean signal up to a stopping time $T$; beyond $T$, the model has capacity to fit the random labels exactly.
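In terms of the softmax probabilities, the cross-entropy gradient takes a residual form that makes these dynamics explicit (notation as in the loss above; $\nabla_\theta f$ denotes the Jacobian of the network outputs):

$$
\nabla_\theta \mathcal{L}_{\mathrm{CE}}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta f(x_i;\theta)^{\top}\,(p_i - y_i)
$$

Early in training, the residuals $p_i - y_i$ of clean examples dominate and point in a consistent direction; once those residuals vanish, the remaining gradient mass comes from mislabeled points, which drives memorization.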
2. ELR Algorithm and Objective Formulation
ELR augments standard supervised losses with a regularization term that penalizes agreement between the model’s current output and a historical, exponentially-smoothed average of its own earlier predictions. Specifically:
For each sample $i$ with features $x_i$ and (possibly noisy) one-hot label $y_i$:
- Model output: $p_i = \mathcal{S}(f(x_i;\theta))$, the softmax of the network logits
- Historical target: $t_i^{(k)} = \beta\, t_i^{(k-1)} + (1-\beta)\, p_i^{(k)}$, an exponential moving average with momentum $\beta \in (0,1)$
- Regularizer: $\log\!\big(1 - \langle t_i, p_i \rangle\big)$
- Full ELR objective (per minibatch $B$): $\mathcal{L}_{\mathrm{ELR}}(\theta) = \mathcal{L}_{\mathrm{CE}}(\theta) + \frac{\lambda}{|B|} \sum_{i \in B} \log\!\big(1 - \langle t_i, p_i \rangle\big)$
Intuitively, when the model’s prediction begins to drift toward memorizing noise, the regularizer fires, suppressing updates that overly align with the temporally-averaged early outputs—anchoring training to the “trustworthy” early fits (Liu et al., 2020, Galatolo et al., 2021).
Pseudocode for one SGD iteration:
```
for minibatch B in data:
    for i in B:
        p[i] = softmax(f(x[i]; θ))
    loss_CE  = -sum(log p[i][y[i]] for i in B)
    loss_reg = sum(log(1 - dot(p[i], t[i])) for i in B)
    L_ELR = loss_CE + (λ/|B|) * loss_reg
    θ = SGD_step(∇θ L_ELR)
    for i in B:
        t[i] = β*t[i] + (1-β)*p[i]
```
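The loop above can be made concrete as a small NumPy sketch. This computes only the forward pass of the ELR objective; the function names, toy shapes, and default hyperparameter values are illustrative, not the reference implementation:

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with max-shift for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def elr_minibatch_loss(logits, y, t, lam=3.0, beta=0.7):
    """Forward pass of the ELR objective on one minibatch.

    logits : (B, C) raw model outputs f(x; θ)
    y      : (B,)   (possibly noisy) integer labels
    t      : (B, C) running-average targets for these samples
    Returns (loss, p, t_new); t_new is the post-step EMA update.
    """
    B = logits.shape[0]
    p = softmax(logits)
    loss_ce = -np.sum(np.log(p[np.arange(B), y]))
    # As <t_i, p_i> -> 1, log(1 - <t_i, p_i>) -> -inf, so minimizing the
    # penalty pulls predictions toward the early-epoch targets t_i.
    inner = np.clip(np.sum(t * p, axis=1), 0.0, 1.0 - 1e-8)
    loss_reg = np.sum(np.log(1.0 - inner))
    loss = loss_ce + (lam / B) * loss_reg
    t_new = beta * t + (1.0 - beta) * p  # EMA update after the SGD step
    return loss, p, t_new
```

In a full training loop, `t_new` is written back to per-sample storage after each parameter update, exactly as in the pseudocode.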
3. Extensions and Integration with Training Protocols
ELR can be enhanced through semi-supervised ensembling and advanced data augmentation:
- Temporal ensembling: For each sample, the target $t_i$ is an exponential moving average of past predictions, $t_i \leftarrow \beta t_i + (1-\beta) p_i$.
- ELR+: Two networks are trained in parallel, with cross-ensembling of targets and weight averaging (“Mean Teacher”); mixup augmentation is applied via convex combinations $\tilde{x} = \ell x_i + (1-\ell) x_j$, $\tilde{y} = \ell y_i + (1-\ell) y_j$, with $\ell \sim \mathrm{Beta}(\alpha, \alpha)$.
- Hyperparameters: the momentum $\beta$ and regularization strength $\lambda$ are tuned per dataset; performance is stable across a broad range of $\beta$, with the mixup parameter $\alpha$ tuned jointly (Liu et al., 2020).
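As a minimal sketch of the mixup interpolation used by ELR+ (the function name and RNG handling are illustrative assumptions):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0, rng=None):
    # Sample the interpolation weight from Beta(alpha, alpha) and form
    # convex combinations of both the inputs and the (one-hot) labels.
    rng = rng or np.random.default_rng(0)
    l = rng.beta(alpha, alpha)
    return l * x1 + (1 - l) * x2, l * y1 + (1 - l) * y2
```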
In multi-label settings, ELR is adapted with entry-wise penalties, applying the anchoring term to each per-class probability independently rather than through a single inner product over a softmax distribution, and the total loss incorporates a confidence-weighted BCE (for selective label handling in noisy multi-label regimes) (Burgert et al., 13 Jan 2026).
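One plausible entry-wise form of the penalty, treating each sigmoid output independently, can be sketched as follows (an assumption for illustration; the exact weighting in Burgert et al. may differ):

```python
import numpy as np

def multilabel_elr_penalty(p, t, eps=1e-8):
    # p, t: (B, C) per-class sigmoid probabilities and their EMA targets.
    # Each entry is penalized on its own, rather than through a single
    # inner product over a softmax distribution.
    inner = np.clip(t * p, 0.0, 1.0 - eps)
    return np.sum(np.log(1.0 - inner))
```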
4. Empirical Results and Performance Analysis
ELR has demonstrated state-of-the-art robustness across a range of synthetic and real-world noisy datasets:
- CIFAR-10 (40% symmetric noise): CE=81.9%, GCE/SymCE=87.1%, ELR=89.2%, ELR+=91.5%
- CIFAR-100 (40% symmetric noise): CE=48.2%, SymCE=62.3%, ELR=68.3%, ELR+=74.6%
- Clothing1M (real web images, 38.5% noise): JointOptim=71.0%, DivideMix=72.2%, ELR=72.9%, ELR+=74.8%
- WebVision (mini): ELR+ achieves top-1=77.8%, top-5=91.7% (Liu et al., 2020)
Replication on CIFAR-10/100 confirms ELR outperforms plain cross-entropy by 6–10 points under noisy settings, and does not degrade performance with clean labels (Galatolo et al., 2021). On the CDON commercial dataset (10–20% real-world label noise), ELR yields 78.8% agreement (top-1) with catalog labels; when combined with Sharpness-Aware Minimization, top-1 rises to 93.4% (+14.6 points) (Galatolo et al., 2021).
In multi-label remote sensing, integrating ELR boosts mean average precision under mixed and subtractive noise by up to 4.8 points at 40% corruption rates; additive noise is less affected (Burgert et al., 13 Jan 2026).
In federated learning, Federated Label-mixture Regularization (FLR) extends ELR by blending global and local running averages to defend against both client-side and aggregated memorization of noisy labels. FLR achieves superior test accuracy relative to baselines (FedAvg, DivideMix, ELR) across CIFAR-100 and Clothing1M under synthetic and real noise conditions, with the largest gains when noise is severe or class distributions are heterogeneous (Kim et al., 2024).
5. Algorithmic Implementation and Practical Considerations
ELR is computationally lightweight, requiring only storage of one running-average vector per sample over the course of training. The memory update is efficient, and per-minibatch computation of the regularization term involves only a dot product and log per sample (Liu et al., 2020).
- Hyperparameters are tuned via validation performance; ELR’s efficacy is stable across a broad range of $\beta$ and $\lambda$.
- In federated settings, the FLR blending parameters are tuned to optimize suppression of local versus global memorization.
- ELR can be combined with sample-selection, data-augmentation (mixup), and ensembling strategies without conflict.
- For multi-label tasks, regularization is applied per output entry, with grid search over regularizer strength to avoid under- or overfitting (Burgert et al., 13 Jan 2026).
Overhead can be minimized by batch-level ensembling if per-sample memory storage is a constraint (Kim et al., 2024).
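As a back-of-envelope check of the per-sample storage cost mentioned above (dataset scales here are illustrative):

```python
# One fp32 running-average vector per training sample.
def target_memory_mib(n_samples, n_classes, bytes_per_float=4):
    return n_samples * n_classes * bytes_per_float / 2**20

cifar100_mib = target_memory_mib(50_000, 100)        # ≈ 19 MiB
large_gib = target_memory_mib(2_400_000, 1000) / 1024  # several GiB
```

At CIFAR scale the targets are negligible, but for datasets with millions of samples and many classes the storage motivates the batch-level ensembling variants above.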
6. Limitations, Open Challenges, and Future Directions
While ELR achieves robust performance against noisy labels, several limitations and research opportunities remain:
- Theoretical guarantees of the early-learning/memorization dichotomy are restricted to linear models and Gaussian mixtures; rigorous analysis for nonlinear deep architectures is still open (Liu et al., 2020).
- ELR does not explicitly identify which labels are corrupted; integration with label-correction or advanced sample selection may yield further improvements.
- The choice of similarity metric in the regularizer (log inner-product vs. KL/MSE) impacts effectiveness; the log inner-product is empirically optimal against confirmation bias, though alternatives remain open for systematic study (Liu et al., 2020).
- In federated contexts, computational and storage overhead of running averages, especially with large-scale decentralized data, prompts investigation of mini-batch memory protocols (Kim et al., 2024).
- Robustness in extreme class imbalance, adversarial noise regimes, or feature-label corruption scenarios has not been fully characterized.
- Extending ELR/FLR to text or other non-image modalities, and integrating into production systems as first-line error detectors, are areas indicated for exploration (Galatolo et al., 2021, Kim et al., 2024).
A plausible implication is that early-learning-based regularization, by anchoring model updates to its own trustworthy predictions, can be widely adapted as a principled regularizer for large-scale, noisy and heterogeneous data environments in both centralized and decentralized learning paradigms.
7. Contextual Significance and Related Methods
ELR represents an advancement over prior noise-robust approaches such as sample-selection (Co-Teaching, MentorNet), loss-correction (Forward, PENCIL), and robust losses (MAE, Generalized CE, Symmetric CE), by not requiring external meta-labels or explicit detection of noise (Liu et al., 2020). Early-stopping as regularization, analyzed in incremental iterative regularization studies (Rosasco et al., 2014), is a related concept: the number of training epochs itself controls bias–variance trade-off, balancing fit quality and avoidance of noise overfitting. ELR operationalizes this regularization dynamically, anchoring updates via memory of early fitting.
ELR integrates with modern data augmentation frameworks, federated learning architectures (FedAvg, FedCorr, FedProx), and advanced optimization techniques (Sharpness-Aware Minimization), demonstrating compatibility and cumulative improvements in robustness to various forms of annotation noise (Galatolo et al., 2021, Kim et al., 2024).
Ongoing research continues to explore optimal strategies for combining ELR with confidence-based label handling, running average mechanisms, and cross-network ensembling to achieve enhanced generalization in diverse, large-scale, and noisy annotation environments.