Dynamic Labeling & Label Smoothing
- Dynamic labeling and label smoothing are regularization methods that transform fixed one-hot labels into adaptive, probabilistic targets to reduce neural network overconfidence.
- These techniques dynamically adjust smoothing intensity using model confidence, confusion matrices, or Bayesian estimates to capture inter-class similarities effectively.
- Empirical results show that dynamic approaches improve convergence, calibration, and noise robustness across domains such as vision, language, and graph learning.
Dynamic labeling and label smoothing encompass a spectrum of regularization strategies that replace static, one-hot targets with data-driven, probabilistic supervision signals. These methods, motivated by the persistent overconfidence of neural networks trained on hard labels, now include a growing suite of adaptive, instance-specific, and structurally informed mechanisms implemented across domains such as vision, language, and graph learning. This article systematically reviews foundational principles, adaptive variants, algorithmic designs, theoretical guarantees, and empirical outcomes, drawing on the latest results in the research literature.
1. Foundational Principles: Overconfidence and Uniform Label Smoothing
Standard label smoothing regularization (LSR) interpolates hard labels with a fixed prior, typically the uniform distribution, to penalize low-entropy model predictions. For a $K$-class classification problem, given a one-hot target $y$ and the uniform prior $u = 1/K$, the smoothed label for smoothing parameter $\epsilon \in [0, 1]$ is
$$\tilde{y} = (1 - \epsilon)\, y + \epsilon\, u.$$
This modifies the cross-entropy loss to
$$\mathcal{L} = (1 - \epsilon)\, H(y, p) + \epsilon\, H(u, p),$$
where $H(\cdot, p)$ is the standard log loss against the model's predictive distribution $p$.
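This interpolation and loss can be sketched directly in NumPy (function names here are illustrative, not from any cited paper):

```python
import numpy as np

def smooth_labels(y_onehot, eps):
    """Uniform label smoothing: (1 - eps) * y + eps / K."""
    K = y_onehot.shape[-1]
    return (1.0 - eps) * y_onehot + eps / K

def smoothed_cross_entropy(logits, y_onehot, eps):
    """Cross-entropy of model predictions against smoothed targets."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerically stable log-softmax
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(smooth_labels(y_onehot, eps) * log_p).sum(axis=-1).mean()
```

Setting `eps = 0` recovers the ordinary cross-entropy on hard labels, so the smoothed loss is a strict generalization.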
Uniform LSR prevents overconfident predictions, reduces the variance of stochastic gradients, and can accelerate convergence in SGD, particularly in the early phase of training. However, it applies the same smoothing magnitude and distribution to all examples, ignoring inter-class structure, semantic similarity, model confidence, and the evolution of learning dynamics (Xu et al., 2020).
2. Adaptive and Dynamic Label Smoothing: Mechanisms and Algorithms
Dynamic labeling refers to schemes in which the smoothing pattern or magnitude is adapted during training, per instance or per region in data space. Key instantiations include:
- Online Label Smoothing (OLS): The per-class smoothing prior is estimated dynamically from the network’s own evolving prediction statistics on correct samples. The core quantity is a matrix (one column per true class) updated epoch-wise by aggregating predicted probabilities on correctly classified samples. At each epoch, the loss is a convex combination of the hard-label and soft-label cross-entropy using (Zhang et al., 2020, Choudhury et al., 22 Oct 2025).
- Data-Driven Adaptive Smoothing (ALSSK, ALR): Both “Adaptive Label Smoothing with Self-Knowledge” (ALSSK) and “Adaptive Regularization of Labels” (ALR) modulate smoothing strength per instance, based on model confidence. ALSSK computes the entropy of each sample's predictive distribution and interpolates between the one-hot target and a “self-teacher” prior (selected from past model checkpoints based on a generalization metric), with an adaptive weight given by the normalized entropy $\alpha_i = H(p_i)/\log K$.
ALR instead maintains a residual label distribution S, learned online, that captures empirical error patterns and confusions, and updates model targets accordingly (Lee et al., 2022, Ding et al., 2019).
- Graph-Specific Adaptive Smoothing: In large-scale GNN training, “Adaptive Label Smoothing” (ALS) addresses subgraph-induced label bias. Label priors are propagated via neighborhood averaging; a small $K \times K$ smoothing matrix then learns global inter-class relevance, producing smoothed labels that are mixed with the original one-hots via a (typically scheduled) coefficient $\epsilon$ (Zhou et al., 2021).
- Pairwise and Structural Smoothing: “Pairwise Label Smoothing” (PLS) forms pairwise averages of both input and labels, further mixing with an input-dependent prior learned via an auxiliary layer. The smoothing coefficient can itself be predicted per-pair (Guo, 2020). “Structural Label Smoothing” (SLS) (Li et al., 2020) analytically balances smoothing strength per data region to minimize induced Bayes error bias, subject to maintaining a global average.
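As a concrete illustration, the epoch-wise OLS update described above can be sketched in NumPy (variable names are mine; the mixing coefficient and fallback behavior are simplifying assumptions):

```python
import numpy as np

def ols_epoch_update(probs, labels, num_classes):
    """Build the OLS soft-label matrix S (one column per true class)
    by averaging predicted distributions over correctly classified
    samples from one epoch."""
    S = np.zeros((num_classes, num_classes))
    counts = np.zeros(num_classes)
    correct = probs.argmax(axis=-1) == labels
    for p, y in zip(probs[correct], labels[correct]):
        S[:, y] += p
        counts[y] += 1
    for c in range(num_classes):
        if counts[c] > 0:
            S[:, c] /= counts[c]
        else:
            S[c, c] = 1.0  # fall back to one-hot for classes with no correct samples
    return S

def ols_target(S, label, eps):
    """Convex combination of the hard label and its learned soft column."""
    hard = np.zeros(S.shape[0])
    hard[label] = 1.0
    return (1.0 - eps) * hard + eps * S[:, label]
```

Each column of `S` remains a valid probability distribution, so the mixed target also sums to one.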
| Adaptive Smoothing Variant | Adaptation Performed | Prior/Mechanism |
|---|---|---|
| OLS | Class/epoch | Empirical confusion matrix S |
| ALSSK | Instance/iteration | Past model checkpoint or self-knowledge |
| ALR | Class/iteration | Online-learned residual correlation S |
| ALS (GNN) | Node/iteration | Neighborhood-propagation + global mixing |
| PLS | Input pair | Learned input-dependent prior |
| SLS | Data region (cluster) | Bias-aware closed-form per-cluster |
3. Bayesian and Information-Theoretic Perspectives
Recent work demonstrates that variational Bayesian learning inherently induces adaptive, per-example label smoothing, without handcrafted adaptation rules. In the IVON algorithm (Yang et al., 11 Feb 2025), the variational posterior covariance translates to an example-specific smoothing magnitude $\epsilon_i$, computed in closed form from the model's predictive variance. Ambiguous or mislabeled inputs receive larger smoothing, while confident ones remain minimally regularized.
Adaptive smoothing can also be motivated information-theoretically: entropy-based schedules automatically shrink gradients on overconfident predictions, regularizing heavily only where evidence is weak or decision boundaries are uncertain. Theoretical results show such dynamic regularization can reduce both optimization variance and generalization error, and closely relates to regularization in knowledge distillation (gradient reweighting) (Lee et al., 2022, Xu et al., 2020).
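A minimal sketch of such an entropy-based schedule (the linear scaling rule below is a generic illustration, not any specific paper's formula):

```python
import numpy as np

def entropy_scaled_eps(probs, eps_max=0.2):
    """Per-example smoothing strength proportional to normalized
    prediction entropy: uncertain examples receive up to eps_max,
    confident ones stay almost unregularized."""
    K = probs.shape[-1]
    H = -(probs * np.log(probs + 1e-12)).sum(axis=-1)  # Shannon entropy per example
    return eps_max * H / np.log(K)                     # normalize to [0, eps_max]
```

A uniform prediction receives the full budget `eps_max`, while a near-one-hot prediction receives almost none, matching the intent of regularizing only where evidence is weak.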
4. Structural and Region-Adaptive Smoothing
Uniform smoothing distorts the Bayes error rate (BER) nonuniformly across the feature space, with higher distortion in low-overlap, high-density regions. SLS explicitly formulates a per-cluster smoothing assignment $\{\epsilon_k\}$ subject to the constraint
$$\sum_k w_k\, \epsilon_k = \epsilon,$$
where $k$ indexes data clusters, $w_k$ are their weights, and each $\epsilon_k$ is derived from that cluster's BER estimate. This scheme guarantees that the average smoothing matches the global $\epsilon$ while locally mitigating BER bias (Li et al., 2020).
Region-adaptive approaches, including SLS and ALASCA (which connects LS to implicit Lipschitz regularization), aim to impose stronger regularization near high-overlap/boundary regions and weaker regularization inside class-consistent islands, thereby improving calibration and sample efficiency, especially under heteroscedastic or noisy labels (Ko et al., 2022).
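A simplified heuristic version of this region-adaptive allocation can make the budget constraint concrete (the real SLS assignment is a closed-form derivation; the proportional rule here is only illustrative):

```python
import numpy as np

def cluster_eps(weights, ber_estimates, eps_global):
    """Give clusters with higher estimated Bayes-error overlap more
    smoothing, while keeping the weighted average equal to the
    global budget eps_global."""
    w = np.asarray(weights, dtype=float)
    b = np.asarray(ber_estimates, dtype=float)
    eps = b.copy()                   # raw allocation: proportional to overlap
    eps *= eps_global / (w @ eps)    # enforce sum_k w_k * eps_k = eps_global
    return eps
```

The renormalization step preserves the global average exactly, so only the distribution of smoothing across regions changes.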
5. Empirical Performance and GNN/Domain-Specific Applications
Extensive evaluations demonstrate the superiority of dynamic and adaptive smoothing, summarized below:
- General classification: Online and adaptive smoothing variants yield consistent improvements of 0.5–3.5% in top-1 accuracy over static LS, as shown on CIFAR-100, ImageNet, INAT21, and fine-grained datasets (Zhang et al., 2020, Liang et al., 2022).
- Noisy/heterogeneous labels: ALASCA improves test accuracy by up to 10% under 20–80% symmetric or asymmetric label noise, outperforming both standard LS and heavier explicit Jacobian regularization (Ko et al., 2022).
- Graph neural networks: ALS (GNN) improves test accuracy by 0.2–0.6% over vanilla or uniform smoothing across diverse GNN backbones and benchmarks, and controls overfitting from subgraph-induced label bias (Zhou et al., 2021).
- Medical imaging: OLS reduces Expected Calibration Error (ECE) by up to 90% relative to static LS, improves top-1/top-5 accuracy (RadImageNet), and yields more compact and separated latent representations (Choudhury et al., 22 Oct 2025).
- Pairwise and region-aware smoothing: PLS delivers 20–30% relative error reduction over baselines or uniform LS in vision benchmarks, though predicted confidences become highly conservative (requiring post-hoc calibration) (Guo, 2020). SLS consistently reduces BER and accelerates convergence, especially on heterogeneously clustered data (Li et al., 2020).
- Variational methods: IVON (variational adaptive smoothing) surpasses static LS and state-of-the-art sharpness-aware minimization (SAM) in both synthetic (CIFAR-10/100, pairflip noise) and real (Clothing1M) benchmarks (Yang et al., 11 Feb 2025).
6. Implementation, Scheduling, and Practical Guidelines
Dynamic smoothing requires careful implementation:
- Update schedule: For OLS and ALSSK, per-epoch updates to confusion priors or self-teacher checkpoints are empirically optimal; overly frequent updates introduce noise, infrequent updates lag adaptation.
- Mixing coefficient selection: In ALSSK, the adaptive weight $\alpha$ is set by the normalized prediction entropy; in OLS, a fixed mixing coefficient is effective, with early anchoring to hard labels recommended.
- Auxiliary modules: PLS and ALASCA require specific architectural changes—auxiliary heads for dynamic priors or intermediate representation regularization.
- Memory considerations: Storage and update of per-class or per-region confusion matrices (e.g., the $K \times K$ matrix for OLS) are practical up to moderate numbers of classes $K$.
- Combining with other regularizers/optimizers: Adaptive smoothing methods can be layered with knowledge distillation, CutOut, or SAM; ablation studies indicate additive benefits (Zhang et al., 2020, Ko et al., 2022, Yang et al., 11 Feb 2025).
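The early-anchoring recommendation above can be expressed as a simple warmup schedule (the default values are placeholders, not settings from any cited paper):

```python
def eps_schedule(epoch, total_epochs, eps_max=0.5, warmup_epochs=5):
    """Use hard labels only during warmup, then ramp the soft-label
    mixing coefficient linearly up to eps_max."""
    if epoch < warmup_epochs:
        return 0.0
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return eps_max * min(1.0, progress)
```

The returned coefficient can be plugged into any of the mixing rules above as the hard/soft interpolation weight for that epoch.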
| Method | Typical Hyperparameters | Update Location/Interval | Typical Overhead |
|---|---|---|---|
| OLS | Mixing coefficient $\epsilon$; epoch-wise update | End of each epoch | Minimal |
| ALSSK | None (adaptive $\alpha$); teacher checkpoint frequency | Per forward pass; per epoch | Modest (KD-like) |
| PLS | None fixed; pairwise coefficient learned/optional | Per minibatch | Extra linear layer |
| SLS | Global $\epsilon$; number of clusters | One-time precompute | Clustering + MST |
| ALS (GNN) | $\epsilon$ schedule; label-propagation steps | Scheduled per epoch | Matrix-vector ops |
7. Theoretical Guarantees and Open Challenges
Theoretical analysis confirms that dynamic or adaptive smoothing reduces gradient variance, speeds early-stage SGD convergence, and provides effective regularization against overconfident fits and label noise. Under reasonable assumptions, approaches like TSLA provably converge to stationary points at a faster rate than vanilla SGD with static LS (Xu et al., 2020). Bayesian and structural formulations guarantee that adaptivity mitigates error-rate distortion and aligns regularization strength with model uncertainty or data geometry (Yang et al., 11 Feb 2025, Li et al., 2020).
Nevertheless, open challenges remain regarding: (1) robustness of smoothing adaptation under rapidly drifting domains or highly imbalanced classes, (2) integration with non-discrete or structured labels, and (3) formalizing trade-offs between underconfidence (as in highly conservative PLS) and optimal calibration.
For recent developments, see (Zhou et al., 2021, Liang et al., 2022, Lee et al., 2022, Yang et al., 11 Feb 2025, Ko et al., 2022, Zhang et al., 2020, Choudhury et al., 22 Oct 2025, Guo, 2020, Li et al., 2020), and (Ding et al., 2019).