Label Smoothing Regularization

Updated 9 March 2026
  • Label Smoothing Regularization is a technique that softens one-hot targets by blending ground-truth with a uniform or adaptive distribution, reducing overconfidence and improving model calibration.
  • Adaptive extensions, including negative, pairwise, and self-knowledge variants, tailor smoothing to control the trade-offs between accuracy, privacy, and optimization stability.
  • Empirical studies show that LSR enhances test accuracy and robustness while highlighting the need for careful tuning to balance improvements in generalization against privacy risks.

Label Smoothing Regularization (LSR) is a widely adopted regularization method that replaces hard one-hot targets in classification tasks with softened, probabilistic labels—typically interpolating between the ground truth indicator and a uniform distribution. Originally introduced to mitigate overconfidence in neural network outputs, LSR has become standard practice in deep learning due to its beneficial effects on generalization, calibration, and robustness across vision, language, and multimodal tasks. The landscape of LSR research now spans classic uniform smoothing, adaptive and pairwise variants, theoretical analyses of the optimization dynamics, privacy implications in adversarial contexts, and connections to entropy regularization and knowledge distillation.

1. Canonical Formulation and Generalizations

Standard LSR, in a $K$-class classification setting with one-hot label $y\in\{0,1\}^K$ ($y_i=1$ for the correct class, $0$ otherwise), introduces a smoothing parameter $\alpha\in[0,1]$. The smoothed target is defined as

$$\tilde y_i = (1-\alpha)\,y_i + \frac{\alpha}{K}$$

or equivalently, for all $i=1,\dots,K$, $\tilde y = (1-\alpha)y + (\alpha/K)\mathbf{1}$. The cross-entropy loss becomes

$$L^{LS}(y, p) = -\sum_{i=1}^K \tilde y_i \log p_i.$$

This removes the incentive for the model to assign arbitrarily confident probability to the correct class and encourages nonzero probability mass on the alternatives.
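
As a concrete illustration, the construction above can be sketched in a few lines of NumPy (the helper names `smooth_labels` and `smoothed_cross_entropy` are illustrative, not from any particular implementation):

```python
import numpy as np

def smooth_labels(y_onehot, alpha, K):
    """Blend one-hot targets with a uniform distribution over K classes."""
    return (1.0 - alpha) * y_onehot + alpha / K

def smoothed_cross_entropy(p, y_onehot, alpha):
    """Cross-entropy of predicted distribution p against the smoothed targets."""
    K = y_onehot.shape[-1]
    y_tilde = smooth_labels(y_onehot, alpha, K)
    return -np.sum(y_tilde * np.log(p), axis=-1)

y = np.array([0.0, 1.0, 0.0, 0.0])          # one-hot, correct class = 1, K = 4
y_tilde = smooth_labels(y, alpha=0.1, K=4)
# y_tilde = [0.025, 0.925, 0.025, 0.025]
```

Note that the smoothed target still sums to 1, so it remains a valid probability distribution.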

Generalizations permit nonuniform priors for smoothing, as in knowledge-distillation scenarios where the uniform distribution is replaced with a "teacher" distribution, or instance-dependent smoothing where the mixing parameter $\alpha$ is adapted per sample or across training epochs. Importantly, recent work extends the range of $\alpha$ to negative values ("negative label smoothing"), yielding

$$\tilde y = (1-\alpha)y + \frac{\alpha}{K}\mathbf{1}, \quad \alpha\in(-\infty,1].$$

Negative values induce targets that overshoot the one-hot assignment (true-class target $>1$, others $<0$), reinforcing model confidence on the correct class and penalizing incorrect classes more harshly (Struppek et al., 2023).
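
A quick numerical check of the overshooting behavior (values are illustrative; the helper name is not from the cited work):

```python
import numpy as np

def smooth_labels(y_onehot, alpha):
    """Smoothed targets; alpha < 0 overshoots the one-hot assignment."""
    K = y_onehot.shape[-1]
    return (1.0 - alpha) * y_onehot + alpha / K

y = np.array([0.0, 1.0, 0.0, 0.0])           # K = 4
y_neg = smooth_labels(y, alpha=-0.05)
# true class: 1.05 - 0.0125 = 1.0375 > 1; other classes: -0.0125 < 0
# the entries still sum to 1, but the target is no longer a distribution
```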

2. Mechanistic Insights: Regularization Effects, Theoretical Perspectives

Label smoothing regularizes neural networks by softening hard targets, which addresses several overfitting mechanisms:

  • Reduction of Overconfidence: By enforcing a lower bound on the entropy of the output distribution, models are prevented from saturating softmax probabilities near $1$ for the correct class. This mitigates poor generalization typical of overfit networks, improves out-of-distribution behavior, and aids calibration (Ding et al., 2019, Zhang et al., 2020).
  • Loss Surface Smoothing and Variance Reduction: LSR reduces the variance of stochastic gradients in SGD, accelerating convergence and improving optimization stability. Variance reduction is maximized when the auxiliary label (e.g., uniform) has lower gradient variance (Xu et al., 2020).
  • Bias-Variance Trade-off: The smoothing factor $\alpha$ presents a trade-off: small positive values improve generalization and calibration, but excessive smoothing may over-bias solutions and degrade accuracy, especially as the effective target diverges from the underlying true label distribution (Xu et al., 2020, Yamasaki et al., 2023).
  • Impossibility of Output Sparsity: Uniform label smoothing (equivalent to penalizing $\mathrm{KL}(u\|p)$) strictly forbids zero probability assignments on any class given nonzero mass by the prior, hindering applications where exact output sparsity is required (e.g., grammatical constraints, certain structured prediction tasks) (Meister et al., 2020).
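
One way to see the overconfidence cap concretely: the smoothed cross-entropy, minimized over the output distribution, is solved by $p=\tilde y$ itself, so best-case confidence on the true class is bounded by $1-\alpha+\alpha/K$. A small numerical check of this fixed point (the gradient of the loss with respect to the logits is $\mathrm{softmax}(z)-\tilde y$):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

K, alpha = 4, 0.1
y_tilde = np.full(K, alpha / K)
y_tilde[1] = 1 - alpha + alpha / K           # smoothed target, true class = 1

# Gradient of the smoothed CE w.r.t. logits z is softmax(z) - y_tilde.
z = np.log(y_tilde)                          # logits realizing p = y_tilde
grad = softmax(z) - y_tilde
# grad ~ 0: the optimum sits at finite confidence 0.925, not at p -> 1
```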

Advanced theoretical characterizations relate LSR to generalized entropy regularization and parametric families of divergence-based penalties (e.g., skew-Jensen divergences), with LS falling out as the special case of a $\mathrm{KL}(u\|p)$ penalty. Performance is typically governed by model output entropy, and the relationship between entropy and downstream task metrics (such as BLEU for NMT) is nonlinear; moderate entropy maximizes generalization (Meister et al., 2020).

3. Adaptive, Pairwise, and Instance-Specific Extensions

Several methods aim to overcome the limitations of global uniform smoothing:

  • Adaptive Label Regularization (ALR): Learns an online, per-class residual distribution to encode common misclassifications, capturing "dark knowledge" without reliance on external teachers. ALR adapts the smoothing target during training, leading to better transfer of inter-class correlations and improved performance on vision and text classification benchmarks (Ding et al., 2019).
  • Online Label Smoothing (OLS): Constructs updated soft targets from correctly classified examples’ moving-average predictions per class, yielding better alignment with true class similarities and improving calibration, noise robustness, and adversarial resistance (Zhang et al., 2020).
  • Pairwise/Midpoint Label Smoothing (PLS): Operates on pairs or synthetic midpoints between examples, defining the soft label as a blend of two ground-truth labels and a learned smoothing distribution. PLS dynamically increases label uncertainty in difficult regions, producing conservative softmax outputs and often stronger performance than classic LS—particularly reducing overconfidence in ambiguous or atypical samples (Guo, 2021, Guo, 2020).
  • Structural Label Smoothing (SLS): Assigns cluster-dependent smoothing strengths based on estimated Bayes-error overlap, mitigating regularization bias in low-overlap (easy) regions and enhancing overall accuracy relative to uniform LS (Li et al., 2020).
  • Bi-level and Self-knowledge Adaptive Smoothing: LABO (bi-level optimization) learns optimal per-instance label smoothing distributions by balancing fit to the model output and a confidence-penalty term (KL to uniform), yielding deterministic per-sample smoothing and further connecting LS to self-distillation under appropriate parameter regimes (Lu et al., 2023). Adaptive self-knowledge smoothing ties the smoothing strength to normalized entropy of the output and replaces the uniform prior with a snapshot of the model’s own past prediction, achieving superior calibration and generalization (Lee et al., 2022).
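
As a rough sketch of the Online Label Smoothing idea, one can maintain a per-class moving average of the model's predictions on correctly classified samples; the update rule, buffer handling, and momentum value below are illustrative simplifications, not the exact procedure of (Zhang et al., 2020):

```python
import numpy as np

K = 3
soft_targets = np.full((K, K), 1.0 / K)      # start from uniform per-class targets

def ols_update(soft_targets, probs, labels, momentum=0.9):
    """Accumulate moving-average predictions of correctly classified samples."""
    new = soft_targets.copy()
    preds = probs.argmax(axis=-1)
    for c in range(K):
        mask = (labels == c) & (preds == c)  # correct predictions for class c
        if mask.any():
            avg = probs[mask].mean(axis=0)
            new[c] = momentum * new[c] + (1 - momentum) * avg
    return new

# toy batch: two confident, correct predictions for class 0
probs = np.array([[0.8, 0.1, 0.1],
                  [0.9, 0.05, 0.05]])
labels = np.array([0, 0])
soft_targets = ols_update(soft_targets, probs, labels)
```

The row for each class thus drifts from uniform toward the model's own averaged beliefs, encoding observed class similarities.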

4. Impact on Generalization, Calibration, and Privacy

Generalization & Calibration

LSR consistently boosts out-of-sample accuracy and reduces overfitting across a range of tasks and architectures. For instance, in face recognition benchmarks, positive $\alpha$ increases test accuracy (e.g., ResNet-152/FaceScrub: 94.9% → 97.4% for $\alpha=0.1$) and sharply improves calibration metrics (ECE). In NMT and source code summarization, BLEU gains of $+0.5$ to $+1.8$ are typical for optimal smoothing settings (Haque et al., 2023).
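
Calibration improvements are commonly quantified with the Expected Calibration Error (ECE); a minimal binned-ECE sketch, with illustrative bin count and interface, looks like:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted gap between mean confidence and accuracy."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()
            conf = confidences[mask].mean()
            ece += mask.sum() / n * abs(acc - conf)
    return ece

# perfectly calibrated toy case: 0.8-confidence predictions, 80% correct
conf = np.full(10, 0.8)
corr = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0], dtype=float)
# expected_calibration_error(conf, corr) == 0.0
```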

Privacy Implications

Counterintuitively, standard positive label smoothing increases susceptibility to model inversion attacks (MIAs) by creating dense, tightly separated class clusters in feature space and stabilizing inversion optimization dynamics. Privacy leakage is thus amplified, as measured by increased MIA accuracy and decreased distances between synthesized and real data (Struppek et al., 2023). Negative label smoothing ($\alpha<0$), in contrast, suppresses class information leakage, reducing MIA success by large margins at modest accuracy cost; e.g., $\alpha=-0.05$ drops inversion success by 80 percentage points on FaceScrub, outperforming state-of-the-art privacy defenses without architectural changes.

Interpretation: The "V"-shaped utility–privacy trade-off curve in (Struppek et al., 2023) succinctly summarizes the effect: small positive $\alpha$ maximizes accuracy but also privacy risk; small negative $\alpha$ optimally impedes MIAs; intermediate values offer balanced trade-offs.

5. Failure Modes, Task-dependent Side Effects, and Mitigations

While LSR improves generalization and calibration, several adverse or subtle effects have been identified:

  • Selective Classification Degradation: LSR disrupts the rank-ordering of model uncertainty, particularly in selective classification where rejection decisions depend on the maximum softmax probability. LS suppresses the max logit more for correct predictions and less for errors, causing overconfident misclassified samples to be ranked as overly certain and undermining risk–coverage curves in critical applications. Post-hoc logit normalization (mean-centering or $p$-norm scaling) at test time can restore proper uncertainty ranking (Xia et al., 2024).
  • Representation Collapse and Error-amplification: LSR drives feature representations into excessively compact clusters and inadvertently exacerbates overconfidence on misclassified examples due to an error-amplification term in the logit-level loss decomposition. This is theoretically clarified in the MaxSup approach, which replaces ground-truth logit penalization with a penalty on the top-1 logit, yielding better intra-class diversity and mitigating excessive confidence on incorrect predictions (Zhou et al., 18 Feb 2025).
  • Inefficacy at High Label Noise: Positive label smoothing loses effectiveness and may exacerbate under-confidence in high-noise or highly entropic label regimes. Negative label smoothing (NLS), with $\alpha<0$, theoretically and empirically outperforms LS as noise increases, producing sharper separation between correct and incorrect predictions and subsuming many robust loss techniques (Wei et al., 2021).
  • Estimation Consistency and Efficiency: In parametric models (e.g., logistic regression), LS modifies both the loss structure ("smoothed KL") and the implied estimator—introducing logit squeezing and potential inconsistency in probability estimation. Modified LS losses that retain the softmax link (e.g., MLSLR) recover consistency properties and offer better 0–1 classification risk and robustness under model misspecification at modest efficiency loss (Yamasaki et al., 2023).
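
The post-hoc remedy for selective classification can be very lightweight; the sketch below applies mean-centering followed by $L_2$ normalization to the logits and scores samples by the resulting max value (a simplified rendering of the normalization family in (Xia et al., 2024), with an illustrative interface):

```python
import numpy as np

def normalized_max_logit(logits, eps=1e-12):
    """Mean-center each sample's logits, L2-normalize, use the max as score."""
    z = logits - logits.mean(axis=-1, keepdims=True)
    z = z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)
    return z.max(axis=-1)

logits = np.array([[5.0, 1.0, 0.0],     # peaked: large margin over the rest
                   [2.1, 2.0, 1.9]])    # flat: tiny margin
scores = normalized_max_logit(logits)
# the peaked sample gets the higher score regardless of overall logit scale
```

Because the score depends only on the shape of the logit vector, the global suppression that LS applies to all logits no longer distorts the ranking across samples.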

6. Practical Implementation and Best Practices

Table: Effects of positive and negative smoothing on privacy and utility (FaceScrub, ResNet-152) (Struppek et al., 2023)

α        Test Acc   MIA Acc@1   δ_face   ξ_train
 0.0     94.9%      94.3%       0.71     61.2%
+0.1     97.4%      95.2%       0.63     71.0%
-0.05    91.5%      14.3%       1.23     16.5%

  • Choice of α: For most applications, $\alpha\approx 0.05$–$0.2$ yields optimal test accuracy and calibration. For privacy-critical domains, small negative values (e.g., $\alpha\approx-0.05$) are recommended.
  • Adaptive extensions: When generalization and privacy must be balanced, sweeping $\alpha$ over $[-0.1, 0.1]$ and selecting the best utility–privacy trade-off is effective (Struppek et al., 2023).
  • Task-specific adaptations: For selective classification, post-hoc logit normalization at inference mitigates LS’s adverse impact on uncertainty ranking (Xia et al., 2024).
  • Combining with other regularization: LSR complements, but does not replace, methods such as dropout, weight decay, and data augmentation (Haque et al., 2023).
  • Integration overhead: Most modern deep learning frameworks support LSR as an argument to standard loss functions; many adaptive schemes add only trivial computational or storage cost.
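
In modern frameworks this is typically a one-line change (e.g., PyTorch's `torch.nn.CrossEntropyLoss` exposes a `label_smoothing` argument since version 1.10). For reference, a framework-free NumPy version with matching semantics might look like this (function name illustrative):

```python
import numpy as np

def cross_entropy(logits, target, label_smoothing=0.0):
    """Smoothed cross-entropy from raw logits, mirroring the semantics of
    the `label_smoothing` argument found in common deep learning frameworks."""
    K = logits.shape[-1]
    log_p = logits - logits.max()                    # stable log-softmax
    log_p = log_p - np.log(np.exp(log_p).sum())
    y = np.full(K, label_smoothing / K)
    y[target] += 1.0 - label_smoothing
    return -np.sum(y * log_p)

logits = np.array([2.0, 0.5, -1.0])
loss_hard = cross_entropy(logits, target=0)                    # standard CE
loss_ls = cross_entropy(logits, target=0, label_smoothing=0.1)
```

With mass shifted onto the low-probability classes, the smoothed loss is strictly larger than the hard-label loss whenever the model is confident in the correct class.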

7. Notable Variants and Task-specific Innovations

  • Weakly Supervised Label Smoothing (WSLS): In learning-to-rank tasks, the uniform prior of LS is replaced by a distribution based on negative sampler scores (e.g., BM25), injecting graded domain-specific information into the smoothing distribution and improving ranking (Penha et al., 2020).
  • Semantic Label Smoothing (Seq2Seq): For sequence-to-sequence models, smoothing is extended from tokens to semantically relevant alternative sequences identified using embedding similarity and BLEU overlap, outperforming token-level approaches in MT benchmarks (Lukasik et al., 2020).
  • DRO-integrated LS: Recent advances embed LS within a distributionally robust optimization (DRO) framework, formulating shift-augmented regularization and producing explicit adversarial feature-level perturbations that generalize models in few-shot and domain-shifted settings (Wang et al., 2024).
  • Bi-level and Knowledge-Distillation Extensions (LABO, LsrKD): Bi-level approaches optimize smoothed targets for each instance under a fit-to-model plus KL-to-uniform penalty. LS is equated to the special case of knowledge distillation from a uniform or self-ensembled teacher, generalizing the regularization mechanism and improving both accuracy and calibration (Lu et al., 2023, Wang et al., 2020).

Label Smoothing Regularization thus presents a robust, extensible regularization paradigm with deep theoretical foundations and a broad applicability footprint. Precision in the selection of smoothing parameters and adaptation of the smoothing distribution—uniform, instance-specific, or adaptive—can yield strong trade-offs in accuracy, calibration, privacy, and robustness, provided the task-specific pitfalls (selective classification, noise regimes) are judiciously addressed. The field continues to evolve with task-adaptive strategies, privacy-aware tuning, and connections to broader entropy-based model regularization.
