
Soft-Label Training in Supervised Learning

Updated 25 November 2025
  • Soft-label training is a supervised learning approach that assigns full probability distributions as labels to capture inherent annotation uncertainty.
  • It enhances model calibration by aligning predictions with human-derived uncertainty, reducing overconfidence in ambiguous cases.
  • Implementing soft-label training involves replacing one-hot vectors with empirical distributions in the loss function, leading to measurable improvements in KL divergence and entropy correlation.

Soft-label training is a supervised learning paradigm in which target labels for each instance are specified not as single "hard" class assignments (one-hot vectors), but as discrete probability distributions over all class labels that reflect uncertainty, ambiguity, or annotator disagreement. In contrast to traditional hard-label (one-hot) supervision, soft-label training aims to align model predictions with distributions that more faithfully represent epistemic uncertainty in the labeling process. This approach is motivated by the observation that collapsing multiple human annotations to a single label discards meaningful information about label ambiguity and diversity of human judgment, often resulting in models that display unwarranted overconfidence on inherently ambiguous or difficult cases (Singh et al., 18 Nov 2025).

1. Definition and Mathematical Framework

In standard multiclass classification, each training example $(x_i, y_i)$ consists of features $x_i$ and a hard label $y_i \in \{1, \ldots, K\}$, with supervision typically implemented as a one-hot vector in the cross-entropy loss. In the soft-label framework, the label is replaced by a distribution $p_i = [p_{i1}, \ldots, p_{iK}]$ such that $p_{ik} \geq 0$ and $\sum_k p_{ik} = 1$, representing, for example, the empirical frequency of votes among human annotators.
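As a concrete illustration, here is a minimal sketch of turning raw annotator votes into an empirical soft label; the vote array and class count are hypothetical:

```python
import numpy as np

def votes_to_soft_label(votes, num_classes):
    """Convert per-annotator class votes into an empirical
    probability distribution over the classes."""
    counts = np.bincount(votes, minlength=num_classes)
    return counts / counts.sum()

# Hypothetical example: 6 annotators label one item from K = 3 classes.
votes = np.array([0, 0, 2, 0, 2, 1])
p = votes_to_soft_label(votes, num_classes=3)
print(p)  # [0.5 0.16666667 0.33333333] -- sums to 1
```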

The soft-label cross-entropy loss for a prediction $q = [q_1, \ldots, q_K]$ and a soft target $p = [p_1, \ldots, p_K]$ is

$$L_{CE}(p, q) = -\sum_{k=1}^{K} p_k \log q_k$$

This subsumes the hard-label case as the special instance where $p$ is one-hot. Alignment between model and human uncertainty can be quantitatively evaluated via the Kullback-Leibler divergence

$$\mathrm{KL}(p \parallel q) = \sum_k p_k \log \frac{p_k}{q_k}$$

and the predictive entropy

$$H(q) = -\sum_{k=1}^{K} q_k \log q_k$$

(Singh et al., 18 Nov 2025).
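These quantities are straightforward to compute; below is a minimal NumPy sketch in which the vectors `p` and `q` are hypothetical examples and a small epsilon guards against $\log 0$:

```python
import numpy as np

EPS = 1e-12  # numerical floor so log never sees an exact zero

def cross_entropy(p, q):
    """Soft-label cross-entropy: -sum_k p_k log q_k."""
    return -np.sum(p * np.log(q + EPS))

def kl_divergence(p, q):
    """KL(p || q); terms with p_k = 0 contribute 0 by convention."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / (q[mask] + EPS)))

def entropy(q):
    """Shannon entropy: -sum_k q_k log q_k."""
    mask = q > 0
    return -np.sum(q[mask] * np.log(q[mask]))

p = np.array([0.5, 0.3, 0.2])  # hypothetical annotator vote frequencies
q = np.array([0.6, 0.3, 0.1])  # hypothetical model softmax output
# Sanity check of the decomposition L_CE(p, q) = H(p) + KL(p || q):
assert np.isclose(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
```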

2. Epistemic Uncertainty and Theoretical Justification

Hard-label training incentivizes models to make maximally confident predictions even on data points where annotators disagree—a regime that is epistemically inconsistent with ambiguous ground-truth annotation. Empirical annotation distributions encode epistemic (as opposed to merely aleatoric or label noise) uncertainty, and treating these distributions as targets in training enables models to express this uncertainty in their output probabilities (Singh et al., 18 Nov 2025).

Mathematically, since minimizing cross-entropy with soft labels is equivalent to minimizing $\mathrm{KL}(p \parallel q)$ up to an additive constant (the entropy $H(p)$), the optimization objective directly encourages $q \approx p$, enabling models to track both the central tendency and the uncertainty of annotation distributions. This contrasts with hard-label training, which collapses the support of $p$ to a single class and incentivizes maximal certainty regardless of underlying ambiguity.
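Written out, the decomposition behind this equivalence is a one-line identity:

$$L_{CE}(p, q) = -\sum_k p_k \log q_k = \underbrace{-\sum_k p_k \log p_k}_{H(p)} + \underbrace{\sum_k p_k \log \frac{p_k}{q_k}}_{\mathrm{KL}(p \parallel q)}$$

Since $H(p)$ does not depend on the model output $q$, minimizing $L_{CE}(p, q)$ over $q$ is precisely minimizing $\mathrm{KL}(p \parallel q)$, which attains its minimum of zero at $q = p$.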

3. Empirical Benefits

Quantitative studies demonstrate that soft-label training regimes yield several advantages over standard hard-label training:

  • Uncertainty Calibration: Soft-label training achieves lower KL divergence to annotation distributions (a 32% reduction on average) and a much higher correlation between model and annotation entropy (a 61% improvement), demonstrating superior alignment with human-level uncertainty. For example, on CIFAR-10H-Hard, model KL drops from 0.596 (hard) to 0.406 (soft), and entropy correlation improves from 0.353 to 0.493 (Singh et al., 18 Nov 2025); a sketch of this correlation metric appears after this list.
  • Accuracy: Soft-label models match, and on some NLP tasks exceed, hard-label accuracy. On ChaosNLI, soft-label training significantly improves accuracy (55.3% vs. 51.75%, $p < 0.001$), while on vision tasks accuracy is maintained within statistical uncertainty.
  • Overfitting Resistance: Models trained with soft labels demonstrate more stable learning dynamics, continuing to improve on validation loss over more epochs, whereas hard-label models tend to overfit earlier.
  • Robustness Across Annotation Sparsity: Gains are observed even with sparse annotations per example (as few as 6 per instance), without the need for label smoothing or regularization beyond using the raw empirical vote distribution (Singh et al., 18 Nov 2025).
  • Calibrated Confidence Estimates: Predictions are better calibrated, which is critical for applications where reliable uncertainty estimates are required in downstream decision-making.
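As referenced in the calibration bullet above, the entropy-correlation metric can be computed as in the following sketch; the array names are hypothetical, and the paper's exact evaluation code is not reproduced here:

```python
import numpy as np

def row_entropy(dist):
    """Shannon entropy of each row; 0 * log(0) is treated as 0."""
    safe = np.clip(dist, 1e-12, 1.0)
    return -np.sum(dist * np.log(safe), axis=-1)

def entropy_correlation(annotator_dists, model_probs):
    """Pearson correlation between human and model uncertainty.

    annotator_dists, model_probs: (num_examples, num_classes)
    arrays whose rows each sum to 1.
    """
    return np.corrcoef(row_entropy(annotator_dists),
                       row_entropy(model_probs))[0, 1]
```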

4. Implementation Methodology

Soft-label training can be implemented with minimal architectural modifications:

  • Input Targets: Each sample is provided with its full empirical annotation distribution $p$ in place of a hard label.
  • Loss Function: The model outputs softmax probabilities $q$ and is trained with the cross-entropy loss $L_{CE}(p, q)$; a minimal training-step sketch follows this list.
  • Architectures: The method applies to both vision (e.g., DINOv2 Small + MLP) and NLP (e.g., OpenAI Text Embeddings + MLP) backbones (Singh et al., 18 Nov 2025).
  • Optimization: Standard optimizers (e.g., Adam) and hyperparameter grid search are used. Early stopping and identical random seeds are recommended to ensure comparability with hard-label baselines.
  • Sparse Annotations: Direct use of empirical frequencies is sufficient; no ad-hoc smoothing is required, and benefits are preserved even as annotation count per example decreases.
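Putting these pieces together, here is a minimal PyTorch sketch of one soft-label training step; the embedding dimension, MLP head, and hyperparameters are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative MLP head over precomputed embeddings
# (e.g., DINOv2 or text-embedding features); sizes are assumed.
EMBED_DIM, NUM_CLASSES = 384, 10
model = nn.Sequential(
    nn.Linear(EMBED_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, NUM_CLASSES),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(embeddings, soft_targets):
    """One gradient step with soft-label cross-entropy.

    embeddings:   (batch, EMBED_DIM) float tensor
    soft_targets: (batch, NUM_CLASSES) rows summing to 1
    """
    logits = model(embeddings)
    log_q = F.log_softmax(logits, dim=-1)
    loss = -(soft_targets * log_q).sum(dim=-1).mean()  # L_CE(p, q)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In recent PyTorch releases, `F.cross_entropy(logits, soft_targets)` also accepts class-probability targets directly and computes the same loss.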

5. Practical and Epistemological Implications

By re-conceptualizing empirical annotation distributions as ground truth, soft-label training establishes a normative shift for ambiguous data:

  • Preserving Information: Soft targets prevent the epistemic information loss that occurs when converting multi-annotator labels or ratings into single-class decisions.
  • "Knowing What It Doesn't Know": Models trained in this fashion produce higher-entropy outputs on ambiguous samples, aligning predictive uncertainty with human judgment and facilitating more calibrated downstream use.
  • Improved Calibration and Robustness: This approach reduces the likelihood of false certainty, addressing a key failure mode in high-stakes or inherently subjective tasks such as politeness, natural language inference, and fine-grained vision categories.
  • Limitations: The main limitations are the increased annotation cost of collecting multiple labels per instance and the assumption that human disagreement encodes legitimate ambiguity rather than systematic bias or labeling error. Performance in structured prediction and regression settings has yet to be thoroughly established (Singh et al., 18 Nov 2025).

6. Connections to Broader Labeling and Learning Paradigms

Soft-label training generalizes several prior approaches:

  • Label Smoothing: A special case in which soft targets are convex combinations of one-hot and uniform distributions, not grounded in human annotator distributions (see the sketch after this list).
  • Knowledge Distillation: Soft labels derived from teacher model posteriors, with loss incurred via KL divergence or cross-entropy, but generally not grounded in human uncertainty.
  • Multi-Annotator Learning: Directly uses empirical annotator distributions or confidence-weighted aggregates, e.g., Bayesian calibration based on annotator reliability and secondary choices (Wu et al., 2023).
  • Epistemic vs Aleatoric Uncertainty: Soft-label targets derived from annotation distributions focus on epistemic sources, contrasting with synthetic label-smoothing that aims at regularization rather than explicit uncertainty tracking.
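To make the contrast with label smoothing concrete, here is a small sketch with hypothetical values:

```python
import numpy as np

def smoothed_target(hard_class, num_classes, eps=0.1):
    """Label smoothing: convex mix of one-hot and uniform."""
    target = np.full(num_classes, eps / num_classes)
    target[hard_class] += 1.0 - eps
    return target

# Label smoothing spreads mass uniformly, regardless of actual ambiguity...
print(smoothed_target(0, 3))      # [0.93333333 0.03333333 0.03333333]
# ...whereas an empirical soft label reflects where annotators disagreed.
print(np.array([3, 1, 2]) / 6.0)  # [0.5 0.16666667 0.33333333]
```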

In summary, soft-label training, by treating annotator disagreement distributions as ground truth, produces models that not only perform well on standard accuracy metrics but also mirror the diversity and uncertainty inherent in human perceptual and linguistic judgments, offering improved calibration, robustness, and epistemic validity in their outputs (Singh et al., 18 Nov 2025).
