HALD: Hard Label for Local Semantic Drift
- The paper introduces HALD, a hybrid training paradigm that interleaves hard-label calibration with soft-label phases to alleviate local semantic drift in dataset distillation.
- HALD employs a three-stage schedule—pretraining, hard-label calibration using CutMix and label smoothing, and refinement—to reduce misalignment and systematic errors.
- Experimental evaluations on ImageNet-1K and Tiny-ImageNet show significant accuracy gains and improved gradient alignment under tight storage constraints.
Hard Label for Alleviating Local Semantic Drift (HALD) is a training paradigm designed to mitigate local semantic drift arising from the use of finite soft-label supervision in dataset distillation and knowledge transfer. HALD directly addresses the challenge that, under limited soft-label coverage, crops or augmentations of a synthetic or real image can yield teacher-generated soft labels whose semantics substantially deviate from the intended class, resulting in a misalignment between local content and global semantic annotation. HALD leverages hard labels as intermittent corrective signals to provide a content-agnostic class anchor, thereby restoring alignment, reducing systematic errors, and overcoming the bias and variance limitations inherent to soft-label–only protocols (Cui et al., 17 Dec 2025).
1. Local-View Semantic Drift: Origin and Theoretical Analysis
Local-View Semantic Drift (LVSD) arises when the soft-label distribution provided by a teacher model is inconsistent across locally sampled views (e.g., augmentations or crops) of the same underlying image. Formally, for a distilled image $x$ and a cropping transformation $c$ drawn from a crop distribution $\mathcal{C}$, the teacher generates a $K$-way soft label $p(c(x))$ for each crop $c(x)$. The mean $\mu(x) = \mathbb{E}_{c}[p(c(x))]$ and covariance $\Sigma(x) = \mathrm{Cov}_{c}[p(c(x))]$ characterize the supervision signal. LVSD is present whenever $\Sigma(x) \neq 0$.
Key results demonstrate that, for a finite set of $N$ soft-label crops, the empirical soft-label average $\bar{p}_N(x) = \frac{1}{N}\sum_{i=1}^{N} p(c_i(x))$ concentrates to the crop-averaged mean $\mu(x)$, with covariance shrinking only as $O(1/N)$: $\mathrm{Cov}[\bar{p}_N(x)] = \Sigma(x)/N$, where $\Sigma(x)$ is the per-crop soft-label covariance.
Additionally, the bias in the empirical loss and the resulting generalization penalty due to a limited crop count $N$ are lower-bounded by a term on the order of $C\sigma^2/N$ involving the curvature, where $C$ is a constant, $\sigma^2$ is the per-crop loss variance, and $H$ and $\Sigma$ refer to the Hessian and soft-label covariance at the minimizer. Thus, any practical soft-label distillation protocol with finite crop coverage incurs a systematic and slow-decaying error component (Cui et al., 17 Dec 2025).
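The $O(1/N)$ concentration behavior can be checked with a small Monte Carlo sketch. The Dirichlet crop-label model below is an illustrative assumption, not the paper's generative model; it only mimics crop-to-crop soft-label variation around a class-skewed mean:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_crop_soft_labels(n_crops, n_classes=10, concentration=5.0):
    # Each crop's teacher soft label is drawn from a Dirichlet centered
    # on a fixed class-skewed mean (assumed model of crop-to-crop drift).
    alpha = np.full(n_classes, 1.0)
    alpha[0] = concentration  # intended class gets most mass on average
    return rng.dirichlet(alpha, size=n_crops)

def empirical_mean_variance(n_crops, trials=2000):
    # Variance (across trials) of the averaged soft label's first coordinate.
    means = np.array([sample_crop_soft_labels(n_crops).mean(axis=0)[0]
                      for _ in range(trials)])
    return means.var()

v10, v100 = empirical_mean_variance(10), empirical_mean_variance(100)
print(v10 / v100)  # ratio near 10: variance decays as O(1/N)
```

Increasing the crop count tenfold shrinks the variance of the empirical soft-label mean roughly tenfold, which is exactly the slow $O(1/N)$ decay the bound above penalizes.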
2. HALD: Hybrid Loss Formulation and Rationale
HALD proposes an explicit three-stage hybrid training schedule, alternating between soft-label and hard-label supervision to counteract LVSD. Instead of relying solely on a soft-label cross-entropy loss, HALD interleaves a "calibration" phase using hard labels. Let $q_\theta(\cdot \mid x)$ denote the student's predicted class distribution and $\mathrm{CE}(\cdot,\cdot)$ the cross-entropy.
HALD stages:
- Stage A (soft pretraining): Minimize the cross-entropy between soft labels and predictions on a fixed pool of soft-label crops.
- Stage B (hard-label calibration): Introduce hard-label supervision via CutMix-augmented samples, with label smoothing applied. The calibration loss is optimized over randomly mixed sample pairs $(x_i, x_j)$.
- Stage C (soft refinement): Resume soft-label distillation as in Stage A.
In formal terms, a mixed-objective version would use $\mathcal{L} = (1-\lambda)\,\mathcal{L}_{\text{soft}} + \lambda\,\mathcal{L}_{\text{hard}}$ for a mixing coefficient $\lambda \in [0,1]$, but HALD empirically employs a discrete-phase approach for improved alignment (Cui et al., 17 Dec 2025).
Theoretical justification is provided via gradient-similarity analyses. The cosine similarity between the gradients of $\mathcal{L}_{\text{soft}}$ and $\mathcal{L}_{\text{hard}}$ increases during training, indicating that hard labels act as a low-variance, content-agnostic anchor, aligning the student's trajectory with the intended global class semantics.
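The gradient-similarity measurement can be illustrated on a single softmax output, where the cross-entropy gradient with respect to the logits has the closed form $q - t$ for target $t$. The logits and teacher soft label below are assumed values, chosen so the student is still misclassifying; in that regime both supervision signals push in a similar corrective direction:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def ce_grad_logits(logits, target):
    # Gradient of cross-entropy w.r.t. logits: softmax(logits) - target.
    return softmax(logits) - target

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

logits = np.array([-1.0, 2.0, 0.0, 0.0])   # student currently favors class 1
hard   = np.array([1.0, 0.0, 0.0, 0.0])    # one-hot anchor for class 0
soft   = np.array([0.7, 0.2, 0.05, 0.05])  # assumed teacher soft label

g_soft = ce_grad_logits(logits, soft)
g_hard = ce_grad_logits(logits, hard)
print(cosine(g_soft, g_hard))  # high cosine: both push toward class 0
```

When the student's prediction is far from the intended class, the soft- and hard-label gradients are nearly parallel, which is the alignment regime HALD exploits.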
3. Algorithmic Schedule and Implementation
HALD's protocol defines precise sequencing of training phases based on pre-computed soft-label resources and training budget:
- Precompute the soft-label pool by extracting multiple crops per image and storing the teacher logits.
- Set the total number of soft-label epochs $T_{\text{soft}}$ (estimated by running soft-only training to its validation plateau).
- Divide $T_{\text{soft}}$ between Stage A (pretraining) and Stage C (refinement); allocate the remainder of the training budget to Stage B (calibration).
- In Stage B, employ CutMix with a Beta-distributed mixing ratio, label smoothing, and random crop pairing for hard-label batches.
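The Stage B augmentation can be sketched as follows. This is a minimal NumPy rendition of CutMix with label smoothing for illustration; the smoothing coefficient, Beta parameter, and helper names are assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth_one_hot(y, n_classes, eps=0.1):
    # Label smoothing: (1 - eps) on the true class, eps spread uniformly.
    t = np.full(n_classes, eps / n_classes)
    t[y] += 1.0 - eps
    return t

def cutmix(x1, y1, x2, y2, n_classes, alpha=1.0, eps=0.1):
    # CutMix-style mixing: paste a random box from x2 into x1 and mix
    # the smoothed hard labels by the actual pasted-area ratio.
    lam = rng.beta(alpha, alpha)
    h, w = x1.shape[:2]
    rh, rw = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = rng.integers(h), rng.integers(w)
    t0, t1 = max(cy - rh // 2, 0), min(cy + rh // 2, h)
    l0, l1 = max(cx - rw // 2, 0), min(cx + rw // 2, w)
    mixed = x1.copy()
    mixed[t0:t1, l0:l1] = x2[t0:t1, l0:l1]
    area = (t1 - t0) * (l1 - l0) / (h * w)
    target = (1 - area) * smooth_one_hot(y1, n_classes, eps) \
             + area * smooth_one_hot(y2, n_classes, eps)
    return mixed, target

img_a, img_b = rng.random((32, 32, 3)), rng.random((32, 32, 3))
mixed, target = cutmix(img_a, 3, img_b, 7, n_classes=10)
print(mixed.shape, target.sum())  # target remains a valid distribution
```

Mixing the smoothed hard labels by the pasted-area ratio keeps the target a proper probability distribution while forcing the student to commit to class-level content regardless of which local region it sees.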
Empirically, the optimal schedule assigns Stage A and Stage C equal durations, with hard-label calibration filling the remaining epoch budget. Label smoothing in the calibration phase further stabilizes training (Cui et al., 17 Dec 2025).
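The phase sequencing above can be sketched as a simple epoch-to-stage mapping. The equal A/C split and the concrete epoch counts here are illustrative assumptions:

```python
def hald_stage(epoch, t_soft, t_total, split=0.5):
    """Map an epoch index to a HALD stage.

    t_soft is the soft-label epoch budget (Stages A + C); the remaining
    t_total - t_soft epochs are spent on Stage B hard-label calibration.
    The 50/50 A/C split is a tunable assumption, not a fixed rule.
    """
    t_a = int(t_soft * split)   # Stage A: soft pretraining
    t_b = t_total - t_soft      # Stage B: hard-label calibration
    if epoch < t_a:
        return "A:soft-pretrain"
    if epoch < t_a + t_b:
        return "B:hard-calibrate"
    return "C:soft-refine"

schedule = [hald_stage(e, t_soft=200, t_total=300) for e in range(300)]
print(schedule[0], schedule[150], schedule[299])
# A:soft-pretrain B:hard-calibrate C:soft-refine
```

The discrete phases make the calibration signal explicit rather than diluting it into a per-batch mixed objective, which matches the paper's empirical preference for phase scheduling over joint mixing.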
4. Empirical Performance and Storage Efficiency
HALD has been validated on dataset distillation (Tiny-ImageNet, ImageNet-1K) and real data subsets, under aggressive storage constraints:
- On ImageNet-1K, with a 285 MB soft-label storage budget, HALD achieves 42.7% top-1 accuracy, surpassing LPLD by 9.0% absolute.
- With only 95 MB storage, HALD retains 36.9% accuracy versus LPLD's 14.3% under identical conditions.
- Cross-architecture experiments (ResNet-18/50, MobileNetV2, ShuffleNetV2, DenseNet121, EfficientNet-B0, VGG-11/16, ViT-Tiny) show universal gains from HALD (1–8% absolute improvements with IPC=10, SLC=100).
- On real ImageNet-1K subsets, HALD consistently improves student accuracy for varied subset sizes and storage allocations.
Ablation studies confirm the centrality of the hard-label phase: exclusion leads to severe drops in accuracy, greater misalignment of train/test loss landscapes, and substantially higher semantic drift as measured by crop-to-crop JS-divergence and cosine similarity. Joint soft–hard mixing or naïve phase orders (e.g., hard→soft→hard) underperform relative to the HALD protocol (Cui et al., 17 Dec 2025).
| Storage Budget (MB) | HALD Top-1 Acc. (%) | LPLD Top-1 Acc. (%) | Absolute Gain |
|---|---|---|---|
| 285 | 42.7 | 33.7 | +9.0 |
| 95 | 36.9 | 14.3 | +22.6 |
5. Mechanistic Insights and Theoretical Analysis
HALD is justified both empirically and theoretically as a means to correct soft-label bias and mitigate distribution misalignment:
- Semantic drift quantification: LVSD is dominant in ≳97% of images, with significantly higher variance in "strong" (visually ambiguous) crops versus "weak" ones.
- Train/Test loss alignment: HALD closes the gap between minima found on finite soft-label training and the generalization optimum, restoring congruence in the loss landscape.
- Gradient similarity: The average cosine between soft- and hard-label batch gradients grows to >0.9 during Stage A, supporting the theoretical bound for effective sample size enhancement.
- Semantic calibration: Stage B sharply reduces crop-to-crop divergence (JS drops from 0.18→0.04), increases crop prediction cosine similarity (0.74→0.96), and betters alignment with the infinite-coverage soft-label oracle.
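The crop-to-crop JS-divergence used as a drift metric above can be computed directly. The crop prediction vectors below are invented for illustration of a drifted versus a calibrated pair:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two discrete distributions.
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Assumed per-crop predictions: drifted crops disagree on the class,
# calibrated crops agree (values are illustrative, not from the paper).
drifted_a, drifted_b = [0.6, 0.3, 0.1], [0.2, 0.1, 0.7]
calib_a, calib_b     = [0.85, 0.10, 0.05], [0.80, 0.12, 0.08]
print(js_divergence(drifted_a, drifted_b),
      js_divergence(calib_a, calib_b))  # drifted pair diverges far more
```

A large crop-to-crop JS-divergence signals exactly the local-view inconsistency that Stage B is designed to suppress.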
Theoretically, under bounded gradient spread and loss smoothness, intermittent hard-label steps provide a low-variance class anchor, effectively increasing the usable soft-label sample size and reducing the generalization error lower bound (Cui et al., 17 Dec 2025).
6. Practical Considerations and Protocol Design
For maximal HALD effectiveness:
- Precompute soft-label pools with sufficient crop diversity (SLI ≥ 1, typically 1–2 crops per image for Tiny-ImageNet, 5–10 for ImageNet-1K).
- Estimate soft-label phase length via baseline soft-only convergence analysis (often ≈200 epochs).
- Configure Stage A and Stage C durations to split the soft-label budget evenly; use the remaining epochs for calibration.
- Employ CutMix and label smoothing during calibration.
- Resume soft-label refinement to recover fine-grained inter-class structure post-calibration.
- HALD is backbone-agnostic; compatible with CNNs and ViT variants.
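These recommendations can be collected into a single configuration sketch. Field names and default values below are assumptions for illustration, not an official API:

```python
from dataclasses import dataclass

@dataclass
class HALDConfig:
    """Illustrative HALD protocol knobs (names/defaults are assumed)."""
    crops_per_image: int = 5       # soft-label pool diversity (ImageNet-scale)
    t_soft: int = 200              # soft-label epoch budget (Stages A + C)
    soft_split: float = 0.5        # fraction of t_soft given to Stage A
    calib_epochs: int = 100        # Stage B hard-label calibration epochs
    cutmix_alpha: float = 1.0      # Beta(alpha, alpha) mixing ratio
    label_smoothing: float = 0.1   # assumed smoothing coefficient

cfg = HALDConfig()
print(cfg.t_soft + cfg.calib_epochs)  # total epoch budget: 300
```

Grouping the knobs this way makes the two budgets explicit: the soft-label budget (estimated from soft-only convergence) and the calibration budget layered on top of it.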
Under severe storage constraints (SLC ≤ 150), HALD substantially outperforms any soft-only method, with gains of 8–20% absolute accuracy across storage budgets (Cui et al., 17 Dec 2025).
7. Comparative Evaluation and Broader Significance
HALD re-establishes the utility of hard labels within the modern soft-label–dominant distillation landscape. The content-agnostic anchor provided by hard-label calibration curbs LVSD-induced systematic errors not captured by standard soft-label averaging or joint objectives. Baseline methods based on joint loss mixing or alternative phase scheduling manifest inferior alignment and generalization performance.
The central insight is that hard labels, when judiciously incorporated, serve not as a blunt tool but as a corrective signal that harmonizes soft-label–driven fine-tuning with robust class-level grounding. This perspective calls for a renewed consideration of hybrid loss designs, particularly in the context of data- and compute-efficient knowledge transfer at scale (Cui et al., 17 Dec 2025).