Dual-Granularity Sinkhorn Distillation (D-SINK)
- Dual-Granularity Sinkhorn Distillation (D-SINK) is a robust learning method that addresses class imbalance and label noise by combining insights from two specialized auxiliary models.
- It leverages a dual-granularity framework where imbalance-robust and noise-robust teachers guide surrogate label allocation via entropy-regularized optimal transport using the Sinkhorn algorithm.
- Empirical results on benchmark datasets show that D-SINK significantly improves performance on minority classes and maintains resilience against high rates of label noise.
Dual-granularity Sinkhorn Distillation (D-SINK) is a methodology for enhancing robustness and representational fidelity in learning problems characterized by heterogeneous data imperfections, notably the co-occurrence of class imbalance and label noise. The approach operates by distilling complementary information from specialized auxiliary models operating at distinct granularity levels—distributional (class-level) and sample-level—via an optimal transport framework leveraging the Sinkhorn algorithm. This article presents the theoretical foundations, optimization strategies, empirical results, and implications for robust learning as described in (Hong et al., 9 Oct 2025), along with context from related optimal transport and Sinkhorn theory.
1. Motivation and Problem Setting
The challenge addressed by D-SINK arises from the simultaneous presence of class imbalance (where some classes have far fewer samples, resulting in non-uniform class marginal distributions) and label noise (where individual instances have unreliable or corrupted labels). Existing algorithms targeting either issue (e.g., reweighting methods for imbalance, loss correction or sample-selection for noise) can inadvertently undermine performance when naively combined, as strategies to strengthen tail classes may magnify the impact of noisy labels and vice versa.
D-SINK exploits the observation that these two issues pertain to fundamentally different data granularities: class imbalance is a property of the overall label distribution, while label noise is an instance-level concern. The method proposes to combine the strengths of two weak auxiliary models—one optimized for class imbalance ("imbalance-robust teacher") and another for label noise ("noise-robust teacher")—by distilling their insights into the target model through a surrogate label allocation process optimized with Sinkhorn-based optimal transport.
2. Dual-Granularity Framework Architecture
D-SINK maintains three components:
- Imbalance-robust auxiliary model $f_{\mathrm{imb}}$: trained with state-of-the-art methods for long-tailed learning, supplying reliable class-marginal predictions.
- Noise-robust auxiliary model $f_{\mathrm{noise}}$: trained with robust learning under noisy labels, providing trustworthy sample-level predictions.
- Target model $f_{\theta}$: learns from dynamically constructed surrogate labels $q_i$ (probability vectors on the simplex) generated in each training batch.
The surrogate label allocation mechanism instantiates the "dual granularity": for each instance $x_i$, the surrogate label $q_i$ is optimized to be close to the output of the noise-robust teacher $f_{\mathrm{noise}}(x_i)$ (noise robustness), while the column sums of the surrogate label matrix $Q$ are constrained to resemble the global class distribution predicted by the imbalance-robust teacher $f_{\mathrm{imb}}$ (imbalance robustness). Thus, the alignment is enforced at both sample and class levels without requiring simultaneous optimization of both properties in a single teacher network.
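As a concrete sketch, the two alignment targets can be assembled directly from the teachers' batch outputs. The helper name `allocation_targets` and the array conventions below are illustrative assumptions, not the paper's code: the cost matrix encodes sample-level proximity to the noise-robust teacher, while the class marginal is estimated from the imbalance-robust teacher.

```python
import numpy as np

def allocation_targets(p_noise, p_imb, eps=1e-12):
    """Build both granularity targets for surrogate-label allocation.

    p_noise: (B, C) softmax outputs of the noise-robust teacher.
    p_imb:   (B, C) softmax outputs of the imbalance-robust teacher.
    Returns cost matrix M (sample level), row marginal a (uniform over
    the batch), and column marginal b (class level).
    """
    B, _ = p_noise.shape
    M = -np.log(p_noise + eps)      # proximity cost to the noise-robust teacher
    b = p_imb.mean(axis=0)          # batch-level estimate of the class distribution
    b = b / b.sum()                 # project back onto the simplex
    a = np.full(B, 1.0 / B)        # each sample carries equal mass
    return M, a, b
```

These targets are exactly the inputs the optimal transport step below consumes: the rows of the transport plan stay close to cheap (low-cost) classes per sample, while the column sums match the estimated class marginal.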
3. Sinkhorn-Optimized Surrogate Label Assignment
The construction of the surrogate label matrix $Q \in \mathbb{R}_{+}^{B \times C}$ for a mini-batch of size $B$ proceeds by solving a regularized optimal transport problem:

$$\min_{Q} \; \langle Q, M \rangle - \varepsilon H(Q)$$

subject to the constraints

$$Q \mathbf{1}_C = a, \qquad Q^{\top} \mathbf{1}_B = b,$$

where $M$ is a cost matrix whose $i$-th row is $-\log f_{\mathrm{noise}}(x_i)$, $a = \frac{1}{B}\mathbf{1}_B$ is the uniform marginal over batch samples, $b$ is the class marginal predicted by $f_{\mathrm{imb}}$, $H(Q) = -\sum_{i,c} Q_{ic}(\log Q_{ic} - 1)$ is the entropy, and $C$ is the number of classes. The solution is obtained via entropy-regularized OT and computed efficiently with the Sinkhorn-Knopp algorithm:
- Set $K = \exp(-M / \varepsilon)$.
- Iteratively update scaling vectors using $u \leftarrow a \oslash (K v)$ and $v \leftarrow b \oslash (K^{\top} u)$.
- Set $Q = \operatorname{diag}(u)\, K \operatorname{diag}(v)$.
This procedure simultaneously imposes sample-level proximity to $f_{\mathrm{noise}}(x_i)$ and global distributional consistency with the marginal $b$ predicted by $f_{\mathrm{imb}}$. The entropic regularization ensures uniqueness and numerical stability.
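The iteration above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation of standard Sinkhorn-Knopp; the regularization strength and iteration count are placeholders, not values from the paper:

```python
import numpy as np

def sinkhorn(M, a, b, eps=0.1, n_iters=500):
    """Entropy-regularized OT plan via Sinkhorn-Knopp.

    M: (B, C) cost matrix; a: (B,) sample marginal; b: (C,) class marginal.
    Returns Q = diag(u) K diag(v) with K = exp(-M / eps).
    """
    K = np.exp(-M / eps)        # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)         # rescale rows toward the sample marginal
        v = b / (K.T @ u)       # rescale columns toward the class marginal
    return u[:, None] * K * v[None, :]
```

Each row of the resulting plan sums to $1/B$; renormalizing a row by $B$ yields a probability vector usable as a surrogate label. Very small `eps` sharpens the plan but risks numerical underflow in `K`, which is why log-domain variants are often preferred in practice.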
4. Optimization Objective and Implementation
The total training loss for D-SINK over a mini-batch is:

$$\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda \, \mathcal{L}_{\mathrm{reg}},$$

where $\mathcal{L}_{\mathrm{cls}}$ is a standard classification loss (e.g., cross-entropy) and the regularization term is

$$\mathcal{L}_{\mathrm{reg}} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{c=1}^{C} Q_{ic} \log f_{\theta}(x_i)_c,$$

the cross-entropy between the target model's predictions and the Sinkhorn-allocated surrogate labels. The Sinkhorn allocation of $Q$ is performed in each batch; backpropagation is carried out with respect to the target model $f_{\theta}$'s parameters only, since $f_{\mathrm{imb}}$ and $f_{\mathrm{noise}}$ are fixed.
Algorithmically, D-SINK alternates three steps per iteration: (1) generate batch predictions from $f_{\mathrm{imb}}$, $f_{\mathrm{noise}}$, and $f_{\theta}$; (2) compute surrogate labels $Q$ through Sinkhorn optimization; (3) update $f_{\theta}$ to minimize the composite loss above.
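The composite objective can be sketched as a plain NumPy loss computation. This is a hedged illustration only: the weighting `lam`, the exact form of the regularizer, and the helper names are assumptions, and a real implementation would use an autodiff framework to update the target model's parameters:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def dsink_loss(logits, y, Q, lam=1.0, eps=1e-12):
    """Composite D-SINK objective on one mini-batch.

    logits: (B, C) target-model outputs; y: (B,) observed (possibly noisy)
    labels; Q: (B, C) surrogate labels with rows renormalized to the simplex.
    """
    p = softmax(logits)
    B = len(y)
    cls = -np.log(p[np.arange(B), y] + eps).mean()    # standard CE on given labels
    reg = -(Q * np.log(p + eps)).sum(axis=1).mean()   # CE against surrogate labels
    return cls + lam * reg
```

With uniform predictions over two classes and a one-hot surrogate label, both terms reduce to $\log 2$, which makes the relative contribution of the two granularities easy to inspect.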
5. Empirical Results and Analysis
Experiments conducted on benchmark datasets (CIFAR-10, CIFAR-100, CIFAR-N, Clothing1M, Red Mini-ImageNet) with varied imbalance and noise ratios demonstrate that D-SINK consistently outperforms:
- Long-tailed only methods (LA, LDAM, IB, RoLT)
- Noisy-label only methods (DivideMix, UNICON)
- Combinations of individually stronger baselines, and direct multi-teacher ensemble methods
D-SINK is especially effective at recovering performance in tail classes under heavy imbalance, while remaining robust to high rates of label noise. Ablation studies further reveal that the gains persist even when the auxiliary models $f_{\mathrm{imb}}$ and $f_{\mathrm{noise}}$ are relatively weak, indicating the critical role of the dual-granularity architecture and Sinkhorn-based allocation, rather than simply architectural complexity.
Visualization (e.g., t-SNE plots) demonstrates that D-SINK-trained models yield feature embeddings with enhanced class separation compared to standard cross-entropy or alternative robust learning techniques.
6. Theoretical Properties and Statistical Guarantees
The Sinkhorn regularization ensures that the surrogate label allocation is unique and stable due to the strict convexity of the entropy term and the Sinkhorn solution's differentiability. The two-level alignment (distribution and sample) avoids the conflicts inherent in single-teacher approaches; information about minority classes is retained without amplifying the effect of noisy samples.
The framework can be interpreted as a structured loss transfer mechanism where Sinkhorn's optimal transport plan mediates between localized robustness (noise filtering) and global balance (class marginal adjustment).
7. Extensions and Future Directions
The modularity of D-SINK facilitates combinations with more sophisticated auxiliary models or future advances in either imbalance or noise robustness. Extensions considered include:
- Multi-label classification, open-set recognition, and adaptation to domain shift
- Generalization beyond classification, potentially encompassing regression or structured prediction where distributional and instance-level imperfections interact
- Theoretical investigations of convergence behavior and statistical properties in high-dimensional, real-world settings
The dual-granularity principle generalizes to any setting involving heterogeneous data imperfections that operate at distinct, complementary levels of granularity.
D-SINK represents a conceptually simple yet empirically and theoretically robust paradigm for learning under joint class imbalance and label noise, efficiently integrating complementary sources of weak supervision by leveraging entropy-regularized optimal transport and the Sinkhorn algorithm for surrogate label allocation (Hong et al., 9 Oct 2025). Its technical soundness derives from the smoothness, efficiency, and modularity afforded by the Sinkhorn framework and OT theory.