Dual-Granularity Sinkhorn Distillation (D-SINK)

Updated 11 October 2025
  • Dual-Granularity Sinkhorn Distillation (D-SINK) is a robust learning method that addresses class imbalance and label noise by combining insights from two specialized auxiliary models.
  • It leverages a dual-granularity framework where imbalance-robust and noise-robust teachers guide surrogate label allocation via entropy-regularized optimal transport using the Sinkhorn algorithm.
  • Empirical results on benchmark datasets show that D-SINK significantly improves performance on minority classes and maintains resilience against high rates of label noise.

Dual-granularity Sinkhorn Distillation (D-SINK) is a methodology for enhancing robustness and representational fidelity in learning problems characterized by heterogeneous data imperfections, notably the co-occurrence of class imbalance and label noise. The approach distills complementary information from specialized auxiliary models operating at distinct granularity levels (distributional/class-level and sample-level) via an optimal transport framework leveraging the Sinkhorn algorithm. This article presents the theoretical foundations, optimization strategies, empirical results, and implications for robust learning as described in (Hong et al., 9 Oct 2025), along with context from related optimal transport and Sinkhorn theory.

1. Motivation and Problem Setting

The challenge addressed by D-SINK arises from the simultaneous presence of class imbalance (where some classes have far fewer samples, resulting in non-uniform class marginal distributions) and label noise (where individual instances have unreliable or corrupted labels). Existing algorithms targeting either issue (e.g., reweighting methods for imbalance, loss correction or sample-selection for noise) can inadvertently undermine performance when naively combined, as strategies to strengthen tail classes may magnify the impact of noisy labels and vice versa.

D-SINK exploits the observation that these two issues pertain to fundamentally different data granularities: class imbalance is a property of the overall label distribution, while label noise is an instance-level concern. The method proposes to combine the strengths of two weak auxiliary models—one optimized for class imbalance ("imbalance-robust teacher") and another for label noise ("noise-robust teacher")—by distilling their insights into the target model through a surrogate label allocation process optimized with Sinkhorn-based optimal transport.

2. Dual-Granularity Framework Architecture

D-SINK maintains three components:

  • Imbalance-robust auxiliary model $f_L$: trained with state-of-the-art methods for long-tailed learning, supplying reliable class-marginal predictions.
  • Noise-robust auxiliary model $f_N$: trained with robust learning under noisy labels, providing trustworthy sample-level predictions.
  • Target model $f$: learns from dynamically constructed surrogate labels $\{q_i\}_{i=1}^N$ (probability vectors on the simplex) generated in each training batch.

The surrogate label allocation mechanism instantiates the "dual granularity": for each instance $i$, $q_i$ is optimized to be close to the output of $f_N(x_i)$ (noise robustness), while the overall sum $\sum_{i=1}^N q_i$ is constrained to resemble the global class distribution predicted by $f_L$ (imbalance robustness). Thus, the alignment is enforced at both the sample and class levels without requiring simultaneous optimization of both properties in a single teacher network.

3. Sinkhorn-Optimized Surrogate Label Assignment

The construction of the label matrix $Q = [q_1, \ldots, q_N]^\top$ proceeds by solving a regularized optimal transport problem:

\min_Q \, \langle Q, P \rangle + 2\sum_{i=1}^N q_i \cdot \log q_i

subject to the constraints

Q^\top \cdot \mathbf{1}_N = \sum_{i=1}^N f_L(x_i), \quad Q \cdot \mathbf{1}_C = \mathbf{1}_N

where $P$ is a cost matrix whose $i$th row is $-\log f_N(x_i) - \log f(x_i)$, and $C$ is the number of classes. The solution $Q$ is obtained via entropy-regularized OT and computed efficiently with the Sinkhorn-Knopp algorithm:

  • Set $M = \exp(-P/2)$
  • Iteratively update the scaling vectors $u, v$ against the marginals normalized by the batch size $N_B$: $v \leftarrow \big(\tfrac{1}{N_B}\sum_i f_L(x_i)\big) / (M^\top u)$, $\; u \leftarrow \tfrac{1}{N_B}\mathbf{1}_N / (M v)$
  • Set $Q = N_B \cdot \operatorname{diag}(u)\, M \operatorname{diag}(v)$, so that each row of $Q$ lies on the simplex

This procedure simultaneously imposes sample-level proximity to $f_N(x_i)$ and global distributional consistency with the marginal predicted by $f_L$. The entropic regularization ensures uniqueness and numerical stability.
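The allocation step above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the function name, fixed iteration count, and numerical-stability constants are assumptions.

```python
import numpy as np

def sinkhorn_surrogate_labels(f_probs, f_N_probs, f_L_probs, eps=2.0, n_iter=100):
    """Sketch of D-SINK's Sinkhorn-based surrogate label allocation.

    All inputs are (N_B, C) arrays of predicted class probabilities:
    f_probs from the target model, f_N_probs from the noise-robust
    teacher, f_L_probs from the imbalance-robust teacher.
    Returns Q of shape (N_B, C): rows sum to 1, column sums match
    the class marginal predicted by f_L.
    """
    N_B, C = f_probs.shape
    # Cost matrix: i-th row is -log f_N(x_i) - log f(x_i)
    P = -np.log(f_N_probs + 1e-12) - np.log(f_probs + 1e-12)
    M = np.exp(-P / eps)  # Gibbs kernel; eps = 2 matches M = exp(-P/2)

    # Marginals normalized to total mass 1 for the scaling iterations:
    # rows (samples) are uniform, columns follow f_L's class marginal.
    row_marg = np.full(N_B, 1.0 / N_B)
    col_marg = f_L_probs.sum(axis=0) / N_B

    u = np.ones(N_B)
    for _ in range(n_iter):
        v = col_marg / (M.T @ u)  # enforce class-level (distributional) marginal
        u = row_marg / (M @ v)    # enforce sample-level simplex constraint

    # Rescale by batch size so every row of Q sums to one.
    return N_B * (u[:, None] * M * v[None, :])
```

Because the final update is applied to $u$, the row constraint holds exactly, while the column constraint is satisfied up to the Sinkhorn convergence tolerance.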

4. Optimization Objective and Implementation

The total training loss for D-SINK over a mini-batch is:

\mathcal{L}_{\text{Overall}} = \mathcal{L}_{\text{Base}} + \alpha\, \mathcal{L}_{\text{D-SINK}}

where $\mathcal{L}_{\text{Base}}$ is a standard classification loss (e.g., cross-entropy) and the regularization term is

\mathcal{L}_{\text{D-SINK}} = \frac{1}{N} \sum_{i=1}^{N} \left[ D_{\text{KL}}\big(q_i \,\|\, f_N(x_i)\big) + D_{\text{KL}}\big(q_i \,\|\, f(x_i)\big) \right]

The Sinkhorn allocation of $Q$ is performed in each batch; backpropagation is carried out with respect to $f$'s parameters only, since $f_L$ and $f_N$ are fixed.

Algorithmically, D-SINK alternates three steps per iteration: (1) generate batch predictions from $f$, $f_L$, and $f_N$; (2) compute surrogate labels $Q$ through Sinkhorn optimization; (3) update $f$ to minimize the composite loss above.
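Step (3) amounts to evaluating the composite objective for one batch. The NumPy sketch below shows only the forward computation, under the assumption that surrogate labels and all model outputs are available as probability arrays; the function names are illustrative, not from the paper.

```python
import numpy as np

def kl(q, p, eps=1e-12):
    """Batch-averaged KL(q || p) between rows of two probability arrays."""
    return np.mean(np.sum(q * (np.log(q + eps) - np.log(p + eps)), axis=1))

def d_sink_objective(q, f_probs, f_N_probs, labels, alpha=1.0):
    """Composite D-SINK loss for one batch (forward pass only).

    q         : (N, C) surrogate labels from the Sinkhorn allocation
    f_probs   : (N, C) target-model predictions
    f_N_probs : (N, C) noise-robust teacher predictions
    labels    : (N,) observed (possibly noisy) integer class labels
    """
    N = labels.shape[0]
    # L_Base: standard cross-entropy on the observed labels
    l_base = -np.mean(np.log(f_probs[np.arange(N), labels] + 1e-12))
    # L_D-SINK: anchor the surrogate labels to the noise-robust teacher
    # while pulling the target model toward the surrogate labels
    l_dsink = kl(q, f_N_probs) + kl(q, f_probs)
    return l_base + alpha * l_dsink
```

In an actual training loop this forward pass would be written in an autodiff framework so that gradients flow to $f$'s parameters only, with the two teachers held fixed.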

5. Empirical Results and Analysis

Experiments conducted on benchmark datasets (CIFAR-10, CIFAR-100, CIFAR-N, Clothing1M, Red Mini-ImageNet) with varied imbalance and noise ratios demonstrate that D-SINK consistently outperforms:

  • Long-tailed only methods (LA, LDAM, IB, RoLT)
  • Noisy-label only methods (DivideMix, UNICON)
  • Combinations of individually stronger baselines and direct multi-teacher ensemble methods

D-SINK is especially effective at recovering performance in tail classes under heavy imbalance, while remaining robust to high rates of label noise. Ablation studies further reveal that the gains persist even when the auxiliary models $f_L$ and $f_N$ are relatively weak, indicating the critical role of the dual-granularity architecture and Sinkhorn-based allocation, rather than simply architectural complexity.

Visualization (e.g., t-SNE plots) demonstrates that D-SINK-trained models yield feature embeddings with enhanced class separation compared to standard cross-entropy or alternative robust learning techniques.

6. Theoretical Properties and Statistical Guarantees

The Sinkhorn regularization ensures that the surrogate label allocation is unique and stable due to the strict convexity of the entropy term and the Sinkhorn solution's differentiability. The two-level alignment (distribution and sample) avoids the conflicts inherent in single-teacher approaches; information about minority classes is retained without amplifying the effect of noisy samples.

The framework can be interpreted as a structured loss transfer mechanism where Sinkhorn's optimal transport plan mediates between localized robustness (noise filtering) and global balance (class marginal adjustment).

7. Extensions and Future Directions

The modularity of D-SINK facilitates combinations with more sophisticated auxiliary models or future advances in either imbalance or noise robustness. Extensions considered include:

  • Multi-label classification, open-set recognition, and adaptation to domain shift
  • Generalization beyond classification, potentially encompassing regression or structured prediction where distributional and instance-level imperfections interact
  • Theoretical investigations of convergence behavior and statistical properties in high-dimensional, real-world settings

The dual-granularity principle generalizes to any setting in which heterogeneous data imperfections operate at distinct, largely orthogonal levels of granularity.


D-SINK represents a conceptually simple yet empirically and theoretically robust paradigm for learning under joint class imbalance and label noise, efficiently integrating complementary sources of weak supervision by leveraging entropy-regularized optimal transport and Sinkhorn algorithms for surrogate label allocation (Hong et al., 9 Oct 2025). Its technical soundness derives from the high-order smoothness, efficiency, and modularity afforded by the Sinkhorn framework and OT theory.
