L_DMI: An Information-theoretic Noise-robust Loss Function (1909.03388v2)

Published 8 Sep 2019 in cs.LG, cs.CV, and stat.ML

Abstract: Accurately annotating large-scale datasets is notoriously expensive in both time and money. Although acquiring low-quality annotations can be much cheaper, using such datasets without particular treatment often badly damages the performance of trained models. Various methods have been proposed for learning with noisy labels. However, most methods only handle limited kinds of noise patterns, require auxiliary information or steps (e.g., knowing or estimating the noise transition matrix), or lack theoretical justification. In this paper, we propose a novel information-theoretic loss function, $\mathcal{L}_{DMI}$, for training deep neural networks robust to label noise. The core of $\mathcal{L}_{DMI}$ is a generalized version of mutual information, termed Determinant based Mutual Information (DMI), which is not only information-monotone but also relatively invariant. \emph{To the best of our knowledge, $\mathcal{L}_{DMI}$ is the first loss function that is provably robust to instance-independent label noise, regardless of noise pattern, and it can be applied to any existing classification neural networks straightforwardly without any auxiliary information}. In addition to theoretical justification, we also empirically show that using $\mathcal{L}_{DMI}$ outperforms all other counterparts in the classification task on both image and natural language datasets, including Fashion-MNIST, CIFAR-10, Dogs vs. Cats, and MR, with a variety of synthesized noise patterns and noise amounts, as well as on the real-world dataset Clothing1M. Codes are available at https://github.com/Newbeeer/L_DMI .

Citations (54)

Summary

  • The paper introduces DMI, an extended mutual information measure that effectively mitigates the impact of label noise.
  • It proposes the L_DMI loss function, theoretically validated to ensure the ground truth classifier minimizes loss even under severe noise.
  • Experiments on benchmarks like Fashion-MNIST and CIFAR-10 confirm L_DMI's resilience against non-diagonally dominant noise, ensuring stable performance.

Overview of the Paper on $\mathcal{L}_{DMI}$: A Novel Information-Theoretic Loss Function for Robust Training

This paper introduces a novel information-theoretic loss function, $\mathcal{L}_{DMI}$, for training deep neural networks that are robust to label noise. The work addresses the challenge of training models on large-scale datasets where accurate annotations are costly and low-quality labels can degrade model performance. Traditional methods for dealing with noisy labels often fall short because they handle only limited noise patterns, require auxiliary information (such as a known or estimated noise transition matrix), or lack theoretical grounding.

Key Contributions

  1. Introduction of DMI: The core innovation is the Determinant based Mutual Information (DMI), a generalized mutual information measure defined as the absolute value of the determinant of the joint distribution matrix of two variables. DMI extends Shannon's mutual information by being both information-monotone and relatively invariant, providing a robust measurement of correlation between variables even under noise.
  2. Novel Loss Function $\mathcal{L}_{DMI}$: Building upon DMI, the authors propose $\mathcal{L}_{DMI}$, a loss function that handles instance-independent label noise without auxiliary data (see the sketch after this list). By leveraging the properties of DMI, $\mathcal{L}_{DMI}$ maintains robustness across various noise patterns, including diagonally non-dominant noise.
  3. Theoretical Validation: The paper includes rigorous theoretical analysis demonstrating that $\mathcal{L}_{DMI}$ is both legal (the ground truth classifier attains the lowest expected loss) and noise-robust (training on noisy labels yields the same optimal classifier as training on clean labels).
  4. Empirical Evidence: Comprehensive experiments on several benchmarks, such as Fashion-MNIST, CIFAR-10, and the real-world dataset Clothing1M, demonstrate the effectiveness of $\mathcal{L}_{DMI}$ against other methods such as cross-entropy and GCE, especially in settings with severe label noise.

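As described above, the loss reduces to the negative log of the absolute determinant of the empirical joint distribution between the classifier's output and the observed (possibly noisy) labels over a batch. The PyTorch sketch below illustrates this form; the function name, the softmax/one-hot bookkeeping, and the clamping constant are illustrative assumptions, not the authors' exact implementation (which is available at the linked repository).

```python
import torch
import torch.nn.functional as F

def dmi_loss(logits: torch.Tensor, labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """DMI-based loss: -log |det(U)|, where U is the empirical joint
    distribution matrix between predicted class probabilities and the
    observed labels in the current batch."""
    probs = F.softmax(logits, dim=1)                   # (N, C) predicted distribution
    one_hot = F.one_hot(labels, num_classes).float()   # (N, C) observed labels
    joint = probs.t() @ one_hot / labels.size(0)       # (C, C) empirical joint matrix
    det = torch.det(joint)
    # |det| can underflow toward zero early in training; clamp for stability.
    return -torch.log(det.abs().clamp_min(1e-6))
```
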
Numerical Results and Observations

The experiments show that $\mathcal{L}_{DMI}$ consistently outperforms or competes with state-of-the-art methods across various levels and types of label noise. Key observations include:

  • In scenarios with non-diagonally dominant noise, where distance-based losses tend to bias the model toward the most frequent (incorrect) noisy label, $\mathcal{L}_{DMI}$ maintains superior accuracy (the snippet after this list illustrates such a noise pattern).
  • $\mathcal{L}_{DMI}$ shows minimal performance degradation as noise levels increase, highlighting its robustness.

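Non-diagonally dominant noise corresponds to a class-conditional transition matrix in which some off-diagonal entry exceeds the diagonal, so the majority of observed labels for a class can be wrong. The snippet below shows one way to synthesize such instance-independent noise; the 3-class matrix is a made-up illustration, not one of the paper's experimental settings.

```python
import numpy as np

def corrupt_labels(clean_labels, transition, seed=0):
    """Sample noisy labels: transition[i, j] = P(observed = j | true = i)."""
    rng = np.random.default_rng(seed)
    num_classes = transition.shape[0]
    return np.array([rng.choice(num_classes, p=transition[y]) for y in clean_labels])

# Hypothetical non-diagonally dominant noise: for true class 0, the off-diagonal
# entry 0.6 exceeds the diagonal 0.3, so most class-0 samples end up labeled as 1.
T = np.array([[0.3, 0.6, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

clean = np.random.default_rng(1).integers(0, 3, size=1000)
noisy = corrupt_labels(clean, T)
```
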
Implications and Future Directions

The implications of introducing $\mathcal{L}_{DMI}$ are profound for both theoretical research and practical deployment of machine learning systems in noisy environments. Practically, it can be immediately applied to existing architectures without additional data or noise estimation, streamlining processes where noise-robustness is critical, such as in medical imaging or crowdsourced datasets.
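As a rough illustration of that drop-in claim, the hedged sketch below swaps cross-entropy for the DMI-based loss in a generic training loop; `model`, `train_loader`, `optimizer`, and `num_classes` are assumed to come from an existing pipeline, and `dmi_loss` refers to the earlier sketch (the authors' full training procedure may differ).

```python
for images, noisy_labels in train_loader:
    optimizer.zero_grad()
    logits = model(images)
    # In place of, e.g., F.cross_entropy(logits, noisy_labels)
    loss = dmi_loss(logits, noisy_labels, num_classes)
    loss.backward()
    optimizer.step()
```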

Theoretically, this work opens avenues to explore more general formulations of mutual information for other types of noise and dependencies beyond the instance-independent assumption. Future research could also investigate training procedures aligned with DMI's theoretical underpinnings to further enhance model performance on clean data.

Overall, the paper makes significant strides toward robust and theoretically justified approaches in training models under practical constraints, marking a meaningful addition to the field of machine learning both in theory and application.