What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation (2008.03703v1)

Published 9 Aug 2020 in cs.LG and stat.ML

Abstract: Deep learning algorithms are well-known to have a propensity for fitting the training data very well and often fit even outliers and mislabeled data points. Such fitting requires memorization of training data labels, a phenomenon that has attracted significant research interest but has not been given a compelling explanation so far. A recent work of Feldman (2019) proposes a theoretical explanation for this phenomenon based on a combination of two insights. First, natural image and data distributions are (informally) known to be long-tailed, that is have a significant fraction of rare and atypical examples. Second, in a simple theoretical model such memorization is necessary for achieving close-to-optimal generalization error when the data distribution is long-tailed. However, no direct empirical evidence for this explanation or even an approach for obtaining such evidence were given. In this work we design experiments to test the key ideas in this theory. The experiments require estimation of the influence of each training example on the accuracy at each test example as well as memorization values of training examples. Estimating these quantities directly is computationally prohibitive but we show that closely-related subsampled influence and memorization values can be estimated much more efficiently. Our experiments demonstrate the significant benefits of memorization for generalization on several standard benchmarks. They also provide quantitative and visually compelling evidence for the theory put forth in (Feldman, 2019).

Authors (2)
  1. Vitaly Feldman (71 papers)
  2. Chiyuan Zhang (57 papers)
Citations (402)

Summary

Analysis of Neural Network Memorization via Influence Estimation

This paper provides a rigorous empirical examination of label memorization in neural networks, focusing on why and how neural networks memorize training data. Feldman's prior theoretical work postulated that memorization is essential for minimizing generalization error on long-tailed data distributions, such as those often seen in real-world datasets. This work takes the significant step of empirically testing that hypothesis by developing efficient estimators of influence and memorization.

Key Contributions

  1. Definition and Estimation of Memorization: The paper rigorously defines label memorization as the change in the probability that a model predicts a training example's label when that example is included in, versus excluded from, the training set. Because direct leave-one-out estimation is computationally prohibitive, the authors instead estimate closely related subsampled quantities: many models are trained on random subsets of the data, and predictions are compared between models that did and did not see each example (a minimal sketch follows this list).
  2. Experimental Validation: Experiments on MNIST, CIFAR-100, and ImageNet evaluate the role of memorized examples in model accuracy. Results indicate that a significant fraction of the training data is memorized, and that removing these examples degrades test accuracy more than removing an equally sized random subset (the second sketch below outlines this comparison).
  3. Influence and Memorization Dynamics: Interestingly, the memorized examples often include atypical or mislabeled instances. Such instances significantly improve accuracy on similar atypical test examples, suggesting that memorized examples serve as representatives of rare subpopulations. This provides empirical support for the long-tail theory, which posits that memorization improves performance on datasets with a significant fraction of rare instances.
  4. Architectural Consistency: Through comparisons across different architectures such as ResNet, Inception, and DenseNet, the research demonstrates consistency in memorization patterns, suggesting that network architecture predominantly influences accuracy levels rather than memorization dynamics per se.
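
Concretely, the memorization of training example i is the probability that a model trained on the full set S predicts its label, minus the same probability for a model trained on S with example i removed; the influence of training example i on test example j is the analogous gap in accuracy on the test point. The subsampled variants replace leave-one-out retraining with a pool of T models trained on independent random subsets of S, comparing models that did and did not see each example. Below is a minimal NumPy sketch of that averaging step, assuming the per-model subset masks and predictions have already been collected; the function names and array layout are illustrative, not taken from the paper's released code.

```python
import numpy as np

def estimate_memorization(in_subset, correct_on_train):
    """Subsampled memorization estimate for every training example.

    in_subset:        (T, n) bool; [t, i] is True if training example i
                      was in the random subset used to train model t.
    correct_on_train: (T, n) bool; [t, i] is True if model t predicted
                      the (possibly noisy) label of training example i.
    Returns (n,): accuracy on example i averaged over models that saw
    it, minus accuracy averaged over models that did not.
    """
    seen = in_subset.astype(float)
    unseen = 1.0 - seen
    correct = correct_on_train.astype(float)
    acc_in = (correct * seen).sum(0) / np.maximum(seen.sum(0), 1.0)
    acc_out = (correct * unseen).sum(0) / np.maximum(unseen.sum(0), 1.0)
    return acc_in - acc_out

def estimate_influence(in_subset, correct_on_test):
    """Subsampled influence of each training example on each test example.

    correct_on_test: (T, m) bool; [t, j] is True if model t classified
                     test example j correctly.
    Returns (n, m): entry [i, j] compares test accuracy on example j
    between models trained with and without training example i.
    """
    seen = in_subset.astype(float)                        # (T, n)
    unseen = 1.0 - seen
    correct = correct_on_test.astype(float)               # (T, m)
    acc_in = seen.T @ correct / np.maximum(seen.sum(0), 1.0)[:, None]
    acc_out = unseen.T @ correct / np.maximum(unseen.sum(0), 1.0)[:, None]
    return acc_in - acc_out
```

The computational payoff is that a single pool of T models, each trained on a random fraction of the data, supplies the in/out comparison for every training example simultaneously, whereas direct leave-one-out estimation would require fresh training runs per example.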

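The removal comparison from item 2 then takes only a few extra lines. Again a hedged sketch: train_fn and eval_fn stand in for whatever training and evaluation pipeline is used, and none of this is the paper's code.

```python
def removal_experiment(mem_scores, X, y, X_test, y_test, k,
                       train_fn, eval_fn, rng):
    """Compare test accuracy after dropping the k most-memorized
    training examples versus dropping k uniformly random ones.

    mem_scores: (n,) memorization estimates from the sketch above.
    train_fn:   callable(X, y) -> model (placeholder training routine)
    eval_fn:    callable(model, X, y) -> accuracy in [0, 1]
    rng:        a numpy Generator, e.g. np.random.default_rng(0)
    """
    n = len(y)
    most_memorized = np.argsort(mem_scores)[-k:]      # top-k by estimate
    random_control = rng.choice(n, size=k, replace=False)

    def accuracy_without(drop):
        keep = np.setdiff1d(np.arange(n), drop)
        return eval_fn(train_fn(X[keep], y[keep]), X_test, y_test)

    return accuracy_without(most_memorized), accuracy_without(random_control)
```

If the long-tail account is right, the first accuracy should fall noticeably below the second, and that is the pattern the paper reports across benchmarks.
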
Implications and Future Directions

The findings have several immediate and far-reaching implications. Practically, the results imply that constraints on memorization—such as those introduced through privacy measures or model compression—could disproportionately affect model performance on underrepresented data subpopulations. Therefore, techniques limiting memorization should be cautiously employed, especially in applications sensitive to such biases.

From a theoretical standpoint, this empirical validation of the long tail theory offers a concrete explanation for the curious propensity of deep networks to memorize labels. It elucidates that memorization is not merely a by-product of high-capacity models but a functional component that enhances specific aspects of generalization.

Future work could examine the mechanics of memorization across model layers and architectures. Additionally, developing less computationally demanding ways to measure influence and memorization, without extensive training runs, would help extend such analyses to larger and more complex datasets.

In summary, this paper significantly advances our understanding of neural network behavior by connecting theoretical predictions with empirical evidence, highlighting the concrete benefits of memorization for generalization on long-tailed data distributions. These insights pave the way for training strategies that deliberately balance memorization and generalization.
