- The paper introduces a data pruning approach using EL2N and GraNd scores to identify and retain influential training examples early in the process.
- It demonstrates that early score computation can prune up to 50% of redundant data while maintaining high test accuracy across different architectures.
- The paper provides insights into training dynamics and offers a framework for noise handling and future research in efficient, noise-robust model design.
Overview of "Deep Learning on a Data Diet: Finding Important Examples Early in Training"
This essay critically examines a paper that explores data pruning in deep learning: identifying, early in training, the examples that contribute most to a model's generalization performance. Motivated by the growing computational demands of training overparameterized deep learning models on extensive datasets, the authors investigate methods for reducing training data without substantially degrading performance on test data.
Key Concepts and Methodologies
The key hypotheses in this paper are underpinned by two proposed scores: the Gradient Norm (GraNd) score and the Error L2-Norm (EL2N) score. These scores are designed to identify vital examples early in the training process by computing simple metrics related to gradient and error norms:
- GraNd Score: This score is the expected norm of the loss gradient for an example, taken over random initializations. The intuition is that examples inducing larger gradients exert a greater influence on model updates.
- EL2N Score: This score is the L2 norm of the prediction error, calculated as the difference between the model's softmax outputs and the one-hot encoded label. Its appeal lies in its simplicity and its effectiveness at finding examples that are critical for shaping the model's decision boundaries.
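These definitions can be sketched concretely. The snippet below (function and variable names are mine, not the paper's; the paper additionally averages scores over several independently initialized runs) computes per-example EL2N scores from logits and integer labels. Conveniently, for cross-entropy loss the gradient with respect to the logits is exactly `softmax(logits) - onehot(label)`, which is why EL2N can be viewed as a cheap proxy for the gradient-based GraNd score.

```python
import numpy as np

def el2n_scores(logits: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """EL2N score per example: ||softmax(logits) - onehot(label)||_2.

    logits: (n_examples, n_classes); labels: (n_examples,) integer classes.
    """
    # Numerically stable softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)

    onehot = np.eye(logits.shape[1])[labels]
    # For cross-entropy, probs - onehot is also the gradient of the loss
    # w.r.t. the logits, linking this score to the GraNd intuition.
    return np.linalg.norm(probs - onehot, axis=1)

# A confident correct prediction scores low; a confident mistake scores high.
logits = np.array([[9.0, 0.0, 0.0],   # predicted class 0, true class 0
                   [9.0, 0.0, 0.0]])  # predicted class 0, true class 1
labels = np.array([0, 1])
scores = el2n_scores(logits, labels)
assert scores[0] < scores[1]
```

Examples the model already classifies confidently and correctly thus receive near-zero scores, marking them as candidates for pruning.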
Empirical Results
The authors conduct extensive experiments using standard vision benchmarks such as CIFAR-10, CIFAR-100, and CINIC-10. There are several noteworthy empirical findings:
- Data Pruning Efficiency: By using EL2N scores calculated at an early training stage, they successfully prune substantial portions of the training dataset (e.g., up to 50% in CIFAR-10) while maintaining or even enhancing test accuracy. This demonstrates that substantial redundancy exists in the training data.
- Early Identification of Crucial Examples: Notably, EL2N scores computed within the first few training epochs match or outperform trajectory-based methods such as forgetting scores, which require tracking examples over an entire training run. This underscores how much information about example importance is already present in early training dynamics.
- Effect of Label Noise: The paper reveals that the examples with the highest EL2N scores often correspond to mislabeled data. When label noise is present, pruning some of these highest-scoring examples, rather than only the lowest-scoring ones, can improve model performance.
- Generality Across Architectures: The importance ranking obtained from these scores is not highly sensitive to specific architectures. For instance, scores computed using one architecture such as ResNet18 can be effectively applied when training another architecture, like ResNet50, indicating the robustness of the scoring methodology.
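The pruning procedure underlying these results is simple to state; a minimal sketch (names are mine, not the paper's): score every training example early in training, then retain only the highest-scoring fraction and train on that subset.

```python
import numpy as np

def prune_by_score(scores: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Return indices of examples to keep: the top `keep_fraction`
    of the dataset by score (highest scores = most important)."""
    n_keep = int(round(len(scores) * keep_fraction))
    # argsort is ascending; the last n_keep indices have the highest scores.
    return np.argsort(scores)[-n_keep:]

scores = np.array([0.1, 0.9, 0.4, 0.7, 0.05, 0.8])
kept = prune_by_score(scores, keep_fraction=0.5)  # keeps 3 of 6 examples
assert set(kept) == {1, 3, 5}
```

With `keep_fraction=0.5` this corresponds to the roughly 50% pruning level the paper reports for CIFAR-10 without loss of test accuracy.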
Implications
This work delivers multiple practical and theoretical implications:
- Efficiency in Training: Identifying and eliminating redundant data early frees computational resources for training new models, which is especially valuable in resource-constrained environments.
- Understanding Training Dynamics: The ability to connect example importance with training dynamics deepens our understanding of the roles of various data subsets, including how different data distributions inform neural network behavior and training trajectories.
- Applications in Noise Handling: The approach provides a potential framework for detecting noisy examples, thereby contributing to noise-robust learning strategies.
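One simple way to operationalize the noise-handling observation, in the spirit of the paper's experiments though not its exact procedure (names are mine), is to keep a window of scores: drop the lowest-scoring examples as redundant and also exclude the very highest-scoring ones, which under label noise are disproportionately mislabeled.

```python
import numpy as np

def keep_window(scores: np.ndarray, keep_fraction: float,
                noise_fraction: float) -> np.ndarray:
    """Keep a high-but-not-extreme window of scores: drop the lowest-scoring
    (redundant) examples and the top `noise_fraction` (likely mislabeled)."""
    order = np.argsort(scores)                # ascending by score
    n = len(scores)
    n_noise = int(round(n * noise_fraction))  # drop from the top
    n_keep = int(round(n * keep_fraction))
    top = n - n_noise
    return order[top - n_keep: top]

scores = np.array([0.1, 0.9, 0.4, 0.7, 0.05, 0.8, 0.95, 0.2])
kept = keep_window(scores, keep_fraction=0.5, noise_fraction=0.125)
# 8 examples: drop the single highest scorer (index 6), keep the next 4.
assert 6 not in kept and len(kept) == 4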
Future Directions
While the paper provides robust results, it invites several avenues for future research. Further theoretical investigation into why simple metrics like EL2N and GraNd hold such predictive power could illuminate more general properties of deep neural networks. Examining these strategies in contexts beyond image classification, such as NLP or reinforcement learning, could extend their applicability. Combining these scores with active learning paradigms could also be explored to iteratively optimize data labeling.
In conclusion, this paper signifies a meaningful step in understanding data importance and optimizing computational overhead in training deep learning models through data pruning strategies.