
Not All Samples Are Created Equal: Deep Learning with Importance Sampling (1803.00942v3)

Published 2 Mar 2018 in cs.LG

Abstract: Deep neural network training spends most of the computation on examples that are properly handled, and could be ignored. We propose to mitigate this phenomenon with a principled importance sampling scheme that focuses computation on "informative" examples, and reduces the variance of the stochastic gradients during training. Our contribution is twofold: first, we derive a tractable upper bound to the per-sample gradient norm, and second we derive an estimator of the variance reduction achieved with importance sampling, which enables us to switch it on when it will result in an actual speedup. The resulting scheme can be used by changing a few lines of code in a standard SGD procedure, and we demonstrate experimentally, on image classification, CNN fine-tuning, and RNN training, that for a fixed wall-clock time budget, it provides a reduction of the train losses of up to an order of magnitude and a relative improvement of test errors between 5% and 17%.

Authors (2)
  1. Angelos Katharopoulos (12 papers)
  2. François Fleuret (78 papers)
Citations (476)

Summary

  • The paper presents a tractable upper bound for per-sample gradient norms that enables efficient importance sampling during training.
  • It introduces a variance reduction estimator that dynamically activates sampling, yielding up to an order of magnitude reduction in training loss and a 5% to 17% relative improvement in test error.
  • The method integrates seamlessly with standard SGD, demonstrating practical gains in training efficiency across CNNs, RNNs, and other neural network architectures.

Deep Learning with Importance Sampling

The paper "Not All Samples Are Created Equal: \ Deep Learning with Importance Sampling" by Angelos Katharopoulos and François Fleuret presents a novel approach to optimize deep learning model training through a principled importance sampling scheme. This method aims to prioritize computational resources on informative samples, leading to variance reduction in stochastic gradients and improved training efficiency.

Core Contributions

The paper outlines two primary contributions:

  1. Tractable Upper Bound Derivation: The authors introduce a computationally efficient upper bound on the per-sample gradient norm. The bound can be computed from a single forward pass, avoiding the cost of earlier methods that require the full per-sample gradient norm (see the sketch after this list).
  2. Variance Reduction Estimator: An estimator is presented for the variance reduction achieved by the importance sampling. This estimator enables the method to dynamically activate importance sampling when it is beneficial, thereby achieving better resource allocation during the training process.
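
To make the first contribution concrete, here is a minimal sketch (not the authors' reference implementation) assuming a softmax + cross-entropy classifier: in that case the gradient of the loss with respect to the last layer's pre-activations reduces to softmax(logits) - one_hot(target), so its norm, which upper-bounds the full per-sample gradient norm up to a constant factor, is available from a single forward pass. The helper names `importance_scores` and `sample_proportional` are illustrative.

```python
import torch
import torch.nn.functional as F

def importance_scores(logits, targets):
    """Upper-bound scores for per-sample gradient norms.

    For a softmax + cross-entropy classifier, the gradient of the loss
    with respect to the last layer's pre-activations is
    softmax(logits) - one_hot(target); its L2 norm upper-bounds the full
    per-sample gradient norm up to a constant factor and needs only a
    forward pass (illustrative sketch of the paper's bound).
    """
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(targets, num_classes=logits.size(1)).float()
    return (probs - one_hot).norm(dim=1)


def sample_proportional(scores, k):
    """Draw k indices with probability proportional to the scores and
    return the weights 1 / (N * p_i) that keep the reweighted gradient
    estimator unbiased."""
    probs = scores / scores.sum()
    idx = torch.multinomial(probs, k, replacement=True)
    weights = 1.0 / (len(scores) * probs[idx])
    return idx, weights
```
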

Experimental Results

The methodology was evaluated on a range of tasks, including image classification with CNNs, CNN fine-tuning, and sequence classification with RNNs. The experimental results are compelling:

  • For a fixed wall-clock time budget, the proposed approach reduces training losses by up to an order of magnitude.
  • Relative improvements in test error range from 5% to 17% compared to standard SGD with uniform sampling.
  • Importantly, these improvements require only a few changes to existing SGD code, showcasing the method's practicality and ease of integration (see the sketch below).
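
As an illustration of how few changes are involved, below is a hedged sketch of one training step under stated assumptions: `importance_scores` and `sample_proportional` are the hypothetical helpers from the earlier sketch, `loss_fn` is assumed to return per-sample losses (e.g. PyTorch's reduction='none'), and the switch condition is a crude proxy for the paper's variance-reduction estimator, with `tau` a user-chosen threshold.

```python
import torch

def train_step(model, optimizer, loss_fn, x_large, y_large, batch_size, tau=0.5):
    """One importance-sampled SGD step (illustrative sketch only).

    A larger "presampling" batch is scored with a single forward pass;
    if the score distribution is uneven enough that non-uniform sampling
    is expected to pay off (a crude stand-in for the paper's
    variance-reduction estimator), a small batch is drawn proportionally
    to the scores and the loss is reweighted to stay unbiased; otherwise
    a plain uniform step is taken.
    """
    with torch.no_grad():
        scores = importance_scores(model(x_large), y_large)

    probs = scores / scores.sum()
    # Heuristic switch: high spread of the normalized scores suggests
    # importance sampling will reduce gradient variance.
    if probs.var() > tau / len(probs) ** 2:
        idx, w = sample_proportional(scores, batch_size)
    else:
        idx = torch.randint(len(scores), (batch_size,))
        w = torch.ones(batch_size)

    optimizer.zero_grad()
    per_sample_loss = loss_fn(model(x_large[idx]), y_large[idx])
    (w * per_sample_loss).mean().backward()
    optimizer.step()
```

The paper derives the presampling size and switching criterion more carefully; the sketch only conveys the overall control flow of scoring, sampling, and reweighting.
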

Implications and Future Directions

The research indicates a significant impact on computational cost management in training deep neural networks. By focusing computation on informative samples, the method avoids spending resources on samples that have little influence on the training outcome. This advantage becomes more pronounced as training datasets continue to grow.

The paper opens multiple avenues for further research:

  • Dynamic Learning Rate Adjustments: The approach suggests potential for integrating learning rate adjustments based on gradient variance, allowing more refined control over the training dynamics.
  • Batch Size Optimization: Investigating adaptive batch sizes, where the batch size is dynamically adjusted in response to the estimated variance reduction, could offer additional computational benefits.
  • Application to Other Domains: While this paper focuses on vision and sequence data, extending the investigation to other domains, such as reinforcement learning and unsupervised learning, could further enhance its applicability and robustness.

Conclusion

This paper contributes a methodologically rigorous and practically applicable technique for enhancing the efficiency of deep learning training processes. By leveraging importance sampling based on a novel upper bound computation, the authors provide a method that is not only theoretically sound but also exhibits tangible improvements in training efficiency across various neural network architectures and tasks. This work stands out as a clear example of how theoretical innovations can translate into meaningful computational advancements in machine learning.