Relearning Forgotten Knowledge: On Forgetting, Overfit and Training-Free Ensembles of DNNs (2310.11094v2)
Abstract: The infrequent occurrence of overfit in deep neural networks is perplexing. On the one hand, theory predicts that as models get larger they should eventually become too specialized for a specific training set, with an ensuing decrease in generalization. On the other hand, empirical results in image classification indicate that increasing the training time of deep models or using bigger models almost never hurts generalization. Is it because the way we measure overfit is too limited? Here, we introduce a novel score for quantifying overfit, which monitors the forgetting rate of deep models on validation data. Presumably, this score indicates that even while generalization improves overall, there are certain regions of the data space where it deteriorates. When thus measured, we show that overfit can occur with and without a decrease in validation accuracy, and may be more common than previously appreciated. This observation may help to clarify the aforementioned confusing picture. We use our observations to construct a new ensemble method, based solely on the training history of a single network, which provides a significant improvement in performance without any additional cost in training time. An extensive empirical evaluation with modern deep models shows our method's utility on multiple datasets, neural network architectures, and training schemes, both when training from scratch and when using pre-trained networks in transfer learning. Notably, our method outperforms comparable methods while being easier to implement and use, and further improves the performance of competitive networks on ImageNet by 1%.
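The abstract describes two components without giving their exact definitions, so the sketch below is only an illustration under stated assumptions: the forgetting-rate score is taken to mean the fraction of validation examples a model classified correctly at an earlier checkpoint but misclassifies at a later one, and the training-free ensemble is taken to average the softmax predictions of checkpoints saved during a single training run. All function names and signatures here are hypothetical, not the paper's API.

```python
# Hedged sketch, not the paper's exact method: (1) a forgetting-rate score over
# validation data, counting samples that flip from correct to incorrect between
# two checkpoints; (2) a training-free ensemble that averages the softmax
# predictions of checkpoints collected during a single training run.
import torch


@torch.no_grad()
def correct_mask(model: torch.nn.Module, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Boolean mask over a validation batch: True where the model is correct."""
    return model(x).argmax(dim=-1) == y


@torch.no_grad()
def forgetting_rate(prev_correct: torch.Tensor, curr_correct: torch.Tensor) -> float:
    """Fraction of validation samples that were correct before and are wrong now."""
    forgotten = prev_correct & ~curr_correct
    return forgotten.float().mean().item()


@torch.no_grad()
def checkpoint_ensemble_probs(checkpoints, x: torch.Tensor) -> torch.Tensor:
    """Average softmax outputs of checkpoints saved along one training trajectory."""
    probs = torch.stack([torch.softmax(m(x), dim=-1) for m in checkpoints])
    return probs.mean(dim=0)
```

A plausible usage pattern under these assumptions: after each epoch, compute `correct_mask` on the validation set, compare it with the previous epoch via `forgetting_rate` to flag regions where generalization deteriorates, and keep the corresponding checkpoints so that `checkpoint_ensemble_probs` can combine them at test time with no extra training cost.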