Comparing Importance Sampling Based Methods for Mitigating the Effect of Class Imbalance (2402.18742v1)
Abstract: Most state-of-the-art computer vision models depend heavily on data. However, many datasets exhibit extreme class imbalance, which has been shown to degrade model performance. Among the training-time and data-generation solutions that have been explored, one family that leverages existing data is importance sampling. Much of this work focuses on the CIFAR-10 and CIFAR-100 datasets, which are not representative of the scale, composition, and complexity of current state-of-the-art datasets. In this work, we explore and compare three techniques derived from importance sampling: loss reweighting, undersampling, and oversampling. Specifically, we compare the effect of these techniques on the performance of two encoders on an impactful satellite imagery dataset, Planet's Amazon Rainforest dataset, in preparation for another work. We also perform supplemental experiments on a scene classification dataset, ADE20K, to test on a contrasting domain and clarify our results. Across both types of encoders, we find that up-weighting the loss for underrepresented classes and undersampling have a negligible effect on performance for those classes. Our results suggest that oversampling, by contrast, generally improves performance for the same underrepresented classes. Interestingly, our findings also indicate that there may be some redundancy in the Planet dataset. Our work aims to provide a foundation for further work on the Planet dataset and similar domain-specific datasets. We open-source our code at https://github.com/RichardZhu123/514-class-imbalance for future work on other satellite imagery datasets as well.
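To make the three techniques concrete, the sketch below shows one common way to wire them into a PyTorch training setup for a multi-label task such as Planet's: loss reweighting via per-class positive weights, and under-/oversampling via a weighted sampler. The helper names, the inverse-frequency weighting heuristic, and the data-loading details are illustrative assumptions, not the exact configuration used in this paper.

```python
from typing import Optional

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# Illustrative multi-label setup (not the paper's exact pipeline):
# `labels` is an (N, C) multi-hot tensor, e.g. class labels for Planet's
# Amazon Rainforest chips; `train_set` yields (image, label) pairs.

def class_frequencies(labels: torch.Tensor) -> torch.Tensor:
    """Fraction of samples in which each of the C classes appears."""
    return labels.float().mean(dim=0).clamp(min=1e-8)

def reweighted_loss(labels: torch.Tensor) -> torch.nn.Module:
    """Loss reweighting: scale the positive term for each class by a weight
    that grows as the class becomes rarer (inverse-frequency heuristic;
    other schemes, e.g. effective-number weighting, also exist)."""
    freq = class_frequencies(labels)
    pos_weight = (1.0 - freq) / freq
    return torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)

def resampled_loader(train_set, labels: torch.Tensor,
                     batch_size: int = 64,
                     num_samples: Optional[int] = None) -> DataLoader:
    """Under-/oversampling: draw training examples with probability
    proportional to the rarity of the classes they contain. Choosing
    `num_samples` below len(train_set) undersamples; drawing with
    replacement lets rare-class examples appear more than once per
    epoch, i.e. oversampling."""
    freq = class_frequencies(labels)
    per_sample_weight = (labels.float() / freq).sum(dim=1).clamp(min=1e-8)
    sampler = WeightedRandomSampler(
        weights=per_sample_weight,
        num_samples=num_samples if num_samples is not None else len(train_set),
        replacement=True,
    )
    return DataLoader(train_set, batch_size=batch_size, sampler=sampler)
```

Either path drops into a standard training loop unchanged: the reweighted loss leaves the data distribution alone and only rescales gradients, while the resampled loader changes which examples the model sees and can be paired with an ordinary unweighted loss.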
- Semantic redundancies in image-classification datasets: The 10% you don’t need. CoRR, abs/1901.11409, 2019.
- Maximizing human effort for analyzing scientific images: A case study using digitized herbarium sheets. Applications in Plant Sciences, 8(6):e11370, 2020.
- A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, 2018.
- Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, pages 77–91. PMLR, 2018.
- What is the Effect of Importance Weighting in Deep Learning? In Proceedings of the 36th International Conference on Machine Learning, pages 872–881. PMLR, 2019.
- Dataset Distillation by Matching Training Trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10718–10727, 2022.
- Generative Adversarial Networks: An Overview. IEEE Signal Processing Magazine, 35(1):53–65, 2018.
- Class-Balanced Loss Based on Effective Number of Samples, 2019. arXiv:1901.05555 [cs].
- Cost-Sensitive Learning. In Learning from Imbalanced Data Sets, pages 63–78. Springer International Publishing, Cham, 2018.
- Interpretable Explanations of Black Boxes by Meaningful Perturbation. In International Conference on Computer Vision (ICCV), 2017.
- Planet: Understanding the Amazon from Space, 2017.
- RHSBoost: Improving classification performance in imbalance data. Computational Statistics & Data Analysis, 111:1–13, 2017.
- Generative Adversarial Nets. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2014.
- LVIS: A Dataset for Large Vocabulary Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
- Big Data fraud detection using multiple medicare data sources. Journal of Big Data, 5(1):29, 2018.
- The iNaturalist Species Classification and Detection Dataset. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8769–8778, Los Alamitos, CA, USA, 2018. IEEE Computer Society.
- A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
- Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- A Survey on Contrastive Self-Supervised Learning. Technologies, 9(1), 2021.
- Improving the intra-class long-tail in 3d detection via rare example mining. In Computer Vision – ECCV 2022, pages 158–175, Cham, 2022. Springer Nature Switzerland.
- Survey on deep learning with class imbalance. Journal of Big Data, 6(1):27, 2019.
- Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America, 1(5):263–278, 1953.
- Learning From Less Data: A Unified Data Subset Selection and Active Learning Framework for Computer Vision. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1289–1299, 2019.
- Undoing the damage of dataset bias. In Computer Vision – ECCV 2012, pages 158–171, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
- Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press, 2009.
- M. Kubat. Addressing the Curse of Imbalanced Training Sets: One-Sided Selection. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML), 1997.
- Machine Learning for the Detection of Oil Spills in Satellite Radar Images. Machine Learning, 30(2):195–215, 1998.
- Adversarial examples in the physical world, 2017.
- Dong-Hyun Lee. Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks. In ICML 2013 Workshop: Challenges in Representation Learning (WREPL), 2013.
- Exploratory Undersampling for Class-Imbalance Learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):539–550, 2009.
- Large-Scale Long-Tailed Recognition in an Open World. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognition, 91:216–231, 2019.
- Understanding Deep Image Representations by Inverting Them. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- Dataset Distillation with Infinitely Wide Convolutional Networks. In Advances in Neural Information Processing Systems, pages 5186–5198. Curran Associates, Inc., 2021.
- Long-Tail Recognition via Compositional Knowledge Transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6939–6948, 2022.
- Class imbalance revisited: a new experimental setup to assess the performance of treatment methods. Knowledge and Information Systems, 45(1):247–270, 2015.
- Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- Data Distillation: Towards Omni-Supervised Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- Data mining for improved cardiac care. ACM SIGKDD Explorations Newsletter, 8(1):3–10, 2006.
- Nikita Rom. planets_dataset, 2019.
- Simulation and the Monte Carlo Method. Wiley Publishing, 3rd edition, 2016.
- ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- Data Distillation: A Survey. Transactions on Machine Learning Research, 2023. Survey Certification.
- A survey on generative adversarial networks for imbalance problems in computer vision tasks. Journal of Big Data, 8(1):27, 2021.
- Evaluating the classification of images from geoscience papers using small data. Applied Computing and Geosciences, 5:100018, 2020.
- Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In International Conference on Computer Vision (ICCV), 2017.
- One Explanation is Not Enough: Structured Attention Graphs for Image Classification. In Neural Information Processing Systems (NeurIPS), 2021.
- Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting, 2019. arXiv:1902.07379 [cs, stat].
- Deep inside convolutional networks: Visualising image classification models and saliency maps. In International Conference on Learning Representations (ICLR) Workshops, 2014.
- Improving Academic Performance Prediction by Dealing with Class Imbalance. In 2009 Ninth International Conference on Intelligent Systems Design and Applications, pages 878–883, 2009. ISSN: 2164-7151.
- A Deeper Look at Dataset Bias, pages 37–55. Springer International Publishing, Cham, 2017.
- Overcoming Bias in Pretrained Models by Manipulating the Finetuning Dataset. 2023.
- Annotation-efficient deep learning for automatic medical image segmentation. Nature Communications, 12(1):5915, 2021.
- Dataset Distillation, 2020.
- Effective detection of sophisticated online banking fraud on extremely imbalanced data. World Wide Web, 16(4):449–475, 2013.
- A survey on deep semi-supervised learning. IEEE Transactions on Knowledge and Data Engineering, 35(9):8934–8954, 2023.
- Cost-sensitive learning by cost-proportionate example weighting. In Third IEEE International Conference on Data Mining, pages 435–442, 2003.
- S4L: Self-Supervised Semi-Supervised Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
- Scene Parsing through ADE20K Dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5122–5130, 2017. ISSN: 1063-6919.
- Places: A 10 Million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2018.
- Capturing Long-Tail Distributions of Object Subcategories. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 915–922, 2014.