Beyond Size and Class Balance: Alpha as a New Dataset Quality Metric for Deep Learning (2407.15724v2)
Abstract: In deep learning, achieving high performance on image classification tasks requires diverse training sets. However, the current best practice, maximizing dataset size and class balance, does not guarantee dataset diversity. We hypothesized that, for a given model architecture, model performance can be improved by maximizing diversity more directly. To test this hypothesis, we introduce a comprehensive framework of diversity measures from ecology that generalizes familiar quantities like Shannon entropy by accounting for similarities among images. (Size and class balance emerge as special cases.) Analyzing thousands of subsets from seven medical datasets showed that the best correlates of performance were not size or class balance but $A$ ("big alpha"), a set of generalized entropy measures interpreted as the effective number of image-class pairs in the dataset, after accounting for image similarities. One of these, $A_0$, explained 67% of the variance in balanced accuracy, vs. 54% for class balance and just 39% for size. The best pair of measures was size-plus-$A_1$ (79%), which outperformed size-plus-class-balance (74%). Subsets with the largest $A_0$ performed up to 16% better than those with the largest size (median improvement, 8%). We propose maximizing $A$ as a way to improve deep learning performance in medical imaging.
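To make "effective number after accounting for similarities" concrete, here is a minimal sketch of the similarity-sensitive diversity of order $q$ from the ecology literature (the Leinster and Cobbold form). It is an illustration under stated assumptions, not the paper's definition of $A$, which is computed over image-class pairs. The function name, the toy frequency vectors, and the toy similarity matrices are hypothetical.

```python
import numpy as np


def similarity_sensitive_diversity(p, Z, q):
    """Effective number of distinct items, of order q (illustrative sketch).

    p : relative frequencies of the items (non-negative, summing to 1).
    Z : similarity matrix with entries in [0, 1] and ones on the diagonal.
    q : viewpoint parameter; larger q gives less weight to rare items.

    Assumption: this is the general Leinster-Cobbold form, not the paper's
    exact A measure.
    """
    p = np.asarray(p, dtype=float)
    Z = np.asarray(Z, dtype=float)

    # Items with zero frequency contribute nothing, so drop them.
    keep = p > 0
    p = p[keep]
    Z = Z[np.ix_(keep, keep)]

    # "Ordinariness" of each item: how much of the dataset resembles it.
    ordinariness = Z @ p

    if np.isclose(q, 1.0):
        # The q -> 1 limit is the exponential of a similarity-aware Shannon entropy.
        return float(np.exp(-np.sum(p * np.log(ordinariness))))
    return float(np.sum(p * ordinariness ** (q - 1.0)) ** (1.0 / (1.0 - q)))


# Toy example: four items, equally frequent.
uniform = [0.25, 0.25, 0.25, 0.25]
no_similarity = np.eye(4)        # every item is completely distinct
all_identical = np.ones((4, 4))  # every item looks the same

d_distinct = similarity_sensitive_diversity(uniform, no_similarity, q=0)
d_identical = similarity_sensitive_diversity(uniform, all_identical, q=0)
```

With the identity matrix the four items count as four effective items. With the all-ones matrix they collapse to one, which shows how a large dataset of near-duplicate images can still have low diversity. With no similarity and unequal frequencies, order 1 gives the exponential of the Shannon entropy, which drops below the item count as the frequencies become less even.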