Beyond Size and Class Balance: Alpha as a New Dataset Quality Metric for Deep Learning (2407.15724v2)

Published 22 Jul 2024 in cs.CV and cs.LG

Abstract: In deep learning, achieving high performance on image classification tasks requires diverse training sets. However, the current best practice – maximizing dataset size and class balance – does not guarantee dataset diversity. We hypothesized that, for a given model architecture, model performance can be improved by maximizing diversity more directly. To test this hypothesis, we introduce a comprehensive framework of diversity measures from ecology that generalizes familiar quantities like Shannon entropy by accounting for similarities among images. (Size and class balance emerge as special cases.) Analyzing thousands of subsets from seven medical datasets showed that the best correlates of performance were not size or class balance but $A$ ("big alpha"), a set of generalized entropy measures interpreted as the effective number of image-class pairs in the dataset, after accounting for image similarities. One of these, $A_0$, explained 67% of the variance in balanced accuracy, vs. 54% for class balance and just 39% for size. The best pair of measures was size-plus-$A_1$ (79%), which outperformed size-plus-class-balance (74%). Subsets with the largest $A_0$ performed up to 16% better than those with the largest size (median improvement, 8%). We propose maximizing $A$ as a way to improve deep learning performance in medical imaging.
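The "effective number" interpretation in the abstract follows the similarity-sensitive diversity framework from ecology (Leinster and Cobbold's order-$q$ diversity), in which Shannon entropy and richness arise as special cases. The sketch below is an illustrative implementation of that general framework under stated assumptions, not the paper's own code: the function name, the toy uniform distribution, and the identity similarity matrix are all assumptions for demonstration. The paper applies measures of this family to image-class pairs, with similarities computed between images.

```python
import numpy as np

def similarity_diversity(p, Z, q):
    """Similarity-sensitive diversity of order q.

    p : relative abundances (non-negative, summing to 1)
    Z : similarity matrix, Z[i, j] in [0, 1] with Z[i, i] = 1
    q : viewpoint parameter; q = 0 generalizes richness (a count),
        q = 1 generalizes Shannon entropy (as an exponential)
    Returns the effective number of distinct elements.
    """
    p = np.asarray(p, dtype=float)
    Zp = np.asarray(Z) @ p          # "ordinariness": similarity-weighted abundance
    mask = p > 0                    # absent elements contribute nothing
    if q == 1:
        # q -> 1 limit: exponential of the similarity-sensitive Shannon entropy
        return float(np.exp(-np.sum(p[mask] * np.log(Zp[mask]))))
    return float(np.sum(p[mask] * Zp[mask] ** (q - 1)) ** (1.0 / (1.0 - q)))

# With Z = identity (all elements fully distinct) and a uniform
# distribution over n elements, the effective number is n at every order q.
n = 4
p = np.full(n, 1.0 / n)
print(similarity_diversity(p, np.eye(n), 0))       # effective number: 4
# If instead every element is identical (Z all ones), the effective
# number collapses to 1 regardless of how many elements there are.
print(similarity_diversity(p, np.ones((n, n)), 2)) # effective number: 1
```

The key property this illustrates is the abstract's point that size alone overstates diversity: duplicated or highly similar elements inflate the raw count but not the effective number.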

