How much data do you need? Part 2: Predicting DL class specific training dataset sizes (2403.06311v1)

Published 10 Mar 2024 in cs.LG and stat.ML

Abstract: This paper targets the question of predicting machine learning classification model performance when taking into account the number of training examples per class, and not just the overall number of training examples. This leads to a combinatorial question: which combinations of numbers of training examples per class should be considered, given a fixed overall training dataset size? To answer this question, an algorithm is suggested that is motivated by special cases of space-filling design of experiments. The resulting data are modeled using power-law curves and similar models, extended in the manner of generalized linear models, i.e., by replacing the overall training dataset size with a parametrized linear combination of the numbers of training examples per label class. The proposed algorithm has been applied to the CIFAR10 and EMNIST datasets.
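
As a rough illustration of the modelling idea described in the abstract, the sketch below fits a power-law learning curve in which the overall training dataset size is replaced by a parametrized linear combination of per-class example counts. This is a minimal sketch under assumptions, not the authors' implementation: the function name, the synthetic per-class counts and accuracies, and the choice of scipy.optimize.curve_fit are all illustrative.

```python
# Minimal sketch (not the paper's code): fit  accuracy = 1 - a * N_eff**(-b),
# where N_eff = sum_k w_k * n_k is a parametrized linear combination of the
# per-class training example counts n_k, in the spirit of a generalized
# linear model extension of a plain power-law learning curve.
import numpy as np
from scipy.optimize import curve_fit

N_CLASSES = 10  # e.g. CIFAR-10

def class_weighted_power_law(counts, a, b, *w):
    """counts: (runs, N_CLASSES) array of training examples per class.
    a, b: power-law amplitude and exponent; w: one weight per class."""
    n_eff = counts @ np.asarray(w)        # weighted "effective" dataset size
    return 1.0 - a * np.power(n_eff, -b)

# Hypothetical measurements: each row is one training run with its own
# per-class composition (the paper's algorithm would choose these points).
rng = np.random.default_rng(0)
counts = rng.integers(50, 5000, size=(30, N_CLASSES)).astype(float)
acc = 1.0 - 2.5 * counts.sum(axis=1) ** -0.4          # synthetic ground truth
acc += rng.normal(0.0, 0.005, size=acc.shape)         # measurement noise

# Jointly fit amplitude, exponent, and the per-class weights.
p0 = [1.0, 0.5] + [1.0] * N_CLASSES
popt, _ = curve_fit(class_weighted_power_law, counts, acc,
                    p0=p0, bounds=(1e-6, np.inf))     # keep N_eff positive
print("a = %.3f, b = %.3f" % (popt[0], popt[1]))
print("class weights:", np.round(popt[2:], 3))
```

Note that in this formulation the overall scale of the weights trades off against the amplitude a, so in practice one would typically fix one class weight (e.g. to 1) or normalize the weights before interpreting them.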
