
Restoring balance: principled under/oversampling of data for optimal classification (2405.09535v1)

Published 15 May 2024 in cond-mat.dis-nn and cs.LG

Abstract: Class imbalance in real-world data poses a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under or oversampling the data depending on their abundances, are routinely proposed and tested empirically, but how they should adapt to the data statistics remains poorly understood. In this work, we determine exact analytical expressions of the generalization curves in the high-dimensional regime for linear classifiers (Support Vector Machines). We also provide a sharp prediction of the effects of under/oversampling strategies depending on class imbalance, first and second moments of the data, and the metrics of performance considered. We show that mixed strategies involving under and oversampling of data lead to performance improvement. Through numerical experiments, we show the relevance of our theoretical predictions on real datasets, on deeper architectures and with sampling strategies based on unsupervised probabilistic models.
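The mixed strategy described in the abstract — undersampling the majority class while oversampling the minority class before fitting a linear classifier — can be sketched with a toy numpy implementation. This is an illustrative sketch only, not the paper's analytical setup: the Gaussian data model, the `resample` parameters, and the Pegasos-style SVM training loop are all assumptions made here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_imbalanced(n_major=900, n_minor=100, d=20):
    # Toy data model: two Gaussian classes with opposite means (illustrative only)
    X = np.vstack([rng.normal(-0.5, 1.0, (n_major, d)),
                   rng.normal(+0.5, 1.0, (n_minor, d))])
    y = np.hstack([-np.ones(n_major), np.ones(n_minor)])
    return X, y

def resample(X, y, under_frac=0.5, over_factor=3):
    # Mixed strategy: keep a random fraction of the majority class (undersampling)
    # and duplicate the minority class (naive oversampling)
    maj = np.where(y < 0)[0]
    mino = np.where(y > 0)[0]
    keep = rng.choice(maj, size=int(len(maj) * under_frac), replace=False)
    idx = np.concatenate([keep, np.tile(mino, over_factor)])
    return X[idx], y[idx]

def train_svm(X, y, lam=0.01, epochs=50):
    # Sub-gradient descent on the soft-margin (hinge-loss) SVM objective,
    # with a Pegasos-style 1/(lam*t) step size
    w = np.zeros(X.shape[1])
    for t in range(1, epochs + 1):
        viol = y * (X @ w) < 1                      # margin violations
        grad = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / len(y)
        w -= (1.0 / (lam * t)) * grad
    return w

def balanced_accuracy(X, y, w):
    # Average of per-class accuracies: insensitive to class imbalance
    pred = np.where(X @ w >= 0, 1.0, -1.0)
    return 0.5 * ((pred[y > 0] == 1).mean() + (pred[y < 0] == -1).mean())

X, y = make_imbalanced()
Xr, yr = resample(X, y)
Xt, yt = make_imbalanced(9000, 1000)                # held-out imbalanced test set
w_plain = train_svm(X, y)
w_mixed = train_svm(Xr, yr)
print("balanced acc, no resampling:   ", balanced_accuracy(Xt, yt, w_plain))
print("balanced acc, mixed resampling:", balanced_accuracy(Xt, yt, w_mixed))
```

In this toy setting the resampled training set has a 450/300 majority/minority split instead of 900/100; the paper's contribution is precisely a sharp, analytical prediction of how such a mixing ratio should be chosen as a function of class imbalance, data moments, and the performance metric, rather than tuned empirically as here.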

Authors (4)
  1. Emanuele Loffredo
  2. Mauro Pastore
  3. Simona Cocco
  4. Rémi Monasson
Citations (4)
