An In-Depth Analysis of Data Reduction Methods for Sustainable Deep Learning (2403.15150v1)

Published 22 Mar 2024 in cs.LG and cs.CV

Abstract: In recent years, Deep Learning has gained popularity for its ability to solve complex classification tasks, increasingly delivering better results thanks to the development of more accurate models, the availability of huge volumes of data and the improved computational capabilities of modern computers. However, these improvements in performance also bring efficiency problems, related to the storage of datasets and models, and to the waste of energy and time involved in both the training and inference processes. In this context, data reduction can help reduce energy consumption when training a deep learning model. In this paper, we present up to eight different methods to reduce the size of a tabular training dataset, and we develop a Python package to apply them. We also introduce a representativeness metric based on topology to measure how similar the reduced datasets are to the full training dataset. Additionally, we develop a methodology to apply these data reduction methods to image datasets for object detection tasks. Finally, we experimentally compare how these data reduction methods affect the representativeness of the reduced dataset, the energy consumption and the predictive performance of the model.
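The paper's own Python package is not shown on this page, but the following minimal sketch illustrates the general idea: reduce a tabular dataset via k-means prototype selection (keeping the real instance closest to each per-class centroid), then compare persistence diagrams of the full and reduced point clouds as a rough, topology-based representativeness proxy. The dataset, the choice of scikit-learn, ripser, and persim, and all parameter values here are illustrative assumptions, not the paper's actual implementation or metric.

```python
# Hypothetical sketch of data reduction plus a topology-based
# representativeness check. Libraries and parameters are assumptions;
# this is not the paper's package.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
from ripser import ripser          # persistent homology (scikit-tda)
from persim import bottleneck      # distance between persistence diagrams

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

def kmeans_prototypes(X, y, k=10, seed=0):
    """Reduce each class to at most k prototypes: cluster the class with
    k-means and keep the real instance nearest to each centroid."""
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        km = KMeans(n_clusters=min(k, len(idx)), n_init=10,
                    random_state=seed).fit(X[idx])
        nearest = pairwise_distances_argmin(km.cluster_centers_, X[idx])
        keep.extend(idx[nearest])
    return np.array(sorted(set(keep)))

keep = kmeans_prototypes(X, y, k=10)
X_red, y_red = X[keep], y[keep]
print(f"reduced {len(X)} -> {len(X_red)} instances")

# Compare 1-dimensional persistence diagrams of the full and reduced
# point clouds; a small bottleneck distance suggests the reduced set
# preserves the coarse topological structure of the original data.
dgm_full = ripser(X, maxdim=1)["dgms"][1]
dgm_red = ripser(X_red, maxdim=1)["dgms"][1]
print("H1 bottleneck distance:", bottleneck(dgm_full, dgm_red))
```

In this kind of pipeline, the reduced set would then be used to train the model, with energy measured by a tracker such as CodeCarbon, so that the representativeness score can be set against the observed energy savings and any loss in predictive performance.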
