Deep Learning Models on CPUs: A Methodology for Efficient Training (2206.10034v2)

Published 20 Jun 2022 in cs.LG and cs.AI

Abstract: GPUs have been favored for training deep learning models due to their highly parallelized architecture. As a result, most studies on training optimization focus on GPUs. There is often a trade-off, however, between cost and efficiency when choosing the proper hardware for training. In particular, CPU servers would be beneficial if training on CPUs were more efficient, as they incur lower hardware upgrade costs and make better use of existing infrastructure. This paper makes several contributions to research on training deep learning models using CPUs. First, it presents a method for optimizing the training of deep learning models on Intel CPUs and a toolkit called ProfileDNN, which we developed to improve performance profiling. Second, we describe a generic training optimization method that guides our workflow and explore several case studies where we identified performance issues and then optimized the Intel Extension for PyTorch, resulting in an overall 2x training performance increase for the RetinaNet-ResNext50 model. Third, we show how to leverage the visualization capabilities of ProfileDNN, which enabled us to pinpoint bottlenecks and create a custom focal loss kernel that was two times faster than the official reference PyTorch implementation.
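To ground the focal-loss contribution, the sketch below restates the reference-style sigmoid focal loss (Lin et al., 2017) in plain PyTorch, i.e. the kind of composed-operator baseline a custom CPU kernel would replace. It is not the paper's kernel; the function name, default hyperparameters, and toy tensor shapes in the usage example are illustrative assumptions.

```python
# Minimal sketch (not the paper's kernel): reference-style sigmoid focal loss,
# following Lin et al., "Focal Loss for Dense Object Detection" (2017).
import torch
import torch.nn.functional as F


def sigmoid_focal_loss(logits: torch.Tensor,
                       targets: torch.Tensor,
                       alpha: float = 0.25,
                       gamma: float = 2.0,
                       reduction: str = "sum") -> torch.Tensor:
    """Focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p = torch.sigmoid(logits)                                  # predicted probabilities
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1.0 - p) * (1.0 - targets)            # probability of the true class
    loss = ce * (1.0 - p_t) ** gamma                           # down-weight easy examples
    if alpha >= 0:
        alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
        loss = alpha_t * loss
    if reduction == "mean":
        return loss.mean()
    if reduction == "sum":
        return loss.sum()
    return loss


# Illustrative usage with toy tensors (shapes are assumptions, not from the paper).
logits = torch.randn(8, 80)                    # e.g. 8 anchors x 80 classes
targets = torch.randint(0, 2, (8, 80)).float()
print(sigmoid_focal_loss(logits, targets))
```

A fused kernel would typically collapse this chain of elementwise operations (sigmoid, binary cross-entropy, power, and scaling) into a single pass over memory, which is presumably where a hand-written CPU implementation can gain over the composed PyTorch operators.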

