Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures (2304.12576v2)
Abstract: During the past decade, Deep Learning (DL) algorithms, programming systems and hardware have converged with their High Performance Computing (HPC) counterparts. Nevertheless, the programming methodology of DL and HPC systems has stagnated, relying on highly optimized, yet platform-specific and inflexible, vendor libraries. Such libraries provide close-to-peak performance on the specific platforms, kernels, and shapes to which vendors have dedicated optimization efforts, while they underperform in the remaining use cases, yielding non-portable code with performance glass jaws. This work introduces a framework to develop efficient, portable DL and HPC kernels for modern CPU architectures. We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs), a compact, versatile set of 2D-tensor operators; 2) expressing the logical loops around the TPPs in a high-level, declarative fashion, while the exact instantiation (ordering, tiling, parallelization) is determined via simple knobs. We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
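The abstract describes a two-step methodology: a small 2D-tensor micro-kernel (the TPP) serves as the computational core, and the logical loop nest around it is described declaratively, with ordering, tiling, and parallelization exposed as simple knobs. The sketch below is not the paper's actual API; it is a minimal, self-contained C++/OpenMP illustration of that split, assuming a plain small-GEMM tile as the 2D operator and using hypothetical names (`brgemm_tile`, `LoopKnobs`, `matmul`).

```cpp
// Illustrative sketch only (hypothetical names, not the framework's API):
// a "TPP-like" 2D micro-kernel plus an outer loop nest whose blocking and
// parallelization are controlled by plain-data knobs.
#include <cassert>
#include <cstdio>
#include <vector>

// Step 1: the computational core as a 2D-tensor operator.
// Here: C_tile += A_tile * B_tile for a BM x BN output tile with BK reduction.
static void brgemm_tile(const float* A, const float* B, float* C,
                        int BM, int BN, int BK, int lda, int ldb, int ldc) {
  for (int m = 0; m < BM; ++m)
    for (int k = 0; k < BK; ++k)
      for (int n = 0; n < BN; ++n)
        C[m * ldc + n] += A[m * lda + k] * B[k * ldb + n];
}

// Step 2: the logical loops around the core, instantiated via knobs.
struct LoopKnobs { int BM, BN, BK; bool parallel_m; };

static void matmul(const float* A, const float* B, float* C,
                   int M, int N, int K, const LoopKnobs& knobs) {
  assert(M % knobs.BM == 0 && N % knobs.BN == 0 && K % knobs.BK == 0);
  // Tiling (BM/BN/BK) and the parallelization choice are data, not code:
  // changing them re-instantiates the loop nest without touching the kernel.
#pragma omp parallel for if (knobs.parallel_m)
  for (int m0 = 0; m0 < M; m0 += knobs.BM)
    for (int n0 = 0; n0 < N; n0 += knobs.BN)
      for (int k0 = 0; k0 < K; k0 += knobs.BK)
        brgemm_tile(&A[m0 * K + k0], &B[k0 * N + n0], &C[m0 * N + n0],
                    knobs.BM, knobs.BN, knobs.BK, K, N, N);
}

int main() {
  const int M = 256, N = 256, K = 256;
  std::vector<float> A(M * K, 1.0f), B(K * N, 1.0f), C(M * N, 0.0f);
  matmul(A.data(), B.data(), C.data(), M, N, K, {64, 64, 64, true});
  std::printf("C[0] = %f (expected %d)\n", C[0], K);
  return 0;
}
```

Changing the knobs (for example `{32, 64, 128, false}`) re-instantiates the loop nest without modifying the micro-kernel, which is the portability argument the abstract makes: the core 2D operator is optimized once per platform, while the surrounding loops are tuned per workload shape.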
Authors: Evangelos Georganas, Dhiraj Kalamkar, Kirill Voronin, Abhisek Kundu, Antonio Noack, Hans Pabst, Alexander Breuer, Alexander Heinecke