NonGEMM Bench: Understanding the Performance Horizon of the Latest ML Workloads with NonGEMM Workloads (2404.11788v3)
Abstract: Machine Learning (ML) operators are the building blocks used to design ML models for various target applications. GEneral Matrix Multiplication (GEMM) operators form the backbone of ML models and are notorious for being computationally expensive, requiring billions of multiply-and-accumulate operations. Consequently, significant effort has gone into studying and optimizing GEMM operators to speed up the execution of ML models, and GPUs and accelerators are widely deployed to accelerate ML workloads by optimizing GEMM execution. Nonetheless, the performance of NonGEMM operators has not been studied as thoroughly as that of GEMMs. This paper therefore presents NonGEMM Bench, a benchmark for studying NonGEMM operators. We first construct NonGEMM Bench from popular ML workloads spanning different domains, then perform case studies on GPU platforms of various grades to analyze the behavior of NonGEMM operators in GPU-accelerated systems. Finally, we present key takeaways to bridge the gap between GEMM and NonGEMM operators and offer the community potential new optimization directions.
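To make the GEMM/NonGEMM split concrete, below is a minimal sketch of the kind of operator-level breakdown the abstract describes, using PyTorch's standard `torch.profiler` API. The keyword-based classification of operator names is our own illustrative heuristic, not the paper's taxonomy, and the model and input shapes are arbitrary placeholders.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Assumption: a crude name-based heuristic for tagging GEMM-like operators
# (matmul/conv/linear kernels). The paper may classify operators differently.
GEMM_KEYWORDS = ("gemm", "matmul", "mm", "conv", "linear")

def is_gemm(op_name: str) -> bool:
    name = op_name.lower()
    return any(k in name for k in GEMM_KEYWORDS)

# Placeholder workload: one Transformer encoder layer in inference mode.
model = torch.nn.TransformerEncoderLayer(d_model=256, nhead=8).eval()
x = torch.randn(8, 32, 256)  # (sequence, batch, embedding)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        model(x)

# Aggregate self-reported operator times into GEMM vs. NonGEMM buckets.
gemm_us = nongemm_us = 0.0
for evt in prof.key_averages():
    if is_gemm(evt.key):
        gemm_us += evt.cpu_time_total
    else:
        nongemm_us += evt.cpu_time_total

total = gemm_us + nongemm_us
print(f"GEMM time:    {gemm_us / total:6.1%}")
print(f"NonGEMM time: {nongemm_us / total:6.1%}")
```

On a GPU-accelerated system, the same idea extends by adding `ProfilerActivity.CUDA` to the activity list; the NonGEMM bucket then captures operators such as softmax, layer normalization, and activation functions, which the paper studies across platforms.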