SPLAT: A framework for optimised GPU code-generation for SParse reguLar ATtention (2407.16847v1)

Published 23 Jul 2024 in cs.PL and cs.LG

Abstract: Multi-head self-attention (MHSA) mechanisms achieve state-of-the-art (SOTA) performance across natural language processing and vision tasks. However, their quadratic dependence on sequence length has bottlenecked inference speeds. To circumvent this bottleneck, researchers have proposed various sparse-MHSA models, in which only a subset of full attention is computed. Despite their promise, current sparse libraries and compilers do not support high-performance implementations for diverse sparse-MHSA patterns due to the underlying sparse formats they operate on. These formats, typically designed for high-performance and scientific computing applications, are curated either for extreme amounts of random sparsity (<1% non-zero values) or for specific sparsity patterns. However, the sparsity patterns in sparse-MHSA are moderately sparse (10-50% non-zero values) and varied, so existing sparse formats trade off generality for performance. We bridge this gap, achieving both generality and performance, by proposing a novel sparse format, affine-compressed-sparse-row (ACSR), and a supporting code-generation scheme, SPLAT, that generates high-performance implementations for diverse sparse-MHSA patterns on GPUs. Core to our proposed format and code-generation algorithm is the observation that common sparse-MHSA patterns have uniquely regular geometric properties. These properties, which can be analyzed just-in-time, expose novel optimizations and tiling strategies that SPLAT exploits to generate high-performance implementations for diverse patterns. To demonstrate SPLAT's efficacy, we use it to generate code for various sparse-MHSA models, achieving geomean speedups of 2.05x and 4.05x over hand-written kernels written in Triton and TVM, respectively, on A100 GPUs. Moreover, its interfaces are intuitive and easy to use with existing implementations of MHSA in JAX.
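
To make the target concrete, the sketch below illustrates in JAX the kind of moderately sparse, geometrically regular pattern the abstract describes: a sliding-window mask in which the non-zero columns of each row follow a simple affine rule in the row index. The function names, window size, and the dense masked-attention reference are illustrative assumptions; they are not SPLAT's API, the ACSR layout, or its generated kernels.

```python
# Illustrative sketch only (not SPLAT's actual interface): a sliding-window
# sparse-MHSA pattern of the kind the abstract describes -- moderately sparse
# and geometrically regular, so each row's non-zero columns follow a simple
# affine rule in the row index.
import jax
import jax.numpy as jnp

def sliding_window_mask(seq_len: int, window: int) -> jnp.ndarray:
    # Row i attends to columns j with |i - j| <= window; the first non-zero
    # column of row i is max(0, i - window), an affine function of i.
    idx = jnp.arange(seq_len)
    return jnp.abs(idx[:, None] - idx[None, :]) <= window

def masked_attention(q, k, v, mask):
    # Dense reference semantics of sparse-MHSA: scores outside the mask are
    # discarded before the softmax. A generated sparse kernel would avoid
    # materialising the masked-out entries in the first place.
    scores = q @ k.T / jnp.sqrt(q.shape[-1])
    scores = jnp.where(mask, scores, -jnp.inf)
    return jax.nn.softmax(scores, axis=-1) @ v

key = jax.random.PRNGKey(0)
q, k, v = (jax.random.normal(k_i, (512, 64)) for k_i in jax.random.split(key, 3))
mask = sliding_window_mask(512, window=64)  # roughly 25% non-zero: "moderate" sparsity
out = masked_attention(q, k, v, mask)
```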

References (37)
  1. [n. d.]. cuBLAS. https://developer.nvidia.com/cublas.
  2. [n. d.]. cuSPARSE. https://developer.nvidia.com/cusparse.
  3. [n. d.]. SuiteSparse. https://sparse.tamu.edu/.
  4. Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code. CoRR abs/1804.10694 (2018). arXiv:1804.10694 http://arxiv.org/abs/1804.10694
  5. Longformer: The Long-Document Transformer. arXiv:2004.05150 [cs.CL]
  6. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (Tucson, AZ, USA) (PLDI ’08). Association for Computing Machinery, New York, NY, USA, 101–113. https://doi.org/10.1145/1375581.1375595
  7. JAX: composable transformations of Python+NumPy programs. http://github.com/google/jax
  8. Language Models are Few-Shot Learners. CoRR abs/2005.14165 (2020). arXiv:2005.14165 https://arxiv.org/abs/2005.14165
  9. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (Carlsbad, CA, USA) (OSDI’18). USENIX Association, USA, 579–594.
  10. LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models. arXiv:2309.12307 [cs.CL] https://arxiv.org/abs/2309.12307
  11. Runtime Composition of Iterations for Fusing Loop-carried Sparse Dependence. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Denver, CO, USA) (SC ’23). Association for Computing Machinery, New York, NY, USA, Article 89, 15 pages. https://doi.org/10.1145/3581784.3607097
  12. Generating Long Sequences with Sparse Transformers. CoRR abs/1904.10509 (2019). arXiv:1904.10509 http://arxiv.org/abs/1904.10509
  13. Unified Sparse Formats for Tensor Algebra Compilers. CoRR abs/1804.10112 (2018). arXiv:1804.10112 http://arxiv.org/abs/1804.10112
  14. Clement Farabet and Tris Warkentin. [n. d.]. Gemma 2 is now available to researchers and developers. https://blog.google/technology/developers/google-gemma-2/. [Accessed 15-07-2024].
  15. Sparse GPU kernels for deep learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Atlanta, Georgia) (SC ’20). IEEE Press, Article 17, 14 pages.
  16. A Systematic Survey of General Sparse Matrix-Matrix Multiplication. ACM Comput. Surv. 55, 12, Article 244 (Mar. 2023), 36 pages. https://doi.org/10.1145/3571157
  17. Block-Sparse GPU Kernels. https://openai.com/research/block-sparse-gpu-kernels
  18. Polly - Performing Polyhedral Optimizations on a Low-Level Intermediate Representation. Parallel Process. Lett. 22 (2012). https://api.semanticscholar.org/CorpusID:18533155
  19. Adaptive sparse tiling for sparse matrix multiplication. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (Washington, District of Columbia) (PPoPP ’19). Association for Computing Machinery, New York, NY, USA, 300–314. https://doi.org/10.1145/3293883.3295712
  20. Victor Eijkhout, Jack Dongarra, and Henk van der Vorst. [n. d.]. SparseBench. https://www.netlib.org/benchmark/sparsebench/.
  21. Mistral 7B. arXiv:2310.06825 [cs.CL] https://arxiv.org/abs/2310.06825
  22. David B. Kirk and Wen-mei W. Hwu. 2010. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers, Burlington, MA.
  23. Reformer: The Efficient Transformer. CoRR abs/2001.04451 (2020). arXiv:2001.04451 https://arxiv.org/abs/2001.04451
  24. The Tensor Algebra Compiler. Proc. ACM Program. Lang. 1, OOPSLA, Article 77 (Oct. 2017), 29 pages. https://doi.org/10.1145/3133901
  25. ChatGPT and large language models in academia: opportunities and challenges. BioData Mining 16, 1 (2023), 20.
  26. PolyMage: Automatic Optimization for Image Processing Pipelines. SIGPLAN Not. 50, 4 (Mar. 2015), 429–443. https://doi.org/10.1145/2775054.2694364
  27. Sampled Dense Matrix Multiplication for High-Performance Machine Learning. In 2018 IEEE 25th International Conference on High Performance Computing (HiPC). 32–41. https://doi.org/10.1109/HiPC.2018.00013
  28. NVIDIA. 2020. Ampere Architecture. https://www.nvidia.com/en-us/data-center/ampere-architecture/
  29. Accelerating inference with sparsity using the Nvidia ampere architecture and NVIDIA TENSORRT. https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/
  30. SCROLLS: Standardized CompaRison Over Long Language Sequences. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 12007–12021. https://aclanthology.org/2022.emnlp-main.823
  31. Long Range Arena: A Benchmark for Efficient Transformers. arXiv:2011.04006 [cs.LG] https://arxiv.org/abs/2011.04006
  32. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (Phoenix, AZ, USA) (MAPL 2019). Association for Computing Machinery, New York, NY, USA, 10–19. https://doi.org/10.1145/3315508.3329973
  33. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions. CoRR abs/1802.04730 (2018). arXiv:1802.04730 http://arxiv.org/abs/1802.04730
  34. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  35. Register Tiling for Unstructured Sparsity in Neural Network Inference. Proc. ACM Program. Lang. 7, PLDI, Article 188 (Jun. 2023), 26 pages. https://doi.org/10.1145/3591302
  36. SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning. arXiv:2207.04606 [cs.LG]
  37. Long-Short Transformer: Efficient Transformers for Language and Vision. CoRR abs/2107.02192 (2021). arXiv:2107.02192 https://arxiv.org/abs/2107.02192
Authors (7)
  1. Ahan Gupta (3 papers)
  2. Yueming Yuan (3 papers)
  3. Devansh Jain (7 papers)
  4. Yuhao Ge (1 paper)
  5. David Aponte (6 papers)
  6. Yanqi Zhou (30 papers)
  7. Charith Mendis (20 papers)
