Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with SplitK work decomposition (2402.00025v2)
Published 5 Jan 2024 in cs.DC and cs.AI
Abstract: We propose an efficient fused matrix-multiplication kernel for W4A16 quantized inference, in which dequantization and GEMM are performed in a single fused kernel using a SplitK work decomposition. Our implementation improves performance for the skinny matrix-matrix multiplications found in foundation model inference workloads. In particular, we study the multiplication of a skinny activation matrix by a square weight matrix. Our results show an average speedup of 65% on A100 and an average speedup of 124% on H100 (with a peak of 295%) across a range of matrix dimensions, including those found in a Llama-style model, where m < n = k.
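The two ideas in the abstract, fused int4 dequantization and a SplitK decomposition of the GEMM, can be sketched in NumPy. This is a CPU illustration of the arithmetic only, not the Triton kernel; the int4 packing layout (two nibbles per byte along K) and the per-output-channel scale/zero-point scheme are assumptions, and the per-split partial sums stand in for the atomic reduction the real kernel performs across thread blocks.

```python
import numpy as np

def pack_int4(w_int4):
    """Pack pairs of unsigned 4-bit values (0..15) along K into uint8:
    low nibble = even K row, high nibble = odd K row (assumed layout)."""
    return (w_int4[0::2] | (w_int4[1::2] << 4)).astype(np.uint8)

def dequant_unpack(packed, scales, zeros):
    """Unpack nibbles and apply per-output-channel scale and zero-point."""
    lo = packed & 0xF          # even K rows
    hi = packed >> 4           # odd K rows
    w = np.empty((packed.shape[0] * 2, packed.shape[1]), dtype=np.float32)
    w[0::2] = lo
    w[1::2] = hi
    return (w - zeros) * scales

def splitk_gemm(a, packed, scales, zeros, split_k=4):
    """SplitK GEMM: the K dimension is divided into `split_k` chunks; each
    chunk dequantizes its weight tile and computes a partial product, and
    the partials are then reduced (atomics on the GPU, a sum here)."""
    m, k = a.shape
    assert k % split_k == 0
    kc = k // split_k          # K elements per split
    kcp = kc // 2              # packed (uint8) rows per split
    n = packed.shape[1]
    partials = np.zeros((split_k, m, n), dtype=np.float32)
    for s in range(split_k):   # each iteration models one SplitK worker
        w_tile = dequant_unpack(packed[s * kcp:(s + 1) * kcp], scales, zeros)
        partials[s] = a[:, s * kc:(s + 1) * kc] @ w_tile
    return partials.sum(axis=0)
```

For skinny activations (small m), an ordinary tiling over M and N launches too few thread blocks to fill the GPU; splitting K as above exposes `split_k` times more parallelism at the cost of the final reduction, which is the trade-off the paper exploits.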