
Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with SplitK work decomposition (2402.00025v2)

Published 5 Jan 2024 in cs.DC and cs.AI

Abstract: We propose an implementation of an efficient fused matrix multiplication kernel for W4A16 quantized inference, where we perform dequantization and GEMM in a fused kernel using a SplitK work decomposition. Our implementation shows improvement for the type of skinny matrix-matrix multiplications found in foundation model inference workloads. In particular, this paper examines the matrix multiplication between a skinny activation matrix and a square weight matrix. Our results show an average speed improvement of 65% on A100 and 124% on H100 (with a peak of 295%) for a range of matrix dimensions, including those found in a Llama-style model, where m < n = k.
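
To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch of a W4A16 matrix multiplication with a SplitK decomposition. It is not the paper's Triton kernel: it runs on the CPU, the helper names (`pack_int4`, `dequantize`, `splitk_w4a16_matmul`) are hypothetical, and the quantization scheme (two 4-bit values packed per byte along k, one scale per output column, a single zero point) is an assumption chosen for brevity. On a GPU, each k-slice would be handled by a separate thread block and the partial results combined, for example with atomic adds; here a plain loop stands in for that.

```python
import numpy as np


def pack_int4(q):
    """Pack 4-bit values (0..15) of shape (k, n), k even, into uint8 of shape (k // 2, n)."""
    q = q.astype(np.uint8)
    # Even rows go to the low nibble, odd rows to the high nibble.
    return (q[0::2] | (q[1::2] << 4)).astype(np.uint8)


def dequantize(packed, scale, zero):
    """Unpack two 4-bit values per byte and map them to float16 as (q - zero) * scale."""
    q = np.empty((packed.shape[0] * 2, packed.shape[1]), dtype=np.float32)
    q[0::2] = packed & 0x0F  # low nibble
    q[1::2] = packed >> 4    # high nibble
    return ((q - zero) * scale).astype(np.float16)


def splitk_w4a16_matmul(a, packed, scale, zero, split_k=4):
    """Compute a @ W for float16 activations a (m, k) and packed int4 weights W (k, n).

    The k dimension is cut into split_k slices. Each slice is dequantized
    and multiplied on its own, and the partial products are summed into a
    float32 accumulator (the role atomic adds would play on a GPU).
    """
    m, k = a.shape
    n = packed.shape[1]
    assert packed.shape[0] * 2 == k, "packed weights must cover the full k dimension"
    assert k % (2 * split_k) == 0, "each k-slice must hold an even number of rows"

    chunk = k // split_k
    out = np.zeros((m, n), dtype=np.float32)
    for s in range(split_k):
        k0, k1 = s * chunk, (s + 1) * chunk
        # Fused step: dequantize only the weight slice this unit of work needs.
        w_slice = dequantize(packed[k0 // 2 : k1 // 2], scale, zero)
        out += a[:, k0:k1].astype(np.float32) @ w_slice.astype(np.float32)
    return out.astype(np.float16)
```

Calling `splitk_w4a16_matmul(a, pack_int4(q), scale, zero, split_k=4)` with a skinny `a` (small m, with n = k) mirrors the shape regime the abstract describes. The sketch shows only how the work is divided; it says nothing about performance, which depends on how the slices map onto GPU hardware.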

