Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with SplitK work decomposition (2402.00025v2)
Published 5 Jan 2024 in cs.DC and cs.AI
Abstract: We propose an efficient fused matrix-multiplication kernel for W4A16 quantized inference, in which dequantization and GEMM are performed in a single fused kernel using a SplitK work decomposition. Our implementation improves performance for the skinny matrix-matrix multiplications found in foundation model inference workloads. In particular, we study the multiplication of a skinny activation matrix by a square weight matrix. Our results show an average speedup of 65% on A100 and an average speedup of 124% on H100 (with a peak of 295%) across a range of matrix dimensions, including those found in a Llama-style model, where m < n = k.
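The two ideas in the abstract, fused int4 dequantization and a SplitK decomposition of the GEMM, can be sketched in NumPy. This is a CPU illustration of the arithmetic only, not the Triton kernel; the int4 packing layout (two nibbles per byte along K) and the per-output-channel scale/zero-point scheme are assumptions, and the per-split partial sums stand in for the atomic reduction the real kernel performs across thread blocks.

```python
import numpy as np

def pack_int4(w_int4):
    """Pack pairs of unsigned 4-bit values (0..15) along K into uint8:
    low nibble = even K row, high nibble = odd K row (assumed layout)."""
    return (w_int4[0::2] | (w_int4[1::2] << 4)).astype(np.uint8)

def dequant_unpack(packed, scales, zeros):
    """Unpack nibbles and apply per-output-channel scale and zero-point."""
    lo = packed & 0xF          # even K rows
    hi = packed >> 4           # odd K rows
    w = np.empty((packed.shape[0] * 2, packed.shape[1]), dtype=np.float32)
    w[0::2] = lo
    w[1::2] = hi
    return (w - zeros) * scales

def splitk_gemm(a, packed, scales, zeros, split_k=4):
    """SplitK GEMM: the K dimension is divided into `split_k` chunks; each
    chunk dequantizes its weight tile and computes a partial product, and
    the partials are then reduced (atomics on the GPU, a sum here)."""
    m, k = a.shape
    assert k % split_k == 0
    kc = k // split_k          # K elements per split
    kcp = kc // 2              # packed (uint8) rows per split
    n = packed.shape[1]
    partials = np.zeros((split_k, m, n), dtype=np.float32)
    for s in range(split_k):   # each iteration models one SplitK worker
        w_tile = dequant_unpack(packed[s * kcp:(s + 1) * kcp], scales, zeros)
        partials[s] = a[:, s * kc:(s + 1) * kc] @ w_tile
    return partials.sum(axis=0)
```

For skinny activations (small m), an ordinary tiling over M and N launches too few thread blocks to fill the GPU; splitting K as above exposes `split_k` times more parallelism at the cost of the final reduction, which is the trade-off the paper exploits.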