
Sparse GPU Kernels for Deep Learning (2006.10901v2)

Published 18 Jun 2020 in cs.LG, cs.DC, and stat.ML

Abstract: Scientific workloads have traditionally exploited high levels of sparsity to accelerate computation and reduce memory requirements. While deep neural networks can be made sparse, achieving practical speedups on GPUs is difficult because these applications have relatively moderate levels of sparsity that are not sufficient for existing sparse kernels to outperform their dense counterparts. In this work, we study sparse matrices from deep learning applications and identify favorable properties that can be exploited to accelerate computation. Based on these insights, we develop high-performance GPU kernels for two sparse matrix operations widely applicable in neural networks: sparse matrix-dense matrix multiplication and sampled dense-dense matrix multiplication. Our kernels reach 27% of single-precision peak on Nvidia V100 GPUs. Using our kernels, we demonstrate sparse Transformer and MobileNet models that achieve 1.2-2.1x speedups and up to 12.8x memory savings without sacrificing accuracy.

Sparse GPU Kernels for Deep Learning: An Analytical Exposition

The paper "Sparse GPU Kernels for Deep Learning" by Gale et al. addresses the challenges and opportunities presented by exploiting sparsity in deep neural networks (DNNs) to achieve computational efficiency on graphics processing units (GPUs). The focus lies in the development and optimization of GPU kernels specifically targeting operations involving sparse matrices within the context of DNNs, notably sparse matrix--dense matrix multiplication (SpMM) and sampled dense--dense matrix multiplication (SDDMM).

Study and Insights

This research acknowledges that, while scientific workloads have successfully leveraged high sparsity to optimize performance and reduce memory footprint, deep learning models typically exhibit only moderate levels of sparsity. This moderate sparsity is insufficient for existing sparse kernels to outperform dense computation on GPUs. Consequently, the authors conducted a substantial evaluation of sparse matrices arising in DNNs, identifying critical properties that differentiate them from the sparse matrices of traditional scientific applications. Key findings indicate that DNN matrices are generally less sparse, possess longer average row lengths, and exhibit lower variability in row length, providing a foundation for accelerating their computation through targeted GPU kernels.
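These row-length properties can be summarized with simple statistics. The following is a minimal sketch (with made-up toy data, not measurements from the paper) of the mean row length and coefficient of variation one might compute for a matrix stored in CSR form:

```python
# Sketch: row-length statistics used to characterize sparsity patterns.
# The row lengths below are a toy example, not data from the paper.

def row_length_stats(row_lengths):
    """Return (mean, coefficient of variation) of CSR row lengths."""
    n = len(row_lengths)
    mean = sum(row_lengths) / n
    var = sum((l - mean) ** 2 for l in row_lengths) / n
    cv = (var ** 0.5) / mean  # low CV => uniform rows, easier to balance
    return mean, cv

# A hypothetical pruned-weight matrix: fairly long, uniform rows.
dnn_rows = [28, 30, 32, 29, 31, 30, 28, 32]
mean, cv = row_length_stats(dnn_rows)
```

A low coefficient of variation is the property that matters here: it means work is spread evenly across rows, which GPUs can exploit.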

Methodological Contributions

The core of this work is the development of high-performance GPU kernels for SpMM and SDDMM, built on several methodological innovations:

  1. 1-Dimensional Tiling: computations are decomposed hierarchically across processing elements along a single dimension, facilitating operand reuse and simplifying extensibility.
  2. Subwarp Tiling and Reverse-Offset Memory Alignment: these strategies enable efficient vector memory operations even when row starts are not aligned to vector-width boundaries, thereby maximizing bandwidth utilization.
  3. Row Swizzle Load Balancing: by decoupling the assignment of work to processing units from the memory layout, this technique balances the computational load across them, improving utilization of the GPU's capabilities.
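The third technique can be illustrated at a high level: reorder only the schedule in which rows are processed, not the matrix itself, so that long rows start early and stragglers are short. This is a minimal host-side Python sketch of that idea, assuming CSR row offsets; the paper's actual kernels apply it on the GPU:

```python
# Sketch of row-swizzle load balancing (a simplified illustration of
# the idea): compute a processing order over rows, leaving the CSR
# data untouched, so long rows are scheduled first.

def row_swizzle(row_offsets):
    """Given CSR row offsets, return row indices sorted by decreasing
    nonzero count. Scheduling long rows first keeps processing
    elements busy and makes the last-finishing work items short."""
    lengths = [row_offsets[i + 1] - row_offsets[i]
               for i in range(len(row_offsets) - 1)]
    return sorted(range(len(lengths)), key=lambda i: -lengths[i])

# CSR offsets for a toy 4-row matrix with row lengths 2, 7, 1, 4.
offsets = [0, 2, 9, 10, 14]
order = row_swizzle(offsets)
```

Because only the index mapping changes, the technique composes with the tiling schemes above without altering the memory layout they rely on.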

Numerical Results and Performance Analysis

Benchmarks on Nvidia V100 GPUs showed significant speedups for these custom kernels. SpMM achieved a geometric mean speedup of 3.58× over Nvidia's cuSPARSE, with peak throughput reaching 27.3% of the single-precision theoretical maximum. SDDMM likewise showed gains, with a geometric mean speedup of 2.19×. Extending the kernels to mixed precision yielded a geometric mean speedup of 5.97×, underscoring their adaptability and potential efficiency gains across precision settings.
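For reference, a geometric mean speedup such as the 3.58× figure aggregates per-problem speedups as the n-th root of their product, which keeps large outliers from dominating the summary. A short sketch with toy numbers (not the paper's benchmark data):

```python
# Geometric mean of per-problem speedups (toy values, not the
# paper's measurements).
import math

def geomean(speedups):
    """n-th root of the product, computed stably in log space."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

per_matrix = [2.0, 4.0, 8.0]
overall = geomean(per_matrix)  # cube root of 64, i.e. 4.0
```

The arithmetic mean of the same toy values would be 4.67×, which is why benchmark suites with skewed distributions prefer the geometric mean.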

Implications for Dense Model Equivalence and Memory Efficiency

By permitting DNN architectures like Transformers and MobileNetV1 to leverage sparsity natively, the applications assessed in this work achieved end-to-end speedups of 1.2× to 2.1× while maintaining model accuracy. These implementations also yielded substantial memory savings, with reported reductions of up to 12.8×. Practically, these enhancements allow either deploying more resource-efficient models or training larger architectures within a fixed computational budget, opening avenues for more extensive exploration in AI model training and deployment.
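A back-of-the-envelope model shows where such memory savings come from. The sketch below assumes a dense fp32 matrix versus a CSR encoding with fp32 values and int32 indices; the paper's 12.8× figure depends on its specific models and storage formats, so these numbers are purely illustrative:

```python
# Illustrative memory model: dense fp32 storage vs. CSR storage
# (fp32 values + int32 column indices + int32 row offsets).
# Assumed format details; not the paper's exact accounting.

def dense_bytes(rows, cols):
    return rows * cols * 4  # fp32 everywhere

def csr_bytes(rows, nnz):
    # values + column indices, 4 bytes each, plus row offsets
    return nnz * 4 + nnz * 4 + (rows + 1) * 4

rows, cols, density = 1024, 1024, 0.05
nnz = int(rows * cols * density)
ratio = dense_bytes(rows, cols) / csr_bytes(rows, nnz)  # roughly 10x
```

Because CSR stores an index per nonzero, the savings are roughly half of the raw density reduction; higher sparsity or more compact index formats push the ratio further toward figures like those reported.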

Future Directions

The findings underscore notable gaps in current kernel space exploration, especially regarding structured sparsity aligned with specialized hardware such as Nvidia's tensor cores. Future investigations might develop bespoke sparse formats for transposed operations, architect kernels that exploit advances in GPU memory hierarchies or specialized matrix-multiplication hardware, optimize contiguous memory access patterns for mixed precision, or extend the approach to structured patterns such as block sparsity.

By synthesizing a unifying framework for sparse matrix operations targeting deep learning on GPUs, this work sets the stage for ongoing enhancements in computational efficiency and resource optimization in large-scale AI models.

Authors
  1. Trevor Gale
  2. Matei Zaharia
  3. Cliff Young
  4. Erich Elsen