Balanced Sparsity for Efficient DNN Inference on GPU (1811.00206v4)

Published 1 Nov 2018 in cs.CV

Abstract: In trained deep neural networks, unstructured pruning can remove redundant weights to lower storage cost. However, it requires customized hardware to speed up practical inference. Another line of work accelerates sparse model inference on general-purpose hardware by adopting coarse-grained sparsity that prunes or regularizes consecutive weights for efficient computation, but this often sacrifices model accuracy. In this paper, we propose a novel fine-grained sparsity approach, balanced sparsity, which achieves high model accuracy and runs efficiently on commodity hardware. Our approach adapts to the high-parallelism property of GPUs, showing strong potential for sparsity in widely deployed deep learning services. Experimental results show that balanced sparsity achieves up to 3.1x practical speedup for model inference on GPU while retaining the same high model accuracy as fine-grained sparsity.
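
Below is a minimal numpy sketch of the balanced-sparsity idea described in the abstract: each weight row is split into equal-sized blocks, and the same fraction of smallest-magnitude weights is zeroed within every block, so each block keeps an identical number of nonzeros and GPU threads get balanced workloads. The block size, per-row blocking, and function name are illustrative assumptions, not the paper's exact algorithm.

    import numpy as np

    def balanced_prune(weights, block_size=32, sparsity=0.5):
        # Illustrative sketch: prune every equal-sized block within a row
        # to the same sparsity by zeroing its smallest-magnitude weights.
        out = weights.copy()
        rows, cols = out.shape
        assert cols % block_size == 0, "columns must divide evenly into blocks"
        k = int(block_size * sparsity)  # weights zeroed per block
        for r in range(rows):
            for b in range(0, cols, block_size):
                block = out[r, b:b + block_size]          # view into out
                idx = np.argsort(np.abs(block))[:k]       # k smallest magnitudes
                block[idx] = 0.0                          # zero them in place
            # every block in this row now holds exactly block_size - k nonzeros
        return out

    # Example: a 4x8 matrix with blocks of 4 at 50% sparsity -> 2 zeros per block
    w = np.random.randn(4, 8)
    pruned = balanced_prune(w, block_size=4, sparsity=0.5)

Because every block retains the same number of nonzeros, the pruned matrix can be stored in a regular compressed layout and processed without the load imbalance that unstructured sparsity causes on GPUs.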

Authors (5)
  1. Zhuliang Yao
  2. Shijie Cao
  3. Wencong Xiao
  4. Chen Zhang
  5. Lanshun Nie
Citations (87)