
Accelerating Reduction and Scan Using Tensor Core Units (1811.09736v2)

Published 24 Nov 2018 in cs.PF

Abstract: Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4x4 or 16x16) to accelerate the convolutional and recurrent neural networks in deep learning workloads. In this paper we leverage NVIDIA's TCU to express both reduction and scan with matrix multiplication and show the benefits -- in terms of program simplicity, efficiency, and performance. Our algorithm exercises the NVIDIA TCUs which would otherwise be idle, achieves 89%-98% of peak memory copy bandwidth, and is orders of magnitude faster (up to 100x for reduction and 3x for scan) than state-of-the-art methods for small segment sizes -- common in machine learning and scientific applications. Our algorithm achieves this while decreasing the power consumption by up to 22% for reduction and 16% for scan.
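
The idea of expressing reduction and scan as matrix multiplication can be illustrated with a small sketch. This is not the paper's kernel: it is a plain Python illustration under the assumption that each segment is stored as one row of a matrix. Multiplying by a column of ones sums each row (reduction), and multiplying by an upper-triangular matrix of ones yields each row's inclusive prefix sums (scan). The helper names `matmul`, `segmented_reduce`, and `segmented_scan` are hypothetical, and `matmul` merely stands in for a hardware matrix-multiply call.

```python
def matmul(a, b):
    """Plain triple-loop product; a stand-in for one TCU matrix-multiply call."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def segmented_reduce(segments):
    """Sum each row (one segment per row) by multiplying with a column of ones."""
    width = len(segments[0])
    ones = [[1] for _ in range(width)]
    return [row[0] for row in matmul(segments, ones)]


def segmented_scan(segments):
    """Inclusive prefix sum of each row via an upper-triangular matrix of ones."""
    width = len(segments[0])
    # upper[p][j] is 1 when p <= j, so output column j accumulates inputs 0..j.
    upper = [[1 if p <= j else 0 for j in range(width)] for p in range(width)]
    return matmul(segments, upper)


# Two segments of length 4, processed together in a single product each.
segments = [[1, 2, 3, 4], [5, 6, 7, 8]]
reduced = segmented_reduce(segments)
scanned = segmented_scan(segments)
```

Here `reduced` holds one sum per segment and `scanned` holds the running totals of each segment. Because every segment is a row of the same operand, many small segments are handled by one multiplication, which is consistent with the abstract's emphasis on small segment sizes.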

Authors (5)
  1. Abdul Dakkak (11 papers)
  2. Cheng Li (1094 papers)
  3. Isaac Gelado (3 papers)
  4. Jinjun Xiong (118 papers)
  5. Wen-mei Hwu (62 papers)
Citations (82)
