
Analyzing GPU Tensor Core Potential for Fast Reductions (1903.03640v1)

Published 8 Mar 2019 in cs.DC

Abstract: The Nvidia GPU architecture has introduced new computing elements such as tensor cores, special processing units dedicated to performing fast matrix-multiply-accumulate (MMA) operations to accelerate Deep Learning applications. In this work we present the idea of using tensor cores for a different purpose: the parallel arithmetic reduction problem. We propose a new GPU tensor-core based algorithm and analyze its potential performance benefits in comparison to a traditional GPU-based one. The proposed method encodes the reduction of $n$ numbers as a set of $m \times m$ MMA tensor-core operations (for Nvidia's Volta architecture, $m = 16$) and takes advantage of the fact that each MMA operation takes just one GPU cycle. When the cost is analyzed under a simplified GPU computing model, the new algorithm reduces a problem of $n$ numbers in $T(n) = 5\log_{m^2}(n)$ steps, with a speedup of $S = \frac{4}{5}\log_2(m^2)$.
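The core encoding can be illustrated without tensor-core hardware: multiplying an $m \times m$ data fragment on both sides by all-ones matrices collapses it to its total sum, so $m^2$ numbers are reduced by two matrix products, and repeating this shrinks the input by a factor of $m^2$ per round. Below is a minimal NumPy sketch of that idea under these assumptions; the function names are illustrative, and this is not the authors' CUDA implementation, which issues actual MMA instructions on fragments.

```python
import numpy as np

m = 16  # MMA fragment size on Nvidia's Volta architecture

def mma_reduce_block(V):
    """Sum one m x m block using two matrix products, mimicking
    two MMA operations of the form D = A @ B + C (with C = 0)."""
    O = np.ones((m, m), dtype=V.dtype)
    # First product: every row of (O @ V) holds the column sums of V.
    # Second product: every entry of (O @ V) @ O is the total block sum.
    return ((O @ V) @ O)[0, 0]

def tensor_core_style_reduce(x):
    """Reduce n numbers by repeatedly packing them into m x m blocks,
    shrinking the problem by a factor of m*m per round."""
    x = np.asarray(x, dtype=np.float32)
    while x.size > 1:
        pad = (-x.size) % (m * m)                    # zero-pad to a multiple of m*m
        x = np.concatenate([x, np.zeros(pad, dtype=x.dtype)])
        blocks = x.reshape(-1, m, m)                 # one m x m fragment per block
        x = np.array([mma_reduce_block(B) for B in blocks], dtype=x.dtype)
    return float(x[0])
```

Each pass over the blocks corresponds to one level of the $\log_{m^2}(n)$-depth reduction tree described in the abstract; the zero padding only adds neutral elements to the sum.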

Authors (3)
  1. Roberto Carrasco (4 papers)
  2. Raimundo Vega (4 papers)
  3. Cristóbal A. Navarro (21 papers)
Citations (11)