Benchmarking the cost of thread divergence in CUDA (1504.01650v1)

Published 7 Apr 2015 in cs.DC

Abstract: All modern processors include a set of vector instructions. While this gives a tremendous boost to performance, it requires vectorized code that can take advantage of such instructions. As ideal vectorization is hard to achieve in practice, one has to decide when different instructions may be applied to different elements of the vector operand. This is especially important in implicit vectorization, as in the NVIDIA CUDA Single Instruction Multiple Threads (SIMT) model, where the vectorization details are hidden from the programmer. To assess the costs incurred by incompletely vectorized code, we have developed a micro-benchmark that measures the characteristics of the CUDA thread divergence model on different architectures, focusing on loop performance.
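
To make the notion of divergence cost in loops concrete, the following is a minimal sketch of an idealized cost model, not the paper's micro-benchmark. It assumes that a warp of 32 threads executes a loop in lockstep until its longest-running thread finishes, with already-finished threads masked off. The function name `warp_loop_cost` and the sample trip counts are hypothetical, chosen only for illustration; real hardware behavior varies by architecture, which is what the paper sets out to measure.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warp_loop_cost(trip_counts):
    """Idealized SIMT model (an assumption, not measured data).

    trip_counts holds the loop iteration count of each thread in one warp.
    Returns (lockstep_iterations, efficiency), where efficiency is the
    fraction of thread-iteration slots that do useful work.
    """
    if len(trip_counts) != WARP_SIZE:
        raise ValueError("expected one trip count per thread in the warp")

    # The warp keeps iterating until its slowest thread is done.
    lockstep_iterations = max(trip_counts)

    # Each thread only does useful work during its own iterations.
    useful_slots = sum(trip_counts)
    total_slots = lockstep_iterations * WARP_SIZE

    efficiency = useful_slots / total_slots if total_slots else 1.0
    return lockstep_iterations, efficiency


# No divergence: every thread runs the same number of iterations.
uniform = [100] * WARP_SIZE

# Divergent case: one thread runs ten times longer than the rest.
one_long = [10] * (WARP_SIZE - 1) + [100]

if __name__ == "__main__":
    print(warp_loop_cost(uniform))
    print(warp_loop_cost(one_long))
```

In this model a single long-running thread forces the whole warp through all 100 iterations, so most thread-iteration slots are idle. A micro-benchmark such as the one the abstract describes would replace this assumption with timings taken on actual devices.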

Authors (2)
  1. Piotr Bialas (12 papers)
  2. Adam Strzelecki (3 papers)
Citations (26)
