Understanding the Effects of Permanent Faults in GPU's Parallelism Management and Control Units (2306.10856v2)

Published 19 Jun 2023 in cs.AR

Abstract: Graphics Processing Units (GPUs) are over-stressed to accelerate High-Performance Computing applications and are used to accelerate Deep Neural Networks in several domains where they have a life expectancy of many years. These conditions expose the GPU hardware to (premature) aging, causing permanent faults to arise after the usual end-of-manufacturing test. Techniques to assess the impact of permanent faults in GPUs are therefore strongly required, making it possible to estimate the reliability risk and possibly to mitigate it. In this paper, we present a method to evaluate the effects of permanent faults affecting the GPU scheduler and control units, which are the most peculiar and stressed resources, along with the first figures that allow quantifying these effects. We characterize over 5.83x10⁵ permanent fault effects in the scheduler and controllers of a gate-level GPU model. Then, we map the observed error categories in software by instrumenting the code of 13 applications and two convolutional neural networks, injecting more than 1.65x10⁵ permanent errors. Our two-level fault injection strategy reduces the evaluation time from hundreds of years of gate-level evaluation to hundreds of hours. We found that faults in the GPU parallelism management units can modify the opcode, the addresses, and the status of thread(s) and warp(s). The large majority (up to 99%) of these hardware permanent errors impact the running software execution. Errors affecting the instruction operation or resource management hang the code, while 45% of errors in the parallelism management or control-flow induce silent data corruptions.



Summary

The paper characterizes over 580,000 permanent fault effects in the scheduler and controllers of a gate-level GPU model, then injects more than 165,000 permanent errors into instrumented software (13 applications and two convolutional neural networks).
Faults in the parallelism management units can modify the opcode, the addresses, and the status of threads and warps.
Up to 99% of these hardware permanent errors affect the running software; 45% of the errors in parallelism management or control flow cause silent data corruption.

The paper "Understanding the Effects of Permanent Faults in GPU's Parallelism Management and Control Units" (2306.10856) explores how permanent faults in GPU parallelism management and control units affect the running software. The researchers characterized over $5.83 \times 10^5$ permanent fault effects in the scheduler and controllers of a gate-level GPU model. They then mapped the observed error categories to software by instrumenting the code of 13 applications and two convolutional neural networks, injecting more than $1.65 \times 10^5$ permanent errors. According to the abstract, this two-level strategy reduces the evaluation time from hundreds of years of gate-level evaluation to hundreds of hours.

The paper found that faults in GPU parallelism management units can modify the opcode, the addresses, and the status of threads and warps. A significant majority (up to 99%) of hardware permanent errors impact the running software execution. Specifically, errors affecting the instruction operation or resource management can cause the code to hang, while 45% of errors in parallelism management or control-flow induce silent data corruptions (SDC).
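The outcome categories above can be illustrated with a small software-level sketch. This is not the paper's instrumentation: the toy `vector_add` "kernel", the `fault` dictionary, and the names `thread_disabled` and `address_offset` are all hypothetical, chosen only to mimic two of the reported error types (a corrupted thread status and a corrupted address). An out-of-range access stands in for a crash or hang, which is an assumption of this sketch.

```python
from enum import Enum


class Outcome(Enum):
    MASKED = "masked"  # faulty output equals the golden output
    SDC = "sdc"        # silent data corruption: wrong output, no error raised
    DUE = "due"        # detected failure; stands in for a crash or hang here


def vector_add(a, b, fault=None):
    """Toy 'kernel': one logical thread per element (hypothetical model)."""
    n = len(a)
    out = [0] * n
    for tid in range(n):
        hit = fault is not None and tid == fault["tid"]
        if hit and fault["kind"] == "thread_disabled":
            continue  # permanent error: this thread never executes
        idx = tid
        if hit and fault["kind"] == "address_offset":
            idx = tid + fault["offset"]  # permanent error: wrong address
        if not 0 <= idx < n:
            raise IndexError("out-of-range access")  # modeled as a DUE
        out[tid] = a[idx] + b[idx]
    return out


def classify(a, b, fault):
    """Compare a faulty run against the fault-free (golden) run."""
    golden = vector_add(a, b)
    try:
        faulty = vector_add(a, b, fault)
    except IndexError:
        return Outcome.DUE
    return Outcome.MASKED if faulty == golden else Outcome.SDC
```

Because the error is applied on every execution of the affected thread, it behaves like a permanent fault rather than a transient one. Whether a given error is masked depends on the data: disabling a thread whose correct result happens to equal the initial buffer value leaves the output unchanged.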

In summary, the research quantifies the impact of permanent faults within a GPU's core control structures, revealing that a large percentage of these faults influence software execution, leading to hangs or silent data corruption.
