Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
139 tokens/sec
GPT-4o
47 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Resilience for Exascale Enabled Multigrid Methods (1501.07400v1)

Published 29 Jan 2015 in cs.CE and cs.DC

Abstract: With the increasing number of components and further miniaturization the mean time between faults in supercomputers will decrease. System level fault tolerance techniques are expensive and cost energy, since they are often based on redundancy. Also classical check-point-restart techniques reach their limits when the time for storing the system state to backup memory becomes excessive. Therefore, algorithm-based fault tolerance mechanisms can become an attractive alternative. This article investigates the solution process for elliptic partial differential equations that are discretized by finite elements. Faults that occur in the parallel geometric multigrid solver are studied in various model scenarios. In a standard domain partitioning approach, the impact of a failure of a core or a node will affect one or several subdomains. Different strategies are developed to compensate the effect of such a failure algorithmically. The recovery is achieved by solving a local subproblem with Dirichlet boundary conditions using local multigrid cycling algorithms. Additionally, we propose a superman strategy where extra compute power is employed to minimize the time of the recovery process.

Citations (6)

Summary

We haven't generated a summary for this paper yet.