Characterizing Impacts of Storage Faults on HPC Applications: A Methodology and Insights (2105.12929v2)

Published 27 May 2021 in cs.DC

Abstract: In recent years, the increasing complexity of scientific simulations and the emerging demand for training large artificial intelligence models have required massive and fast data access, which pushes high-performance computing (HPC) platforms to adopt more advanced storage infrastructure such as solid-state disks (SSDs). While SSDs offer high-performance I/O, the reliability challenges that HPC applications face under SSD-related failures remain unclear, in particular for failures that result in data corruption. The goal of this paper is to understand the impact of SSD-related faults on the behavior of complex HPC applications. To this end, we propose FFIS, a FUSE-based fault injection framework that systematically introduces storage faults into the application layer to model errors originating from SSDs. FFIS can plant different I/O-related faults into the data returned from the underlying file system, which enables investigation of the error resilience characteristics of scientific file formats. We demonstrate the use of FFIS with three representative real HPC applications, showing how each application reacts to data corruption, and provide insights into the error resilience of the widely adopted HDF5 file format for HPC applications.
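
The idea of planting a fault into data returned from the file system can be illustrated with a minimal sketch. This is not the FFIS implementation, which the abstract describes only as FUSE-based; the function names `flip_random_bit` and `faulty_read`, the `fault_prob` parameter, and the choice of a single bit flip as the fault model are all assumptions made for illustration. The sketch wraps an ordinary file read and, with a given probability, inverts one bit in the returned buffer before handing it to the caller.

```python
import random


def flip_random_bit(data: bytes, rng: random.Random) -> bytes:
    """Return a copy of data with exactly one randomly chosen bit inverted.

    Hypothetical fault model for illustration; empty input is returned unchanged.
    """
    if not data:
        return data
    buf = bytearray(data)
    bit = rng.randrange(len(buf) * 8)   # pick one bit position in the buffer
    buf[bit // 8] ^= 1 << (bit % 8)     # invert that bit
    return bytes(buf)


def faulty_read(path: str, offset: int, size: int,
                rng: random.Random, fault_prob: float = 1.0) -> bytes:
    """Read size bytes at offset, corrupting the result with probability fault_prob.

    Hypothetical wrapper: a FUSE-based tool would do this inside its read
    handler so that the application itself needs no changes.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(size)
    if rng.random() < fault_prob:
        return flip_random_bit(data, rng)
    return data
```

Passing a seeded `random.Random` makes an injection campaign repeatable, so a corrupted run that crashes or silently produces wrong output can be reproduced with the same seed.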

Authors (9)
  1. Bo Fang (26 papers)
  2. Daoce Wang (8 papers)
  3. Sian Jin (32 papers)
  4. Quincey Koziol (6 papers)
  5. Zhao Zhang (250 papers)
  6. Qiang Guan (40 papers)
  7. Suren Byna (15 papers)
  8. Sriram Krishnamoorthy (74 papers)
  9. Dingwen Tao (60 papers)
Citations (4)