Coded Computing for Fault-Tolerant Parallel QR Decomposition (2311.11943v1)

Published 20 Nov 2023 in cs.DC, cs.IT, cs.SY, eess.SY, and math.IT

Abstract: QR decomposition is an essential operation for solving linear equations and obtaining least-squares solutions. In high-performance computing systems, large-scale parallel QR decomposition often faces node faults. We address this issue by proposing a fault-tolerant algorithm that incorporates "coded computing" into the parallel Gram-Schmidt method, which is commonly used for QR decomposition. Coded computing introduces error-correcting codes into computational processes to enhance resilience against intermediate failures. While traditional coding strategies cannot preserve the orthogonality of $Q$, recent work has proven a post-orthogonalization condition that allows low-cost restoration of the degraded orthogonality. In this paper, we construct a checksum-generator matrix for multiple-node failures that satisfies the post-orthogonalization condition, and we prove that our code satisfies the maximum-distance separable (MDS) property with high probability. Furthermore, we consider the in-node checksum storage setting, where checksums are stored in the original nodes. We derive the minimum number of checksums required to be resilient to any $f$ failures under in-node checksum storage, and we propose an in-node systematic MDS coding strategy that achieves this lower bound. Extensive experiments validate our theories and showcase the negligible overhead of our coded computing framework for fault-tolerant QR decomposition.
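To make the checksum idea in the abstract concrete, here is a minimal sketch of the classical column-checksum approach applied to the $R$ factor. It is not the paper's construction: it does not address the orthogonality of $Q$ or the post-orthogonalization condition. It assumes each column of $R$ lives on its own node, uses a random Gaussian checksum-generator matrix (here called `g`, a hypothetical name), and is hard-wired to $f = 2$ failures so the recovery step is a 2x2 solve. All function names are illustrative.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mgs_qr(a_cols):
    """Modified Gram-Schmidt on a list of columns.
    Returns Q as a list of orthonormal columns and R as an n x n row-major matrix."""
    n = len(a_cols)
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = list(a_cols[j])
        for i in range(j):
            r[i][j] = dot(q_cols[i], v)
            v = [x - r[i][j] * y for x, y in zip(v, q_cols[i])]
        r[j][j] = dot(v, v) ** 0.5
        q_cols.append([x / r[j][j] for x in v])
    return q_cols, r

def checksum_columns(cols, g):
    """Encode: return the f checksum columns sum_j g[j][k] * cols[j]."""
    f, m = len(g[0]), len(cols[0])
    return [[sum(g[j][k] * cols[j][i] for j in range(len(cols))) for i in range(m)]
            for k in range(f)]

def recover_two_columns(r, r_chk, g, lost):
    """Rebuild two erased columns of R from two checksum columns (f = 2 only).
    Uses R g_k = r_chk[k], solved row by row with Cramer's rule."""
    l0, l1 = lost
    n = len(r)
    a, b = g[l0][0], g[l1][0]
    c, d = g[l0][1], g[l1][1]
    det = a * d - b * c  # nonzero with probability 1 for Gaussian g
    kept = [j for j in range(n) if j not in lost]
    out = [row[:] for row in r]
    for i in range(n):
        b0 = r_chk[0][i] - sum(r[i][j] * g[j][0] for j in kept)
        b1 = r_chk[1][i] - sum(r[i][j] * g[j][1] for j in kept)
        out[i][l0] = (b0 * d - b * b1) / det
        out[i][l1] = (a * b1 - c * b0) / det
    return out

random.seed(0)
m, n, f = 6, 4, 2
a_cols = [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]
g = [[random.gauss(0, 1) for _ in range(f)] for _ in range(n)]

q_cols, r = mgs_qr(a_cols)
# Checksum columns of A lie in span(Q), so projecting them onto Q yields R g_k.
c_cols = checksum_columns(a_cols, g)
r_chk = [[dot(q, c) for q in q_cols] for c in c_cols]

# Simulate the loss of the nodes holding columns 1 and 3 of R.
lost = (1, 3)
damaged = [[None if j in lost else r[i][j] for j in range(n)] for i in range(n)]
recovered = recover_two_columns(damaged, r_chk, g, lost)
max_err = max(abs(recovered[i][j] - r[i][j]) for i in range(n) for j in range(n))
print("max recovery error:", max_err)
```

Because any 2x2 submatrix of a Gaussian `g` is invertible with probability 1, any two erased columns can be rebuilt, which is the MDS-style behaviour the abstract refers to. Protecting $Q$ in the same way is harder, since checksums of orthonormal vectors are not themselves orthonormal; that is the gap the paper's post-orthogonalization condition targets.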

Citations (2)
