
Algorithm-Based Fault Tolerance (ABFT)

Updated 3 January 2026
  • Algorithm-Based Fault Tolerance (ABFT) is a technique that embeds checksums into numerical algorithms to detect and correct both transient soft errors and permanent faults.
  • It enhances reliability in high-performance and distributed computing by exploiting mathematical invariants in operations like matrix multiplication, stencils, and neural network inference.
  • ABFT extends from classical dense linear algebra to modern applications in ML and DNN accelerators, enabling fault tolerance with minimal computational overhead.

Algorithm-Based Fault Tolerance (ABFT) is a class of error-detection and -correction techniques that leverage the mathematical properties and linear invariants intrinsic to numerical algorithms. By embedding checksums or other forms of redundancy at the algorithmic level, ABFT methods enable cost-effective and scalable mitigation of both transient soft errors and permanent hardware faults, especially in high-performance, distributed, and scientific computing contexts. ABFT stands in contrast to hardware-centric methods such as ECC or system-level checkpoint/restart by achieving low-overhead, fine-grained resilience directly in the numeric computation workflow.

1. Mathematical Principles and Classical Schemes

The foundational concept of ABFT is the augmentation of linear algebra kernels—especially matrix–matrix multiplication and related dense operations—with lightweight linear checksums that propagate through computation due to the underlying algebraic structure.

Given a matrix product $C = AB$ with $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$, ABFT augments $A$ and $B$ with checksum rows/columns (typically encoded by all-ones vectors) to yield $A^*$ and $B^*$. The resulting product $C^* = A^* B^*$ contains both the desired result $C$ and checksum blocks. Error detection is performed by verifying invariants of the form $C v$ against the stored row checksums and $u^\top C$ against the stored column checksums, for all-ones vectors $u$, $v$. Single-error correction is possible under classical ABFT when the error localizes to a unique row/column intersection. In the presence of multiple errors, multi-weighted checksum schemes (Vandermonde-based encodings) enhance error localization and tolerance, but add computational overhead (Zhai et al., 2021).

The forward error model for soft error detection is given by

$$\delta_{r,i} = \text{row-check}_i - \sum_{j} C_{ij}, \qquad \delta_{c,j} = \text{col-check}_j - \sum_{i} C_{ij}$$

requiring only $O(n)$ additional cost for an $n^2$ multiply. If the single-error assumption holds, error localization is immediate; otherwise, further recovery strategies are necessary (Xue et al., 2023).
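The scheme above can be illustrated in a few lines of NumPy. The following sketch builds the full-checksum product and then uses the row/column residuals $\delta_{r,i}$, $\delta_{c,j}$ to detect and correct a single injected error; function names, tolerances, and sizes are illustrative, not taken from the cited papers.

```python
# Minimal sketch of classical full-checksum ABFT for C = A @ B (NumPy only).
# abft_gemm / detect_and_correct are illustrative names, not a published API.
import numpy as np

def abft_gemm(A, B):
    """Augment A with a checksum row and B with a checksum column, then multiply."""
    A_star = np.vstack([A, A.sum(axis=0)])                   # (m+1) x k, column checksums
    B_star = np.hstack([B, B.sum(axis=1, keepdims=True)])    # k x (n+1), row checksums
    return A_star @ B_star                                   # (m+1) x (n+1) full-checksum product

def detect_and_correct(C_star, tol=1e-8):
    """Locate and fix a single corrupted entry of the m x n data block."""
    C = C_star[:-1, :-1]
    dr = C_star[:-1, -1] - C.sum(axis=1)    # row-checksum residuals (delta_r)
    dc = C_star[-1, :-1] - C.sum(axis=0)    # column-checksum residuals (delta_c)
    bad_rows = np.flatnonzero(np.abs(dr) > tol)
    bad_cols = np.flatnonzero(np.abs(dc) > tol)
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        C[i, j] += dr[i]                    # single-error correction at the intersection
        return C, (i, j)
    return C, None

# Usage: inject one soft error and recover the true product.
rng = np.random.default_rng(0)
A, B = rng.standard_normal((4, 5)), rng.standard_normal((5, 3))
C_star = abft_gemm(A, B)
C_star[1, 2] += 7.0                         # simulated bit-flip in the data block
C_fixed, loc = detect_and_correct(C_star)
assert loc == (1, 2) and np.allclose(C_fixed, A @ B)
```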

2. ABFT in Distributed and Parallel Numerical Algorithms

ABFT schemes generalize to parallel and distributed-memory environments by appropriately partitioning and replicating checksums across nodes, processes, or tiles. In distributed dense matrix multiplication (PDGEMM) on a $\sqrt{p}\times\sqrt{p}$ process grid, additional rows and columns store blockwise checksums, with the global redundancy structure preserved through all communication and computation phases (0806.3121). A lost or corrupted block is recovered by solving a linear system derived from checksum equations across the respective row and column, typically with negligible computational and communication cost.
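For the fail-stop case with a single lost block, the checksum linear system degenerates to a subtraction: the missing block equals the column checksum minus the surviving blocks. A minimal single-node sketch of this recovery follows; the grid and block sizes are arbitrary and no actual MPI communication is modeled.

```python
# Blockwise checksum recovery in a distributed layout (fail-stop model, sketch):
# each block-column keeps an extra block equal to the sum of its data blocks,
# so any single lost block in that column is recoverable.
import numpy as np

rng = np.random.default_rng(1)
nb, grid = 3, 2                                   # block size, sqrt(p) = 2
blocks = [[rng.standard_normal((nb, nb)) for _ in range(grid)] for _ in range(grid)]

# Checksum block per block-column: sum of the data blocks in that column.
col_checksums = [sum(blocks[i][j] for i in range(grid)) for j in range(grid)]

# Simulate loss of block (0, 1) on a failed process.
lost_i, lost_j = 0, 1
original = blocks[lost_i][lost_j].copy()
blocks[lost_i][lost_j] = None

# Recovery: subtract every surviving block in the column from the checksum block.
recovered = col_checksums[lost_j].copy()
for i in range(grid):
    if i != lost_i:
        recovered -= blocks[i][lost_j]

assert np.allclose(recovered, original)
```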

Communication-avoiding algorithms such as TSQR can exploit inherent computational redundancy for ABFT purposes. In redundant TSQR (rTSQR), at each reduction-tree step all partners exchange their $R$ factors and both compute the next QR, yielding $2^s$ identical copies after step $s$ and thereby tolerating up to $2^s - 1$ process failures at that stage. Several failure-handling semantics are supported: BLANK (hole-based, abort), SHRINK (shrink the communicator), and REBUILD (process respawn). The method incurs approximately $O(\log P)$ overhead in both computation and messages compared to the baseline (Coti, 2015).

Recent schemes scale to exascale: HRBR (Hot Replacement with Background Recovery) for HPL maintains checksum columns to enable on-the-fly hot-swapping of failed processes with redundancy processes, and relies on a background, accelerated rebuild to restore used redundancy. This approach achieves time efficiency $E_\text{HRBR} > 0.88$ at $p = 10^6$, contrasted with significantly lower efficiency for checkpoint/restart at such scales (Yao et al., 2011).

3. Algorithmic Error Detection and Correction in Modern Kernels

ABFT applies beyond matrix multiplication to diverse HPC workloads. In stencil computations, one-dimensional row and column checksums are maintained per iteration, with theorems establishing that comparing the "direct" (summed over output) and "interpolated" (propagated through the stencil pattern) checksums provides complete detection and closed-form correction of single silent data corruptions (SDCs). Online ABFT applies checksums at every step, allowing immediate correction (but with finite floating-point vulnerabilities for specific bit positions). Offline variants checkpoint every $\Delta$ steps and compare checksums periodically, rolling back if discrepancies are found. The method introduces less than 8% runtime overhead in production 3D codes (Cavelan et al., 2019).
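For a concrete, simplified picture of the stencil case, consider one step of a linear 5-point stencil with periodic boundaries: the row/column checksums of the output can be predicted by applying a 1-D stencil to the row/column checksums of the input, and comparing the predicted ("interpolated") against the directly summed checksums localizes and corrects a single SDC. The weights, grid size, and injected error below are illustrative only.

```python
# Illustrative online ABFT check for one periodic 5-point stencil step (NumPy).
import numpy as np

wc, wn, ws, we, ww = 0.6, 0.1, 0.1, 0.1, 0.1        # stencil weights (made up)

def step(x):
    """One periodic 5-point stencil update."""
    return (wc * x
            + wn * np.roll(x,  1, axis=0) + ws * np.roll(x, -1, axis=0)
            + we * np.roll(x, -1, axis=1) + ww * np.roll(x,  1, axis=1))

def interpolated_checksums(x):
    """Predict the row/column sums of step(x) from the row/column sums of x."""
    col = x.sum(axis=0)                               # column checksum of the input
    row = x.sum(axis=1)                               # row checksum of the input
    col_pred = (wc + wn + ws) * col + we * np.roll(col, -1) + ww * np.roll(col, 1)
    row_pred = (wc + we + ww) * row + wn * np.roll(row, 1) + ws * np.roll(row, -1)
    return row_pred, col_pred

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 8))
row_pred, col_pred = interpolated_checksums(x)

y = step(x)
y[3, 5] += 1e-2                                       # inject a silent data corruption

# Detection and closed-form correction of the single SDC.
dr = y.sum(axis=1) - row_pred                          # direct minus interpolated (rows)
dc = y.sum(axis=0) - col_pred                          # direct minus interpolated (columns)
i, j = np.argmax(np.abs(dr)), np.argmax(np.abs(dc))
y[i, j] -= dr[i]
assert (i, j) == (3, 5) and np.allclose(y, step(x))
```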

For sparse iterative algorithms, such as preconditioned conjugate gradient (PCG), ABFT uses shifting checksums on SpMV and auxiliary vector operations, extrinsically supporting both detection and single-error correction, thus enabling forward progress without rollback unless multiple simultaneous errors occur. An abstract model guides optimal checkpoint intervals (Fasi et al., 2015).
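A minimal sketch of a checksum-protected SpMV of the kind used inside such solvers is shown below. For brevity it uses a plain (unshifted) column-sum checksum and detection only, whereas the cited scheme uses shifted checksums with correction; names, sizes, and the tolerance are illustrative.

```python
# Hedged sketch: verify y = A x via the invariant 1^T (A x) = (A^T 1)^T x.
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(3)
A = sparse_random(100, 100, density=0.05, random_state=3, format="csr")
c = np.asarray(A.sum(axis=0)).ravel()            # checksum vector A^T 1, computed once

def checked_spmv(A, x, c, tol=1e-10):
    y = A @ x
    # A violated invariant flags a corrupted SpMV; a real solver would then
    # attempt correction or roll back to a checkpoint.
    if abs(y.sum() - c @ x) > tol * (1 + abs(c @ x)):
        raise RuntimeError("SDC detected in SpMV")
    return y

x = rng.standard_normal(100)
y = checked_spmv(A, x, c)                        # clean call passes the check
```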

4. ABFT in Contemporary ML and DNN Accelerators

The emergence of DNN inference in unreliable, low-voltage, and energy-constrained environments has catalyzed ABFT development for neural workloads. Traditional ABFT on GEMM is extended to deep learning via both (i) global per-kernel checksums and (ii) thread-level ABFT, which leverages register-resident checksums in fine-grained tiles/panels in high-arithmetic-intensity kernels (e.g., Tensor Core GEMM units). Layerwise selection guided by per-layer arithmetic intensity versus hardware compute-to-memory ratio yields 1.09–5.3× reduction in overhead (Kosaian et al., 2021).

In convolutional neural networks, ABFT is realized through systematic input and output checksums, exploiting the convolution’s distributive law to support correction of single-row/column faults and full single-block location; a hierarchically integrated workflow combines minimal-cost detection (CoC-D) with layer-tuned recovery (FC, RC, ClC), maintaining an upper bound of 8% runtime overhead across major ImageNet models (Zhao et al., 2020).
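The underlying linearity argument can be demonstrated with a filter checksum: summing the filters over output channels yields a checksum filter whose response equals the channel-wise sum of the layer's outputs, so a single corrupted output element shows up in the difference. The sketch below illustrates that property only, not the CoC-D detection/recovery workflow of the cited paper, and all shapes are arbitrary.

```python
# Filter-checksum illustration for a convolution layer (valid cross-correlation).
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d(x, w):
    """x: (Cin, H, W), w: (Cout, Cin, kh, kw) -> (Cout, H', W')."""
    win = sliding_window_view(x, w.shape[-2:], axis=(1, 2))   # (Cin, H', W', kh, kw)
    return np.einsum("cijhw,ochw->oij", win, w)

rng = np.random.default_rng(4)
x = rng.standard_normal((3, 10, 10))
w = rng.standard_normal((8, 3, 3, 3))

# Checksum filter: sum of the filters over output channels. By linearity,
# conv(x, w_sum) equals the channel-wise sum of conv(x, w).
w_sum = w.sum(axis=0, keepdims=True)
check = conv2d(x, w_sum)[0]

y = conv2d(x, w)
y[2, 4, 4] += 0.5                          # inject a fault in one output map

delta = y.sum(axis=0) - check              # nonzero only at the faulty location
assert np.abs(delta).max() > 1e-6          # detection
i, j = np.unravel_index(np.abs(delta).argmax(), delta.shape)
# This checksum localizes the (row, column) of the fault; identifying the
# faulty channel would need an additional (e.g., weighted) checksum.
assert (i, j) == (4, 4)
```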

For transformer and attention workloads, classical ABFT is extended to encompass entire attention blocks, including the intervening softmax normalization. Flash-ABFT fuses ABFT checksums over the whole $QK^\top \to \mathrm{softmax} \to V$ pipeline by deriving an attention-specific online checksum. In hardware, this absorbs only a small area (5.3%) and energy (<1.9%) overhead, yet attains approximately 97% detection rates with minimal false positives (Titopoulos et al., 22 Jul 2025). FT-Transformer’s approach uses architecture-aware tensor checksums to match Tensor Core organization and fuses all checks into a single verification, achieving >90% error coverage for matrix ops and >97% for softmax, while reducing overheads and improving speed by up to $7.56\times$ versus traditional decoupled ABFT (Dai et al., 3 Apr 2025).
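A simplified numerical illustration of checksums that span the softmax stage: the softmax rows must sum to one (covering the $QK^\top$ and normalization stage), and appending a row-sum checksum column to $V$ lets the row sums of the attention output be verified within the same multiply. This is a didactic sketch of the idea, not the fused hardware checksum derived in the cited works; all sizes are arbitrary.

```python
# Checksum sketch spanning softmax(Q K^T) V.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(5)
L, d = 6, 4
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))

P = softmax(Q @ K.T / np.sqrt(d))
# Check 1: each softmax row must sum to 1 (invariant of the QK^T + softmax stage).
assert np.allclose(P.sum(axis=1), 1.0)

# Check 2: multiply by V augmented with a checksum column of row sums.
V_star = np.hstack([V, V.sum(axis=1, keepdims=True)])
O_star = P @ V_star
O, out_check = O_star[:, :-1], O_star[:, -1]
# The extra column must equal the row sums of the attention output.
assert np.allclose(O.sum(axis=1), out_check)
```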

Graph convolutional network layers benefit from GCN-ABFT, which exploits associativity in the product $SHW$ to reduce checksum cost by 21% (on the check stage) and preserve >97% error detection accuracy (Peltekis et al., 2024).
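The associativity trick can be seen directly: the column checksum $\mathbf{1}^\top (SHW)$ can be computed as $((\mathbf{1}^\top S)H)W$, i.e., by pushing a single row vector through the layer's operands instead of forming a second full product. A small NumPy sketch follows; the sizes are illustrative, and $S$, $H$, $W$ are read in the usual GCN sense as the propagation matrix, node features, and layer weights.

```python
# Associativity-based column checksum for a GCN layer Z = S H W (sketch).
import numpy as np

rng = np.random.default_rng(6)
n, f_in, f_out = 20, 16, 8
S = rng.random((n, n))                       # (normalized) adjacency / propagation matrix
H = rng.standard_normal((n, f_in))           # node features
W = rng.standard_normal((f_in, f_out))       # layer weights

Z = (S @ H) @ W                              # GCN layer output

# Checksum path: a single row vector flows through the same operands, which is
# far cheaper than forming another full triple product.
check = ((np.ones(n) @ S) @ H) @ W

assert np.allclose(Z.sum(axis=0), check)     # detection test: column sums must match
```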

5. Adaptivity, Approximation, and Statistical ABFT

Emerging research addresses the over-stringency of "exact" ABFT in DNNs, which triggers recovery for minuscule, application-insignificant deviations. ApproxABFT proposes relaxation via adaptive, empirically determined magnitude and localization thresholds (e.g., $T_\text{det}$ for overall mean-squared deviation, $T_\text{loc}$ for row/column deviation), together with granularity optimization and inter-layer sensitivity balancing. This strategy enables selective fault correction only for "critical" errors, reduces unnecessary overhead by up to 80%, and extends maximum tolerable BER by orders of magnitude in vision transformers (Xue et al., 2023).
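A thresholded check in this spirit might look like the following sketch, where the detection threshold $T_\text{det}$ is an arbitrary placeholder rather than an empirically tuned value and the interface is invented for illustration.

```python
# Illustrative "approximate" ABFT acceptance test: small checksum deviations are
# tolerated; recovery is triggered only when the mean-squared deviation exceeds T_det.
import numpy as np

def approx_check(C_star, T_det=1e-3):
    """Return True if the full-checksum result is accepted (exact or harmlessly wrong)."""
    C = C_star[:-1, :-1]
    dr = C_star[:-1, -1] - C.sum(axis=1)     # row-checksum deviations
    dc = C_star[-1, :-1] - C.sum(axis=0)     # column-checksum deviations
    mse = (np.sum(dr**2) + np.sum(dc**2)) / (dr.size + dc.size)
    return mse <= T_det                       # only "critical" errors trigger recovery

# A tiny deviation passes; a large one is flagged for correction/re-execution.
rng = np.random.default_rng(7)
A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
A_star = np.vstack([A, A.sum(axis=0)])
B_star = np.hstack([B, B.sum(axis=1, keepdims=True)])
C_star = A_star @ B_star
C_star[0, 0] += 1e-4
assert approx_check(C_star)                   # below threshold: accepted
C_star[0, 0] += 10.0
assert not approx_check(C_star)               # above threshold: recovery needed
```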

ReaLM systematizes the statistical ABFT philosophy for LLM inference accelerators. A critical region in $(\log_2 \mathrm{mag},\ \log_2 \mathrm{freq})$ space is fitted for each GEMM, so that only errors with frequency or magnitude exceeding adapted thresholds trigger re-execution. Customized error-detection circuits enable fast online accumulation of error statistics with only 1.42% area / 1.79% power overhead. Experimentally, this yields up to 35.8% energy savings and keeps perplexity degradation below 0.3 (vs. 18.54 unconstrained, and 2.0 in exact classical ABFT), demonstrating that statistical ABFT unlocks a new reliability–efficiency Pareto front (Xie et al., 31 Mar 2025).

6. Practical Considerations and Comparative Overheads

In all ABFT applications, effective deployment requires careful integration with domain-specific data layouts, hardware characteristics (e.g., memory vs. compute bound), error models (soft vs. fail-stop), and resilience goals (detection-only vs. correction, zero-tolerance vs. application-aware relaxation). Overheads for state-of-the-art ABFT implementations in HPC matrix-multiply are typically within 2–15% (ABFT in distributed dense BLAS-3 (0806.3121), Level-3 BLAS (Zhai et al., 2021), FFT (Wu et al., 2024)), with overhead per node decreasing as system scale grows due to favorable weak and strong scaling properties. Recent ML-specific ABFT schemes for large models and attention achieve application-level reliability improvements with minimal penalty, and emerging adaptivity further closes the gap with hardware-only approaches.

ABFT research has systematically expanded from matrix–matrix and vector kernels to a wide spectrum of computational patterns: Cholesky/QR, stencils, convolutions, attention blocks, graph convolutions, FFT, and even iterative solvers and checkpoint/restart workflows. Its core methodology—exploiting algebraic invariants for rapid, low-overhead error detection and, when possible, correction—remains central. Theoretical results guarantee single-error detection and correction for classical schemes, near-optimal scalability, and, for certain workloads, bounded probability of global failure even in the presence of persistent and cascading faults (Coti, 2015, Yao et al., 2011).

Limitations persist: detection coverage may degrade under dense or correlated failures, or when hardware platforms impose constraints on synchronized reduction, register pressure, or memory protection. Nevertheless, ABFT continues to be a primary enabler of reliable, high-throughput scientific and ML workloads under demanding resilience and efficiency constraints.
