
Serial assembly operations as a bottleneck for LL, RL, and MF with multithreaded BLAS

Determine whether the observed performance degradation of the left-looking (LL), right-looking (RL), and multifrontal (MF) serial supernodal sparse Cholesky factorization algorithms when using Intel’s MKL multithreaded BLAS is primarily caused by their assembly operations being executed serially, in contrast to the right-looking blocked (RLB) algorithm, which performs all floating-point work within multithreaded BLAS kernels and avoids assembly operations entirely.


Background

In the experimental evaluation on 21 large matrices with Intel’s MKL multithreaded BLAS, the right-looking blocked (RLB) method was the fastest in every case, while LL, RL, and MF incurred 20–80% longer runtimes. RLB performs no assembly operations and executes all of its floating-point work through multithreaded BLAS calls (DSYRK, DGEMM).
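For concreteness, the sketch below shows how an RLB-style trailing update reduces entirely to DSYRK/DGEMM calls, which a multithreaded BLAS such as MKL parallelizes internally. This is only a minimal illustration under assumed column-major storage; the function name, block partitioning, and dimensions are hypothetical and not taken from the paper.

```c
/*
 * Sketch of the dense kernel work in a right-looking blocked (RLB) style
 * supernodal update. All floating-point work is delegated to DSYRK/DGEMM,
 * so a multithreaded BLAS parallelizes it internally and no assembly step
 * is required. Storage layout and dimensions are assumptions for the example.
 */
#include <cblas.h>

/* Hypothetical dimensions: ns = columns in the factored supernode panel,
 * nd = rows mapped to the target's diagonal block, nr = rows mapped below it. */
void rlb_style_update(int ns, int nd, int nr,
                      const double *L1, int ldl,   /* panel rows hitting the diagonal block */
                      const double *L2,            /* panel rows hitting the off-diagonal block */
                      double *Cdiag, int ldc_diag, /* target diagonal block (lower triangle) */
                      double *Coff, int ldc_off)   /* target off-diagonal block */
{
    /* Cdiag := Cdiag - L1 * L1^T  (symmetric rank-ns update, lower triangle) */
    cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                nd, ns, -1.0, L1, ldl, 1.0, Cdiag, ldc_diag);

    /* Coff := Coff - L2 * L1^T  (general rank-ns update) */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                nr, nd, ns, -1.0, L2, ldl, L1, ldl, 1.0, Coff, ldc_off);
}
```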

By contrast, LL, RL, and MF maintain and assemble update matrices as part of their sparse computations, a process implemented serially in the presented algorithms. The authors hypothesize that this serial assembly stage limits parallel efficiency under multithreaded BLAS, explaining RLB’s consistent advantage.
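By way of contrast, a serial assembly (extend-add) step can be sketched as a scatter-add loop with indirect indexing: it performs no BLAS calls and runs on a single thread, which is the stage the conjecture identifies as the bottleneck under multithreaded BLAS. The function name, the relative index map rel[], and the dense column-major storage of the update and frontal matrices are assumptions for this illustration, not details taken from the paper.

```c
/*
 * Minimal sketch of a serial assembly (extend-add) step: the lower triangle
 * of a dense update matrix U of order n_u is scattered into a larger
 * frontal/update matrix F of order n_f via a hypothetical relative index
 * map rel[] (rel[i] gives the row/column of F that row/column i of U maps to).
 * Column-major storage is assumed for both matrices.
 */
static void serial_extend_add(int n_u, const double *U,
                              int n_f, double *F,
                              const int *rel)
{
    for (int j = 0; j < n_u; ++j) {          /* columns of the update matrix */
        int fj = rel[j];                     /* target column in F */
        for (int i = j; i < n_u; ++i) {      /* lower triangle only */
            F[(long)fj * n_f + rel[i]] += U[(long)j * n_u + i];
        }
    }
}
```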

References

We conjecture that the performance of LL, RL, and MF suffers seriously due to the fact that the assembly operations are performed serially.

Some new techniques to use in serial sparse Cholesky factorization algorithms (arXiv:2409.13090, Karsavuran et al., 19 Sep 2024), Section 3.2 (Results).