On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal Matrix Factorizations (2108.09337v2)
Abstract: Matrix factorizations are among the most important building blocks of scientific computing. State-of-the-art libraries, however, are not communication-optimal, underutilizing current parallel architectures. We present novel algorithms for Cholesky and LU factorizations that utilize an asymptotically communication-optimal 2.5D decomposition. We first establish a theoretical framework for deriving parallel I/O lower bounds for linear algebra kernels, and then utilize its insights to derive Cholesky and LU schedules, both communicating N^3/(P*sqrt(M)) elements per processor, where M is the local memory size. The empirical results match our theoretical analysis: our implementations communicate significantly less than Intel MKL, SLATE, and the asymptotically communication-optimal CANDMC and CAPITAL libraries. Our code outperforms these state-of-the-art libraries in almost all tested scenarios, with matrix sizes ranging from 2,048 to 262,144 on up to 512 CPU nodes of the Piz Daint supercomputer, decreasing the time-to-solution by up to three times. Our code is ScaLAPACK-compatible and available as an open-source library.
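The per-processor communication volume N^3/(P*sqrt(M)) quoted in the abstract can be evaluated directly. The sketch below is illustrative only; the sample values of N, P, and M are assumptions and are not taken from the paper's experiments.

```python
import math

def comm_volume(N: int, P: int, M: int) -> float:
    """Elements communicated per processor under the N^3 / (P * sqrt(M)) bound,
    where N is the matrix dimension, P the processor count, and M the local
    memory size in elements."""
    return N**3 / (P * math.sqrt(M))

# Hypothetical example: a 65,536 x 65,536 matrix on 512 processors,
# each with room for 2^27 elements of local memory.
v = comm_volume(65536, 512, 2**27)
print(f"{v:.3e} elements per processor")
```

Note how a larger local memory M reduces communication only by a factor of sqrt(M), which is the characteristic trade-off that 2.5D decompositions exploit by replicating data across extra memory.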
- Marko Kabić (5 papers)
- Tal Ben-Nun (53 papers)
- Alexandros Nikolaos Ziogas (16 papers)
- Jens Eirik Saethre (1 paper)
- André Gaillard (1 paper)
- Timo Schneider (18 papers)
- Maciej Besta (66 papers)
- Anton Kozhevnikov (9 papers)
- Joost VandeVondele (10 papers)
- Torsten Hoefler (203 papers)
- Grzegorz Kwasniewski (15 papers)