Optimal factor storage (sparse vs dense) for explicit GPU assembly in 3D FETI

For 3D finite element meshes, determine whether explicit GPU-based assembly of the local dual operator \tilde{\Lambda}_i = \tilde{B}_i K_{i,reg}^{-1} K_{i,reg}^{-\top} \tilde{B}_i^\top in the FETI solver runs faster when the Cholesky factors of K_{i,reg} are stored in sparse format (cuSPARSE TRSM using CSR/CSC) or in dense format (cuBLAS TRSM using column-major storage). Delineate the subdomain-size and sparsity regimes under which each choice is optimal.

Background

The paper studies explicit assembly of the FETI dual operator using GPUs and evaluates multiple parameter choices affecting performance, including whether to use sparse or dense storage for the triangular factors when invoking TRSM kernels. While the decision is straightforward for 2D problems (sparse factors are best), the situation for 3D meshes is more complex due to higher fill-in and differing behavior across CUDA library versions.
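
To make the fill-in argument concrete, the sketch below counts the floating-point operations of one triangular solve for both storage formats. It is a rough model under stated assumptions, not the paper's analysis: a dense solve works on the full triangle, a sparse solve works only on the stored nonzeros, and memory traffic, kernel efficiency, and library version are all ignored. The function names and parameters (`n` for the factor dimension, `nnz` for its stored nonzeros including the diagonal, `nrhs` for the number of right-hand sides) are hypothetical.

```python
def factor_density(n: int, nnz: int) -> float:
    """Fraction of the lower triangle (diagonal included) that is stored."""
    return nnz / (n * (n + 1) / 2)


def dense_trsm_flops(n: int, nrhs: int) -> int:
    """Operation count of a dense triangular solve.

    Row i needs i multiplications, i subtractions and 1 division,
    which sums to n*n operations per right-hand side.
    """
    return n * n * nrhs


def sparse_trsm_flops(n: int, nnz: int, nrhs: int) -> int:
    """Operation count of a sparse triangular solve.

    Each off-diagonal nonzero costs one multiplication and one subtraction,
    each diagonal entry costs one division: 2*(nnz - n) + n per right-hand side.
    """
    return (2 * nnz - n) * nrhs


def flop_ratio(n: int, nnz: int) -> float:
    """Dense-to-sparse operation ratio; values near 1 mean sparsity saves little."""
    return dense_trsm_flops(n, 1) / sparse_trsm_flops(n, nnz, 1)
```

In this model the ratio tends to 1 as the factor density approaches 1, so a denser 3D factor leaves the sparse format with little arithmetic advantage to offset its indirect indexing. This is consistent with, but does not explain in full, the lack of a clear winner described next.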

In legacy cuSPARSE, sparse TRSM can perform well thanks to a block algorithm, but for 3D meshes with denser factors the authors observe no clear winner between sparse and dense storage, with performance depending on subdomain size and factor sparsity. They therefore recommend benchmarking to select the storage format, indicating an unresolved decision boundary for practical configurations.
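
The recommended benchmarking step can be sketched as a small selection helper. This is a minimal illustration, not the authors' procedure: `best_time` and `pick_factor_storage` are hypothetical names, and each candidate is assumed to be a zero-argument callable that runs the full TRSM-based assembly for one storage format on a representative subdomain.

```python
import time
from typing import Callable, Dict


def best_time(run: Callable[[], None], repeats: int = 5) -> float:
    """Smallest wall-clock time of `run` over `repeats` timed calls."""
    run()  # warm-up call, not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return min(timings)


def pick_factor_storage(candidates: Dict[str, Callable[[], None]],
                        repeats: int = 5) -> str:
    """Name of the candidate with the smallest best-of-`repeats` time."""
    timings = {name: best_time(run, repeats) for name, run in candidates.items()}
    return min(timings, key=timings.get)
```

A call such as `pick_factor_storage({"sparse": run_sparse, "dense": run_dense})` would return the faster label, where `run_sparse` and `run_dense` are placeholders for the two assembly variants. Because CUDA kernel launches are asynchronous, each callable has to block until its GPU work has finished (for example by synchronizing the device or stream) before it returns, otherwise the timings measure only launch overhead.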

References

"As can be observed in the graph, for 3D meshes where the factors are denser, it is unclear which is the better option."

Homola et al., "Assembly of FETI dual operator using CUDA," arXiv:2502.08382, 12 Feb 2025. Section "Optimal parameters of the assembly," Factor storage paragraph (Results).
