Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
144 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
45 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Fast and Scalable FFT-Based GPU-Accelerated Algorithms for Hessian Actions Arising in Linear Inverse Problems Governed by Autonomous Dynamical Systems (2407.13066v1)

Published 18 Jul 2024 in math.NA and cs.NA

Abstract: We present an efficient and scalable algorithm for performing matrix-vector multiplications ("matvecs") for block Toeplitz matrices. Such matrices, which are shift-invariant with respect to their blocks, arise in the context of solving inverse problems governed by autonomous systems, and time-invariant systems in particular. In this article, we consider inverse problems that are solved for inferring unknown parameters from observational data of a linear time-invariant dynamical system given in the form of partial differential equations (PDEs). Matrix-free Newton-conjugate-gradient methods are often the gold standard for solving these inverse problems, but they require numerous actions of the Hessian on a vector. Matrix-free adjoint-based Hessian matvecs require solution of a pair of linearized forward/adjoint PDE solves per Hessian action, which may be prohibitive for large-scale inverse problems, especially when efficient low-rank approximations of the Hessian are not readily available, such as for hyperbolic PDE operators. Time invariance of the forward PDE problem leads to a block Toeplitz structure of the discretized parameter-to-observable (p2o) map defining the mapping from inputs (parameters) to outputs (observables) of the PDEs. This block Toeplitz structure enables us to exploit two key properties: (1) compact storage of the p2o map and its adjoint; and (2) efficient fast Fourier transform (FFT)-based Hessian matvecs. The proposed algorithm is mapped onto large multi-GPU clusters and achieves more than 80 percent of peak bandwidth on an NVIDIA A100 GPU. Excellent weak scaling is shown for up to 48 A100 GPUs. For the targeted problems, the implementation executes Hessian matvecs within fractions of a second, orders of magnitude faster than can be achieved by the conventional matrix-free Hessian matvecs via forward/adjoint PDE solves.

Summary

We haven't generated a summary for this paper yet.